The Local AI Loop the Cloud Can't Afford

A small box on your desk can now run a whole family of open models at once - and do the one move the cloud quietly will not let you afford: let the models loop on their own outputs. Why raw speed is the wrong scoreboard for local AI.

16 min read



Full transcript (from the video)

Here is a shift that almost nobody is talking about correctly. A small box that sits on your desk can now run a whole team of open models at once, not one model, a family of them. One reads and writes code. One looks at images and video.

One listens to audio and talks back. People keep comparing these local setups to the cloud on raw speed, and that is the wrong scoreboard. The cloud is faster at a single sprint, but there is one move the cloud quietly will not let you make, and it is the most useful move of all. On your own machine, you can let the models look at their own output, listen to their own output, judge it against the bar, and try again until it passes: privately, for free, as many times as it takes.

That loop is the whole story, and by the end of this video, you will understand exactly why it changes what a desktop is for. Start with how most people use a cloud model today. You send a request, you pay for the request, and you get one answer back. If the answer is almost right, you can ask again, but every retry costs money and waiting time, and your data leaves your building on every single call.

So, in practice, people run one pass and accept whatever comes out. There is a second, quieter problem. A normal cloud text model is working blind. When it writes a caption for an image, it never actually looks at the image.

When it scripts a line of narration, it never actually hears how that line sounds. It is guessing from text alone. So, the cloud gives you a fast single shot from a model that cannot check its own work with its own eyes and ears. Hold that thought, because everything good about the local approach is a direct answer to those two limits.

So, what is the box? The current developer version is a compact desktop machine built on a single Grace Blackwell chip. It pairs a 20-core Arm processor with a Blackwell graphics engine, and the important number is the memory. It has 128 GB of unified memory, meaning the processor and the graphics engine share one pool instead of copying data back and forth.

That shared pool is exactly what lets you keep several different models resident at the same time. The coding model, the vision model, and the audio model can all sit in memory together, ready to hand work to each other. The chip delivers about a petaflop of AI compute. It is your hardware, and none of the work ever leaves the room. Keep that picture in mind, a single shared memory pool with a whole team of models living inside it, because that is what makes the loop we are building toward actually possible.

Meet the team, because the workflow is about getting them to cooperate. The first member is the coding model. Think of it as the hands. It plans changes, writes code, runs it, reads the result, and revises.

The second is the vision model. Think of it as the eyes. It reads screenshots, design mock-ups, and rendered frames, and it can even turn a picture of a layout into a working mock-up. The third, and the star of this video, is the Omni model.

Think of it as the ears and the voice. It can listen to audio, understand it, and speak back in real time. There is a fourth piece that is not a model at all. It is the agent framework that acts as the conductor, wiring function calls, tools, and memory together, so the models can pass work between each other.

Four roles: hands, eyes, voice, and a conductor. The magic is not any one of them. It is the loop they form together. Here is the first thing local hardware lets you do that the cloud usually does not.

Because the vision model is sitting right there in memory, you can hand it the actual output and ask it to look. Say you generated an image for a thumbnail. Instead of trusting that it came out right, you pass the rendered picture straight to the vision model and ask concrete questions. Is the subject centered?

Does it match the brief? Is there any garbled or melted text in the frame? The model answers based on what it actually sees, pixel by pixel. Compare that to a cloud text model grading its own image.

It never sees the picture. It only sees the prompt that asked for the picture. So, it is guessing about its own work. On the box, the judge has eyes, and once the judge can see, you can trust it to reject bad output instead of shipping it.
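Those concrete questions can be turned into a pass or fail verdict with very little code. The sketch below is only an illustration under stated assumptions: it supposes the vision model has been asked to answer in JSON with one yes/no value per check, and the check names, the reply format, and `verdict_from_reply` are all hypothetical. No real model API is called; the reply is a stand-in string.

```python
import json

# Hypothetical rubric: each check maps to the question put to the vision model
# and the answer that counts as a pass.
RUBRIC = {
    "subject_centered": ("Is the subject centered?", True),
    "matches_brief": ("Does it match the brief?", True),
    "garbled_text": ("Is there any garbled or melted text in the frame?", False),
}


def verdict_from_reply(reply_json: str) -> tuple[bool, list[str]]:
    """Compare the model's yes/no answers with the rubric and collect failures.

    A missing answer counts as a failure, so a vague reply cannot pass by accident.
    """
    answers = json.loads(reply_json)
    failures = [
        question
        for name, (question, expected) in RUBRIC.items()
        if answers.get(name) is not expected
    ]
    return (not failures, failures)


# Stand-in for a reply from the vision model after it looked at the rendered image.
example_reply = '{"subject_centered": true, "matches_brief": true, "garbled_text": true}'
passed, failures = verdict_from_reply(example_reply)
print(passed, failures)
```

The list of failed questions doubles as the critique that gets sent back to the generator.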

That is the first half of the loop. The second thing local hardware unlocks is listening. The Omni model does not just read a script. It can hear the audio that came out of it.

So, picture a narration step. A voice model speaks a line. Normally, you would have no idea if a word came out mangled until a human listened. Here, you hand the produced audio to the Omni model and ask it to transcribe and check.

If a word came out wrong, it catches it and you regenerate just that line. The same model can sit through more than half an hour of audio in a single pass and understand it. So, this scales to a whole video, not just one sentence. This is the listening half of the loop and it is something a text-only cloud model simply cannot do because it never hears anything.
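The transcribe-and-check step can be sketched as a word-level comparison between the script and what the listening model heard. This is a minimal illustration, not the pipeline used for the video: the transcripts here are stand-in strings rather than real model output, and `mangled_words` is a hypothetical helper built on the standard library's `difflib`.

```python
import difflib
import re


def words(text: str) -> list[str]:
    """Lowercase and split into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mangled_words(script: str, transcript: str) -> list[tuple[str, str]]:
    """Return (expected, heard) pairs wherever the transcript departs from the script."""
    expected, heard = words(script), words(transcript)
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    return [
        (" ".join(expected[i1:i2]), " ".join(heard[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


script_lines = [
    "The loop is the whole story.",
    "Quantization packs each value into four bits.",
]
# Stand-in transcripts; in practice these would come from the listening model.
transcripts = [
    "The loop is the whole story.",
    "Quantization packs each valley into four bits.",
]

# Only the lines with a mismatch need to be spoken again.
to_regenerate = [
    i for i, (s, t) in enumerate(zip(script_lines, transcripts)) if mangled_words(s, t)
]
print(to_regenerate)
```

Because the check works per line, a single bad word costs one regenerated line rather than a whole new take.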

Now, put the two halves together. A judge that can see and a judge that can hear. That combination is where the real workflow begins. Now, we reach the heart of it.

On a metered cloud service, every judge and every retry costs money and time. So, people run the loop exactly once. One generation, maybe one check, then ship. On your own hardware, the math flips completely.

Running the models again costs nothing but a little electricity and a little time. There is no invoice for the 10th attempt. So, you can afford to do something that is wildly impractical in the cloud. You can loop.

Generate a result. Let the vision model look at it and the Omni model listen to it, score it against the bar. If it falls short, send the critique back, repair, and run the judges again. Keep going until it actually passes.

You stop when the work is right, not when the budget runs out. That is the move the cloud quietly cannot afford to let you make at scale, and it is the single biggest reason to run this workflow locally. Let us make the loop concrete because it is just four simple steps. Step one, generate.

A model produces a first draft. It might be an image, a short clip, a line of narration, or a change to some code. Step two, inspect. You take that real output, hand it to the judges.

The vision model looks at the frame. The Omni model listens to the audio. Step three, judge. The output is scored against a clear bar: centered, on brief, no garbled text, no mangled words.

Step four, repair. If the work falls short, the critique goes back to the generator with specific notes on what to fix, and the loop runs again. Generate, inspect, judge, repair, and around again. Each lap gets closer to the bar, and because every lap is free and private, you can run as many as you need.
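The four steps above can be sketched as one small function. This is an illustration under assumptions, not the code behind the video: `run_loop`, `Verdict`, and the toy generator and judge are hypothetical stand-ins for real model calls, the 0 to 10 score is just one way to express the bar, and the `max_laps` cap is an added safety limit so a draft that never improves cannot spin forever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: int   # 0-10, higher is better
    notes: str   # what to fix; empty when nothing is wrong


def run_loop(
    generate: Callable[[str], str],
    judge: Callable[[str], Verdict],
    brief: str,
    bar: int = 9,
    max_laps: int = 10,
) -> tuple[str, int, bool]:
    """Generate, inspect, judge, repair. Returns (draft, laps used, passed)."""
    prompt = brief
    draft = ""
    for lap in range(1, max_laps + 1):
        draft = generate(prompt)        # step 1: generate
        verdict = judge(draft)          # steps 2-3: inspect the real output and score it
        if verdict.score >= bar:
            return draft, lap, True
        # step 4: repair, by feeding the critique back to the generator
        prompt = f"{brief}\nPrevious attempt: {draft}\nFix: {verdict.notes}"
    return draft, max_laps, False


# Toy stand-ins: each new draft scores one point higher than the last.
calls = {"n": 0}


def toy_generate(prompt: str) -> str:
    calls["n"] += 1
    return f"draft v{calls['n']}"


def toy_judge(draft: str) -> Verdict:
    version = int(draft.rsplit("v", 1)[1])
    score = min(10, 5 + version)
    return Verdict(score, "" if score >= 9 else "one word slurred")


result, laps, passed = run_loop(toy_generate, toy_judge, brief="Narrate line 12")
print(result, laps, passed)
```

With these toy stand-ins the first draft scores six and the fourth clears the bar of nine, so the loop stops after four laps.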

This little wheel is the entire workflow. Everything else in this video is about making each step fast enough that the wheel spins quickly. Here is a real example, and it is the one running behind this very video. The voice you are hearing is made by a voice model on an Apple silicon laptop.

But the laptop does not get to grade its own work. Each spoken line is sent across to the desktop machine where the Omni model listens to the actual audio and scores it. Say a line comes back at six out of 10 with a note that one word slurred. That score and that note go back to the laptop, and the laptop speaks the line again.

The desktop listens a second time, and the score climbs. The loop keeps going until a line clears a high bar, say nine out of 10. Only then is the take accepted. One machine generates and the other judges, and they pass the work back and forth over your own network.

There is no cloud and no meter, and nothing ever leaves the two devices on your desk. The star model deserves a closer look because its design is what makes real-time voice possible. The Omni model is built in two cooperating halves, and the team calls them the thinker and the talker. The thinker is the reasoning half.

It understands the incoming text, image, and audio, and works out what to say in plain words. The talker is the speaking half. It takes that meaning and streams it out as natural speech, generating the audio piece by piece, so it can start talking almost immediately instead of waiting for a whole sentence. Splitting the job this way has a quiet benefit.

Because the reasoning happens in text first, you can insert tools, retrieval, or a safety check in between the thinking and the speaking. The model can decide what to say, let your code adjust or approve it, and only then speak. That is a thoughtful design for a real assistant, not just a demo, and it is why this model anchors the workflow. So, who actually runs the loop?

Not the models themselves. The conductor is an agent framework, a thin layer of code that knows how to call each model, pass tools, hold memory, and pull in reference material. Look at the shape on screen. The coding model generates a first draft.

Then a simple loop begins. The vision model judges the frame. The Omni model judges the audio. If both pass, the loop breaks and the work is accepted.

If either fails, their notes are combined into a single critique and handed back to the coding model to repair, and the loop runs again. That is it. The intelligence to see and to hear lies in the models, but the discipline to keep trying until the bar is met lies here in a few lines of ordinary control flow. This is the part people miss.
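The control flow described above, two judges, a break when both pass, and a combined critique when either fails, might look roughly like this. It is a sketch and not the code shown on screen: `conduct`, `Review`, and the toy judges and repair function are hypothetical stand-ins for the real model calls, and the `max_laps` cap is an assumption added so the loop always terminates.

```python
from typing import Callable, NamedTuple


class Review(NamedTuple):
    passed: bool
    notes: str


Judge = Callable[[dict], Review]


def conduct(
    draft: dict,
    repair: Callable[[dict, str], dict],
    judges: dict[str, Judge],
    max_laps: int = 20,
) -> tuple[dict, bool]:
    """Run every judge on the draft; accept only when all of them pass."""
    for _ in range(max_laps):
        reviews = {name: judge(draft) for name, judge in judges.items()}
        if all(r.passed for r in reviews.values()):
            return draft, True              # every judge passed: accept the work
        critique = "\n".join(
            f"[{name}] {r.notes}" for name, r in reviews.items() if not r.passed
        )
        draft = repair(draft, critique)     # hand the combined notes back to the generator
    return draft, False


# Toy stand-ins for the vision model, the Omni model, and the coding model.
def vision_judge(draft: dict) -> Review:
    ok = "garbled" not in draft["frame"]
    return Review(ok, "" if ok else "garbled text in the frame")


def audio_judge(draft: dict) -> Review:
    ok = "slurred" not in draft["audio"]
    return Review(ok, "" if ok else "one word slurred")


def toy_repair(draft: dict, critique: str) -> dict:
    fixed = dict(draft)
    if "[vision]" in critique:
        fixed["frame"] = "clean title card"
    if "[audio]" in critique:
        fixed["audio"] = "clean narration"
    return fixed


first_draft = {"frame": "title card with garbled text", "audio": "narration, one slurred word"}
final, accepted = conduct(first_draft, toy_repair, {"vision": vision_judge, "audio": audio_judge})
print(accepted, final)
```

Tagging each note with the judge that produced it lets the generator tell a visual problem from an audio one when both arrive in the same critique.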

The loop is not a fancy feature of any single model. It is a pattern you assemble, and local hardware is what makes running it over and over actually affordable. The loop is not just for pictures and sound. The exact same wheel fixes code, and it shows how general the pattern is.

Here the coding model writes a change and then actually runs it. The output of that run, whether a passing test, a failing test, or a stack trace, is the judge. There is no need for a separate critic because reality is the critic. If the tests pass, the loop stops.

If they fail, the error text becomes the critique, and the model reads it, forms a theory, edits the code, and runs it again. Generate, observe, judge. It is the identical shape we just used for media with the test runner playing the role of the eyes and the ears. And again, because this is all local, the model can grind through 10 or 20 attempts on a stubborn bug without you ever watching a meter tick.
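The code version of the loop can be sketched with the test run itself acting as the judge. This is a minimal illustration under assumptions: `toy_coder` is a hypothetical stand-in for the coding model, the one-line test is invented for the example, and each attempt runs in a fresh interpreter through `subprocess` so that its error text can be captured and handed back as the critique.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

# Invented test for the example: the candidate must define a working add().
TEST_CODE = "assert add(2, 3) == 5, 'add(2, 3) should be 5'\n"


def run_tests(candidate: str) -> tuple[bool, str]:
    """Run the candidate plus its test in a fresh interpreter; return (passed, error text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "attempt.py"
        path.write_text(candidate + "\n" + TEST_CODE)
        proc = subprocess.run(
            [sys.executable, str(path)], capture_output=True, text=True, timeout=30
        )
    return proc.returncode == 0, proc.stderr


def fix_until_green(write_code: Callable[[str], str], max_attempts: int = 20) -> tuple[str, int]:
    """Ask for code, run it, and feed any error text back until the test passes."""
    critique = ""
    for attempt in range(1, max_attempts + 1):
        candidate = write_code(critique)
        passed, error_text = run_tests(candidate)
        if passed:
            return candidate, attempt
        critique = error_text               # the stack trace becomes the critique
    raise RuntimeError(f"still failing after {max_attempts} attempts:\n{critique}")


def toy_coder(critique: str) -> str:
    """Stand-in for the coding model: gets it wrong first, then reads the error."""
    if "should be 5" in critique:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"


code, attempts = fix_until_green(toy_coder)
print(attempts)
```

Running each attempt in a separate process also keeps a crashing or hanging candidate from taking the loop down with it.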

The loop is the product. The models are just the parts that fill it in. Let us talk honest performance because the loop only works if each lap is quick. On this desktop chip, a medium open model serves a single user at roughly 30 tokens per second.

That is comfortable reading speed, fine for a person waiting on an answer. The more interesting number shows up under load. When you fire many requests at once, the way an automated loop does, total throughput climbs to over 150 tokens per second because the hardware is sharing its memory bandwidth across all of them. For the voice model, the key fact is even simpler.

It generates speech faster than real time, so the listening step never becomes the bottleneck. The take is produced and judged quicker than it would take to play it. None of these are data center numbers and they are not trying to be. They are exactly fast enough to keep the generate, judge, repair wheel spinning at a pace that feels productive on a machine that fits on your desk.
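A little arithmetic shows why the under-load number matters more than the single-user number for a loop. The two rates below come from the figures quoted above; the workload, 12 judge calls of 200 tokens each, is a hypothetical lap chosen only for illustration. The estimate also assumes the aggregate rate is actually reached at that level of concurrency and ignores prompt processing time.

```python
SINGLE_STREAM_TPS = 30   # from the text: one user, roughly 30 tokens per second
AGGREGATE_TPS = 150      # from the text: total throughput under concurrent load


def seconds_sequential(requests: int, tokens_each: int) -> float:
    """Time to run the requests one after another at single-stream speed."""
    return requests * tokens_each / SINGLE_STREAM_TPS


def seconds_concurrent(requests: int, tokens_each: int) -> float:
    """Time to run the same requests together at the aggregate rate."""
    return requests * tokens_each / AGGREGATE_TPS


requests, tokens_each = 12, 200   # hypothetical lap: 12 judge calls, 200 tokens each
print(seconds_sequential(requests, tokens_each))  # 80.0
print(seconds_concurrent(requests, tokens_each))  # 16.0
```

Under these assumptions, firing the judge calls together turns an 80-second lap into a 16-second one, even though each individual request streams more slowly than it would alone.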

You might be wondering how a desktop runs models this capable at all. The trick is an architecture called mixture of experts. Instead of running the entire model on every word, the model is split into many specialist sections and a router lights up only the few that each token actually needs. So, a model can be enormous in total knowledge while only doing a small amount of work per token.

One of the leading open coding models is a perfect example. It carries 80 billion parameters of total capacity, but activates only 3 billion of them on any given token. You get the judgment and breadth of a very large model, but it runs at the cost and speed of a small one. That is the secret that lets a whole family of capable models share a single desktop and still respond quickly.
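The routing idea can be shown with a toy layer. This is a sketch of one common top-k routing variant, not the router of any particular model: the eight scalar "experts", the router scores, and `k = 2` are made up for the example, and a real model would work on vectors with learned weights. The last lines simply restate the 3 billion of 80 billion figure from the text as a fraction.

```python
import math
from typing import Callable, Sequence


def route(router_logits: Sequence[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(
    x: float,
    experts: Sequence[Callable[[float], float]],
    router_logits: Sequence[float],
    k: int = 2,
) -> float:
    """Run only the selected experts and blend their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in route(router_logits, k))


# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: x * scale for i in range(8)]
# Made-up router scores for one token; experts 1 and 4 score highest.
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]

chosen = route(logits, k=2)
output = moe_layer(1.0, experts, logits, k=2)
print(chosen, output)

# Figures from the text: 3 billion active parameters out of 80 billion in total.
active_fraction = 3 / 80
print(f"{active_fraction:.2%} of parameters active per token")
```

Only two of the eight toy experts run for this token; the other six cost nothing, which is the whole point of the design.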

Big brain, small footprint per step. Without this design, the loop simply would not fit or run fast enough on hardware this size. There is one more piece that makes everything fit and it is called quantization. A model is normally stored with fairly heavy numbers for every parameter.

Quantization rewrites those numbers in a much smaller form. A format tuned for this generation of hardware, called NVFP4, packs each value into just four bits. The effect is dramatic. A model shrinks by around 70% while the quality stays very close to the original because the format was designed carefully and the chip understands it natively.

Why does this matter for our loop? Because the smaller each model is, the more of the team you can keep loaded in that shared pool at the same time with headroom left over for the work itself. The coding model, the vision model, and the voice model can coexist, and you still have memory for the images and audio they are judging. Quantization is the quiet enabler that turns 128 GB from tight into comfortable.
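A back-of-the-envelope calculation makes the memory effect concrete. The numbers here are assumptions for illustration: the baseline is taken to be 16 bits per weight, and the 4-bit format is assumed to carry one 8-bit scale per block of 16 values, which works out to about 4.5 bits per weight. The estimate covers weights only and leaves out activations, caches, and any layers kept at higher precision.

```python
GB = 1e9  # decimal gigabytes


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params * bits_per_weight / 8 / GB


PARAMS = 80e9          # the 80-billion-parameter model mentioned in the text
BASELINE_BITS = 16     # assumption: weights start out as 16-bit values
# Assumption: 4-bit values plus one 8-bit scale shared by each block of 16 values.
EFFECTIVE_4BIT = 4 + 8 / 16
POOL_GB = 128          # unified memory on the desktop box

baseline = weight_gb(PARAMS, BASELINE_BITS)
quantized = weight_gb(PARAMS, EFFECTIVE_4BIT)
reduction = 1 - quantized / baseline

print(f"16-bit: {baseline:.0f} GB, fits in pool: {baseline <= POOL_GB}")
print(f"4-bit:  {quantized:.0f} GB, headroom: {POOL_GB - quantized:.0f} GB")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the 16-bit weights would not fit in the pool at all, while the 4-bit version leaves most of it free for the other models and the media they are judging. The reduction lands close to the 70% figure quoted above.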

Let us be clear about what actually fits, so the picture is honest. The desktop box comfortably runs reasoning and agent models up to around 100 billion parameters, which covers the whole family we have been talking about with room to spare. If you need more, two of these units can be linked together over a fast connection to handle models in the 400 billion range. And if you truly need the giants, the ones reaching into the trillions of parameters, there is a larger sibling machine with far more memory built for exactly that.

The point is not that this small box runs everything. The point is that it runs the family you need for this loop today on your desk, and there is a clean path upward when a project genuinely outgrows it. For the see, hear, and judge workflow, the desktop tier is already more than enough, and most people will never need to leave it. I want to be straight with you about the rough edges, because this is the part that glossy demos skip.

Running the very newest models on this hardware is not always a one-click affair yet. The stock software container often lags behind the latest releases, so you frequently need a nightly build to get current model support. Some of the audio pieces, in particular, have to be compiled from source. And on a machine with shared memory, you have to limit how many build jobs run at once, or you will run out of room partway through the build.

And because the chip uses an ARM processor, rather than the usual desktop architecture, a handful of packages need small patches before they will install at all. None of this is a deal breaker, and it gets better every month, but it is a real tax in time and patience up front. Budget an afternoon for setup, not 5 minutes. After that, the loop just runs.

Here is why the timing of this video matters. Until now, everything we described lived on a developer desktop, but just before this video, a new chapter was announced. The same class of chip is coming to mainstream thin and light Windows laptops branded for everyday machines and shipping this fall from the major laptop makers. We are talking about a petaflop of AI compute and up to 128 GB of unified memory in a laptop you could carry to a coffee shop.

Think about what that means for the loop. The private workflow we have been building, where models see, hear, and judge their own work until it is right, stops being an exotic thing for people with special hardware. It becomes a normal capability of a normal laptop. The moment local machines this capable are everywhere, running models in the cloud for everyday work starts to look like the exception, not the default. That shift is just getting started.

Let us keep this honest in both directions because claiming too much would be a disservice. The cloud still wins at real things. It hosts the single largest frontier models, the ones too big to fit on any desk. For one heavy request against one of those giants, a data center will still answer faster than your laptop can.

If raw peak power on the biggest possible model is what you need, the cloud is the right tool and that is fine. The local edge is simply a different edge. It is privacy because nothing leaves the machine. It is no metering because you can loop as many times as the work demands without watching a bill.

And it is judgement with real eyes and ears because the models can see and hear their own output. So, the takeaway is not that local replaces the cloud. It is that for an iterative multimodal workflow that keeps your data private, local now wins exactly where it counts. So, here is the whole idea in one breath.

The reason to run this family of models on your own machine is not that it is faster than the cloud, because for a single sprint, it often is not. The reason is the loop. Because the models can see and hear their own output, and because running them locally is private and unmetered, you can generate, judge, repair, and try again until the work is genuinely right instead of accepting the first shot a meter let you afford. That is a different way of working and it produces better results. The hardware to do it already sits on developer desks today and the very same capability is about to land in ordinary laptops.

So, the move is simple. Stop thinking in single shots and start designing your work as a loop that checks itself with its own eyes and ears, then let it run until it passes. Your desk can do that now. Soon, so can F