Skip to content
All posts
Video

Run AI On Your Laptop — Zero Cloud Bills (Local Small Models)

Local small models are the right default for narrow, high-volume AI work: classification, extraction, routing, and rewrites that should not be billed forever. Run them through Ollama, MLX, or vLLM, fine-tune with LoRA when prompting drifts, and escalate only the hard cases to the cloud.

12 min read

Watch (12:13)


Overview

Local small models are the right default for narrow, high-volume AI work: classification, extraction, routing, and rewrites that should not be billed forever. Run them through Ollama, MLX, or vLLM, fine-tune with LoRA when prompting drifts, and escalate only the hard cases to the cloud.


Full transcript (from the video)

Here is a number that should bother anyone building with language models. Your cloud bill is not a fixed cost. It is a meter. Every classification, every little summary is another charge and it never stops.

The strange part is that most of that work is not hard. It is narrow. It repeats and it almost never needs the smartest model in the world. Meanwhile, the most powerful computer you own is sitting on your desk doing nothing.

This is a talk about closing that gap. We will run small models locally. Teach them your exact task. Wrap them in clean, reusable prompts and only call the cloud when the work genuinely deserves it.

The goal is not to be clever. The goal is to move the boring high volume work off the meter and onto hardware you already paid for without giving up quality and without shipping your data to a stranger. When people picture their model spend, they picture the big impressive answers like a long essay or a hard piece of code. But that is rarely where the money goes.

The money goes into the quiet work that runs constantly in the background. Sorting a support message into a category. Pulling three fields out of a documents. Deciding which tool to call next.

Rewriting a sentence so it reads cleanly. Each one is cheap on its own, almost free, but you run them thousands of times a day and you run them on a frontier model because that was the easy deep. So you are paying premium rates for workup pay. Much smaller model could do in its sleep.

The first shift in thinking is simple. Separate the rare hard request from the constant easy one. The hard request can stay in the cloud. The constant easy one is your opportunity.

Let's define the thing. A small language model, sometimes called an SLM, is just a model with far fewer parameters than the giant ones in the cloud. Today, that usually means 1 to 8 billion parameters. The giants are hundreds of billions.

That size difference is the whole story. A smaller model holds less of the world in its head, so it's weaker at open-ended reasoning and obscure trivia. But in exchange, it fits in the memory of a normal laptop. It answers in real time.

And once it's on your machine, it costs you nothing per call. The mistake is judging it by what it can't do. The right question is narrower. Can it do your one specific job well enough for a huge share of real tasks?

The honest answer is yes, especially once you've taught it. This was not really possible on a laptop a few years ago, and three things changed that. The first is unified memory. On Apple silicon, the processor and the graph cores share one big pool of memory.

A well equipped laptop can devote most of that memory to holding a model with no separate graphics card required. The second change is software. Apple's framework called MLX grew up fast and it is now the quickest way to run and train models on a Mac. The popular tools have quietly moved onto it.

The third is quantization, which is a way to compress a model's numbers, so it takes about four times less space while staying nearly as accurate. Put those together and the picture flips. The laptop is no longer a toy client that calls a real computer somewhere else. Your laptop is the data center.

The serious work can happen right where you are sitting. So, how do you actually run one of these models? You have a few good options and they suit different moments. The friendliest place to start is Alama.

You install it. You pull a model with one command and you are talking to it a minute later. It recently moved its Apple backend onto the fast path so beginners now get speed for free. When you are ready to go deeper on a Mac, reach for MLXM, which is Apple's own.

It is the quickest option on Apple hardware. And crucially, it is also where you will do your finetuning later, so it pulls double duty. Then there is VLL. That one is built for throughput.

It is what you run on a shared server when many tea requests arrive at once and you need to serve them efficiently. A simple rule of thumb, start with a llama, grow into MLX for Macwork, and move to VLLM when you are serving a crowd. Here is the detail that makes all of this practical instead of painful. Every one of those local engines can speak the same request formats that the big cloud providers use.

Almost the Mac toolkit can and the server engine can too. That sounds like a small technical footnote, but it changes how you build. It means your code does not need to know or care whether the model lives on your laptop or in a data center across the country. You point your application at one address to use the local model and at another address to use the cloud and nothing else changes.

No rewrite, no no second version of your logic. That single shared interface is what lets you mix and match with confidence. You can develop against a free local model, ship the exact same code and decide per request where the work should actually run. This is the heart of the savings and it is a pattern, not a trick.

You set the local model as the default. Every request goes there first. For the routine majority, the local model answers and that answer cost you nothing. The clever part is knowing when to reach for help.

You define a simple test for difficulty. Maybe the model reports low confidence. Maybe the input is unusually long or messy. Maybe a quick check on the output fails when that test trips and only then you escalate that single request to a frontier model in the cloud.

So the cloud becomes the specialist you consult on hard cases, not the laborer you pay for everything. done well, the vast majority of traffic stays on your hardware and your cloud bill shrinks to a thin slice of genuinely difficult requests. You did not sacrifice quality. You just stopped overpaying for the easy 90%.

Let us make the money concrete because the shape of the cost is what matters. Cloud usage is a recurring expense. It scales directly with how much you use it and it continues for as long as your product lives. Local usage has the opposite shape.

You pay once for the hardware and after that each request is effectively free. There is a but it rounds to nothing next to per request pricing. So the comparison is not really cloud versus local on a single call. It is a meter that runs forever against the fixed cost you already covered.

Once your volume is meaningful, the fixed cost wins and it wins by a wide margin. The laptop you bought for development quietly becomes the cheapest inference server you will ever operate. The break even points arrives much sooner than most teams assume, often within the first month of serious traffic. Now the honest catch.

If you download a small model and just throw your real task at it, you will probably be underwhelmed. Out of the box, it is a generalist. It knows a little about everything and nothing about your world, you can push it a long way with good prompting, and you should, but prompting has a ceiling. You can describe your formats, give example, and beg for consistency.

Yet, the model keeps drifting. Missing edge cases or answering in a slightly wrong shape. This is the moment where most people give up and conclude that small models are not ready. They are wrong.

They are just using the model as a generalist when the task calls for a specialist. The fix is not a bigger model or a longer prompt. The fix is to teach this small model your job directly so that your task stops being something it six guesses at and becomes something it knows. Fine-tuning sounds intimidating and expensive and it used to be.

The modern version is neither and the idea is simple. You show the model many examples of your task done correctly and it adjusts so that your task becomes second nature. The efficient way to do this is called low rank adaptation, usually shortened to Laura. Retraining the whole model would be huge and slow.

So instead, you freeze the original weights and train a small ad on layer that nudges the model toward your job. That add-on is tiny, often just a few megabytes, and you can keep several of them for different tasks and swap between them. Here is the payoff that surprises people. on one narrow, well-defined task.

A small model that you fine-tuned can match or beat a Frontier model that you only prompted. It runs faster and stays private, and it costs you almost nothing to operate. Best of all, it is yours. Small, sharp, and yours.

If you take one practical thing from this section, take this. Fine-tuning is mostly a data project, not a machine learning project. The model is the easy part. Now, the work is assembling good examples.

An example is just a pair. the input your system will really see and the output you genuinely want back. You collect those pairs from your actual logs, your past work or a careful handbuilt set. You do not need a mountain of them.

A few hundred clean consistent examples often outperform a million noisy because the model is learning a pattern and noise teaches it the wrong pattern. So spend your effort on consistency. Decide exactly what a correct answer looks like. Format it the same way every time and fix the contradictions.

If your examples disagree with each other, the model learns to be confused. If they agree, the model learns your job. The data set is the product, so treat it that way. Let me walk the actual loop because it is shorter than you would guess.

You install the Mac toolkit with a single package. You prepare your examples in a simple text format and point the trainer at them. Then you run the training step, which builds that small Laura adapter we talked about, while leaving the base model untouched. When it finishes, you test the adapter by generating a few outputs and checking them against your expectations.

If it looks good, you fuse the adapter back into the base model, which produces a single self-contained fine-tuned model that is ready to deploy. That whole sequence has only four steps. And the remarkable part is where this happens. Not on a rented cluster, not on someone else's hardware, on the same laptop you are using right now.

in a matter of hours for a small model. You can start a tuning run after lunch and be tier serving your own specialized model before you go home. Once you are running models for real, your prompts become assets and they deserve better than being glued inside your code as messy strings. This is where templating earns its place and a great fit is liquid.

Liquid is the templating language that powers Shopify themes and the same language Azure uses to transform messages in its automation tools. It was designed to be safe and simple so that non-engineers could edit templates without running arbitrary code. That safety is exactly what you want for prompts. You write a prompt as a template with placeholders, then fill it with your data at runtime.

A loop can drop in a handful of examples for the model to follow. A conditional can include extra context only when it is relevant, keeping the prompt short otherwise. The result is that your prompts become real reviewable files. A teammate who does not write code can read them, suggest changes, and improve your model's behavior without touching the program around it.

Now you have the pieces. A fast local model, a fine-tuned specialist, and clean templated prompts. The last job is to connect them into a workflow. And this is where orchestration tools come in.

Langchain gives you the connectors, the standard ways to talk to models, tools, and data, including your local model through that shared interface we covered earlier. Lang graph sits on top and lets you describe your process as a graph of steps with memory between them. That structure is perfect for the pattern. At the center of this talk, the small first router becomes a single clearly defined node.

It tries the local model first and it escalates to the cloud only when a check on the results says it should. It can retry on failure or reach for a tool when the task demands one. Because it is a graph, you can see it test each piece and trust what it does. Your cost-saving strategy stops being scattered logic and it becomes one diagram.

You can point at and explain. A fair talk admits its limits. So, let me draw the line clearly. Local small models are not a replacement for everything and pretending otherwise will burn you.

The tasks that need an enormous amount of context at once. The very latest coding ability for those frontier models in the cloud are still worth every cent. Use them without guilt. The point was never to ban the cloud.

The point is to stop sending it work it doesn't deserve. Local shines on the high volume, well-defined, repetitive tasks that quietly dominate your usage. Send those to your laptop. Send the genuinely hard cases to the cloud.

And while you're at it, two bonuses come for free. Your data never leaves the building, which makes a lot of privacy questions simply disappear. And your system keeps working even when the network does not. Right tool, right job.

So where do you begin? Not with a grand migration, but with one single task. Pick a narrow job that you run constantly. the kind of quiet high volume work we started with.

Install a runtime, the friendly one is fine, and try a small model on that task. Measure it honestly against what the cloud gives you. If the small model is already good enough, you just removed a line item from your bill. If it is close but not quite there, gather a few hundred examples and fine-tune it until it is.

Move the prompt into a clean template so it is easy to improve. Then wire in the small first router. So this the cloud only sees the hard cases. That is the whole method.

And you can do the first pass this week. The most powerful computer you own has been waiting for real work. Give it a job.