
What Actually Fits on 128 GB (Quantization Explained)

Quantization is the single most important lever for fitting large models onto hardware you actually own. This post covers what a number really is inside a model, how quantization shrinks it, and what genuinely fits in 128 GB of unified memory.

13 min read

Watch (15:47)



Full transcript (from the video)

A year ago, running a frontier-scale language model meant a rack of data center accelerators. Today, with a single quiet box on your desk and 128 GB of unified memory, you can load a model with more than 100 billion parameters and chat with it locally. But whether a model loads at all, or runs fast enough to matter, comes down to one technique. It's called quantization, and it is the single most important lever you have for fitting large models onto hardware you actually own.

In this video, we build it up from the ground: what a number really is inside a model, how quantization shrinks it, why the naive version breaks, and exactly which models fit on a 128 GB machine. Let's start with the problem.

The math is almost embarrassingly simple. Take the number of parameters, multiply by the bytes you spend on each one. That's it. At full precision, every weight costs four bytes.

Half precision, the common default, cuts that to two. So, a 7 billion parameter model at half precision lands around 14 GB. Already enough to strain a 24 GB graphics card once you add room for the running context. Step up to 70 billion parameters and you're looking at roughly 140 GB, far beyond any single consumer card.

The largest open models would need hundreds more. Full precision is the honest baseline, but it's almost never what you actually run. The whole game is spending fewer bytes on each weight without breaking the model. To see how that's possible, we need to look at what a single weight really is.
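The arithmetic above can be written down directly. This is a minimal sketch that counts only the weights; the function name is made up for illustration, and a gigabyte is taken as 10^9 bytes.

```python
def model_size_gb(num_params: float, bytes_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the running context and any runtime overhead.
    """
    return num_params * bytes_per_weight / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for label, nbytes in [("full precision", 4), ("half precision", 2)]:
        print(f"{name} at {label}: {model_size_gb(params, nbytes):.0f} GB")
```

The 7 billion and 70 billion rows at half precision reproduce the 14 GB and 140 GB figures quoted above.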

Every weight is a single number. How that number is stored decides everything that follows. The standard full precision format uses 32 bits split into three parts: a sign bit, then eight exponent bits, then 23 bits for the fraction.

Here is the key idea. The exponent bits buy you range: how large or how tiny a value can be. The fraction bits buy you precision: how finely you can tell two nearby values apart. When you drop to a 16-bit format, you have to give something up.

The older half precision format keeps more precision but loses range, which can make training unstable. The format that won is called brain float, or BF16. It keeps the full range of the 32-bit format but trades away precision. It holds only seven fraction bits.
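To make the bit layout concrete, here is a small sketch using only Python's standard library. It splits a 32-bit float into its three fields, then imitates the 16-bit brain float format by keeping the top 16 bits. Plain truncation is a simplifying assumption; real converters usually round to nearest.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, fraction) bit fields of x stored as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


def to_bf16_and_back(x: float) -> float:
    """Keep the sign, all 8 exponent bits and the top 7 fraction bits; zero the rest."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    truncated = bits & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


print(float32_fields(1.0))        # (0, 127, 0)
print(to_bf16_and_back(3.14159))  # 3.140625
print(to_bf16_and_back(1e38))     # still a finite number: the range survives
```

The second line shows the precision that is lost (3.14159 becomes 3.140625), while the third shows that very large values remain representable because the exponent field is untouched.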

That trade, range against precision, is the seed of the whole quantization story. Now let us push the bits down on purpose. Quantization is, at its heart, walking down a ladder of formats, spending fewer bytes at each rung. Full precision costs four bytes per weight.

16-bit formats cost two. 8-bit formats, whether floating point or integer, cost a single byte, and the aggressive 4-bit formats cost half a byte: you literally pack two weights into the space of one. Each step down roughly halves the memory the model needs.

Going from the 16-bit baseline all the way to four bits makes the model four times smaller. That 7 billion model that needed 14 GB now fits in around 3.5 GB. The 70 billion model that needed 140 GB drops to about 35 GB on paper, and lands in the low 40s in practice once per-block scales and a few higher-precision layers are counted. This is the lever.
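Packing two weights into the space of one can be shown in a few lines. This sketch assumes the 4-bit codes are already unsigned integers from 0 to 15 and stores the first of each pair in the low half of the byte. That ordering is an arbitrary choice for illustration, not a description of any particular file format.

```python
def pack_nibbles(values: list[int]) -> bytes:
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("each value must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_nibbles(packed: bytes, count: int) -> list[int]:
    """Reverse pack_nibbles, dropping any padding beyond count."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)
        values.append(byte >> 4)
    return values[:count]


codes = [3, 15, 0, 9, 7, 12]
packed = pack_nibbles(codes)
print(len(codes), "weights in", len(packed), "bytes")
```

Six codes occupy three bytes, and unpacking returns the original list unchanged.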

The only question, and it's the whole rest of the video, is how you take those bits away without the model falling apart. So, how do you turn a precise number into a tiny one? Here's the trick. You divide the real value by a step size called the scale.

Round to the nearest whole number and add a small integer offset. That's it. What you actually store is just that small integer. To use the weight again, you run the map backward.

Subtract the offset, multiply by the scale, and you get something close to the original. Close. Not exact.

The difference between the original and the reconstructed value is the quantization error, and minimizing that error across billions of weights is the entire craft. Pick the scale and offset well and the error is tiny, harmless. Pick them badly and the model degrades, which raises the obvious question: how do you choose them, and how finely?
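The forward and backward maps described above fit in a few lines. This is a minimal sketch, not any library's implementation: the function names are hypothetical, the scale and offset are chosen with a simple min/max rule, and the backward map subtracts the offset before multiplying by the scale so that it exactly undoes the forward map.

```python
def choose_params(values, bits=8):
    """Pick scale and offset so that [min, max] of the values maps onto 0 .. 2**bits - 1."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 if all values are equal
    offset = round(-lo / scale)
    return scale, offset


def quantize(x, scale, offset, bits=8):
    """Divide by the scale, round, add the offset, and clamp to the integer range."""
    q = round(x / scale) + offset
    return max(0, min(2**bits - 1, q))


def dequantize(q, scale, offset):
    """Run the map backward: remove the offset, then multiply by the scale."""
    return (q - offset) * scale


weights = [-0.62, -0.10, 0.03, 0.27, 0.81]
scale, offset = choose_params(weights)
for w in weights:
    q = quantize(w, scale, offset)
    w_hat = dequantize(q, scale, offset)
    print(f"{w:+.4f} -> {q:3d} -> {w_hat:+.4f}   error {abs(w - w_hat):.5f}")
```

Every reconstructed value lands within half a step of the original, which is the quantization error the text refers to.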

There are really only a few knobs. The first is whether your integer grid is centered on zero. Weights tend to sit symmetrically around zero, so you can fix the offset at zero and save an operation. That is symmetric quantization. Activations after some layers are not, so you shift the grid to match them. That is asymmetric.

The second knob, and the more important one, is how finely you slice. The crudest choice is one scale for an entire weight matrix. It is cheap, but a single scale has to cover every value in millions of weights, so it wastes precision.

The fix is to slice the matrix into small blocks and give each block its own scale. That costs a few percent in overhead and buys a large jump in accuracy. Block-level scaling is the standard for every serious 4-bit method. But even careful slicing runs into one nasty problem. Here is the villain of the story: outliers.

Once a model grows past a few billion parameters, a tiny handful of dimensions start producing values of enormous magnitude, far larger than everything around them. They are rare, only about one feature in a thousand, but they are not noise. Remove them and the model's quality collapses. Remember that a scale has to stretch to cover the largest value in its group.

When one value is 100 times bigger than its neighbors, the scale has to balloon just to reach it. Now every ordinary weight is rounded with almost none of the resolution it needs. A single monster value quietly destroys the precision of all the others. Naive uniform quantization has no answer for this.
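A small experiment shows both effects at once: one outlier ruining a single shared scale, and block-level scales confining the damage to one block. The data is synthetic (Gaussian weights with one value planted by hand), and the block size of 32 is an assumption for illustration.

```python
import random


def quantize_symmetric(values, bits=4):
    """Symmetric quantization with one scale; returns the reconstructed values."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_blockwise(values, block_size=32, bits=4):
    """Same quantizer, but each block of weights gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_symmetric(values[start:start + block_size], bits))
    return out


def mean_abs_error(original, reconstructed):
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[500] = 2.0  # one outlier, roughly 100 times larger than its neighbors

one_scale = mean_abs_error(weights, quantize_symmetric(weights))
per_block = mean_abs_error(weights, quantize_blockwise(weights))
print(f"one scale for everything: {one_scale:.5f}")
print(f"one scale per 32 weights: {per_block:.5f}")
```

With a single scale, almost every ordinary weight rounds to zero. With per-block scales, only the block that contains the outlier pays that price.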

The clever methods are all, at heart, different ways of taming these outliers. Picture this: activations flowing through a layer, visualized as a grid, with brightness standing in for magnitude.

Almost the whole grid is calm. Small values, gently varying, peaceful. But a few columns are blinding, towering over everything else. Those are the outlier dimensions.

Now try covering that entire grid with one brightness scale. The scale has to reach those brightest cells, which means every calm value gets squashed into the bottom of the range, indistinguishable from one another. That is exactly what a uniform quantizer does. Once you see that picture, every method that follows clicks into place.

Each one is a different strategy for handling those few searing columns without sacrificing the calm majority. Let's look at the four that matter. Four families, one enemy, each one built around a single clever idea. The first refuses to compress the troublemakers at all.

It keeps rare outlier dimensions in high precision, quantizes everything else to eight bits, then stitches the two paths back together. The second walks the weights and asks how sensitive each one is. It adjusts the surviving values to cancel out the error introduced, fitting each small group as tightly as possible. The third asks a different question.

Which weights do the activations actually lean on? It finds the roughly 1% that matter most, then protects them while compressing the rest. The fourth changes the shape of the number system itself, placing its 16 available levels closer together near zero, where most weights live. Different philosophies, same enemy.
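The first family's idea, keeping the outlier dimensions apart, can be sketched on toy data. This is a simplified illustration under stated assumptions and not any published method: only the activations are quantized (the weights stay in floating point), the threshold of 6.0 is arbitrary, and the outliers are planted by hand.

```python
import math
import random


def int8_roundtrip(values):
    """Symmetric 8-bit quantization with one scale, returned already dequantized."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) * scale for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def split_outliers(activations, threshold=6.0):
    """Return (regular, outlier) copies of the vector, each with the other part zeroed."""
    regular = [a if abs(a) < threshold else 0.0 for a in activations]
    outlier = [a if abs(a) >= threshold else 0.0 for a in activations]
    return regular, outlier


def rms_error(approx, exact):
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)) / len(exact))


random.seed(1)
acts = [random.gauss(0.0, 1.0) for _ in range(512)]
acts[7], acts[300] = 60.0, -45.0  # two outlier dimensions
W = [[random.gauss(0.0, 0.05) for _ in range(512)] for _ in range(64)]
exact = matvec(W, acts)

# Path A: every activation goes through one 8-bit scale.
naive = matvec(W, int8_roundtrip(acts))

# Path B: regular dimensions in 8 bits, outlier dimensions untouched, results summed.
regular, outlier = split_outliers(acts)
mixed = [lo + hi for lo, hi in zip(matvec(W, int8_roundtrip(regular)), matvec(W, outlier))]

print(f"RMS error, everything in 8 bits: {rms_error(naive, exact):.5f}")
print(f"RMS error, outliers kept apart : {rms_error(mixed, exact):.5f}")
```

Once the two large dimensions are removed from the 8-bit path, its scale only has to cover the calm values, so the stitched result tracks the exact one far more closely.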

Which one you reach for depends mostly on where you run. Think of it as a ladder, and you're picking a rung. At the top sits the 8-bit level, effectively indistinguishable from the original, but it barely saves you anything. Drop to 6-bit or 5-bit and the quality is still near perfect for almost any use.

Then comes the rung most people actually land on: 4-bit. It keeps roughly 97 to 99% of the original quality and cuts the size by about three quarters. Below that, the curve bends sharply. 3-bit shows real, noticeable degradation.

2-bit is an emergency measure for when a model simply won't fit any other way. The shape of that curve is the key insight. Flat for a long time, then a cliff. That's why 4-bit is the default.

Let's see why it works. It is not actually four bits everywhere. That is the secret. Inside a single file, it mixes precision, storing certain attention and feed-forward weights at six bits while keeping everything else at four.

The method knows which weights each layer is sensitive to, and it spends the extra bits exactly there, and the bare minimum everywhere else. The result holds on to almost all of that quality for almost none of the extra cost. Same lesson as groupwise scaling, taken one level higher.
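The budget idea can be put into numbers. The figures below are illustrative assumptions, not taken from any real file format: a 16-bit scale shared by each block of 64 weights, and a quarter of the weights held at six bits.

```python
def bits_per_weight(code_bits: int, block_size: int, scale_bits: int = 16) -> float:
    """Stored bits per weight when each block of weights shares one scale."""
    return code_bits + scale_bits / block_size


def mixed_bits_per_weight(share_high: float, high_bpw: float, low_bpw: float) -> float:
    """Average bits per weight when a share of the weights uses the higher precision."""
    return share_high * high_bpw + (1.0 - share_high) * low_bpw


four = bits_per_weight(4, block_size=64)   # 4.25: the block scale adds a quarter bit
six = bits_per_weight(6, block_size=64)    # 6.25
average = mixed_bits_per_weight(0.25, six, four)

print(f"average: {average:.2f} bits per weight")
print(f"70B model: {70e9 * average / 8 / 1e9:.1f} GB")
```

Under these assumptions a nominally 4-bit file averages 4.75 bits per weight, which puts a 70 billion parameter model at a little over 41 GB instead of the 35 GB that pure 4-bit arithmetic would suggest.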

Precision is a budget. Spend it where it counts. But one more thing, the same bit budget can ship in very different file formats. And the format you pick decides where a model can actually run.

They don't just differ in cleverness. They live in different worlds, and matching the method to your goal saves a lot of pain. The single-file local format is the friendliest.

One file runs across processor and graphics card together, and it's what the popular desktop runners use out of the box. The server-oriented formats are built for a different job: squeezing maximum throughput out of a graphics card when you're serving many requests at once. Among those, the activation-aware approach is popular because its algorithm is simple and it does not overfit to its calibration data, often matching the more elaborate option. And the distribution-shaped 4-bit format has a special role.

It's the one designed for fine-tuning a big model cheaply and loading it quickly. So, running locally? Reach for the single-file format. Serving on a card? Reach for the server formats. Now, for the moment of truth.

This is the question that started us off. You have roughly 115 GB of usable memory once the system and running context take their share. So, what fits? A small 8 billion model is trivial: full precision, room to spare.

The 70 billion model that couldn't touch any consumer card at full precision drops to the low 40s at 4 bits and runs comfortably, leaving most of your memory free. Then it gets interesting. A 235 billion mixture model won't fit at four bits. It needs around 140 gigabytes.

But drop it to the 2-bit and 3-bit rungs and it slides in: comfortably at two bits, right up against the ceiling at three. The very largest, the 600-billion-class models, simply never fit. Even flattened to two bits, they're well over 200 GB. The lever has limits, but they are astonishingly high.
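The same check can be run for any model and rung. The 115 GB budget comes from the text above. The effective bits per weight for each rung are illustrative assumptions, set a little above the nominal bit count to allow for block scales and mixed precision, and chosen so the results line up with the figures quoted here.

```python
USABLE_GB = 115  # usable memory after the system and running context take their share

# Assumed effective bits per weight for each rung (illustrative, not measured).
EFFECTIVE_BPW = {"32-bit": 32.0, "4-bit": 4.8, "3-bit": 3.9, "2-bit": 2.7}


def size_gb(num_params, rung):
    """Approximate weight size in GB (1 GB = 1e9 bytes) at the given rung."""
    return num_params * EFFECTIVE_BPW[rung] / 8 / 1e9


def fits(num_params, rung, budget_gb=USABLE_GB):
    return size_gb(num_params, rung) <= budget_gb


models = {"8B": 8e9, "70B": 70e9, "235B": 235e9, "600B": 600e9}
for name, params in models.items():
    for rung in EFFECTIVE_BPW:
        verdict = "fits" if fits(params, rung) else "too big"
        print(f"{name:>4} at {rung:>6}: {size_gb(params, rung):6.0f} GB  {verdict}")
```

With these assumptions the 235 billion model misses at the 4-bit rung, just squeezes in at 3-bit, and has room at 2-bit, while the 600 billion model stays out of reach at every rung.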

One model breaks the pattern entirely. A recent open model around 117 billion parameters doesn't ship at full precision at all. At 8 bits it would land at roughly 124 GB. That's over the line.

It won't load. So instead, it ships natively in 4-bit at around 63 GB, with about 50 GB to spare. Sit with that. The 4-bit isn't a compromise you apply after the fact.

It's what the creators chose as the release format. Quantization has crossed over from an afterthought to the native form of the model itself. And because it's a mixture-of-experts design, only a small slice of it activates on any given token, which sets up the next surprise. Size and speed are not the same thing.

Here is the part people miss. On a unified memory box, the limit on generating text is not raw computing power. It is bandwidth. To produce each new word, the machine has to stream the active weights out of memory.

And that memory delivers a few hundred gigabytes per second. So your speed ceiling is roughly that bandwidth divided by how many bytes you move per word. Now look at what quantization actually does. It does not just make the model smaller on disk.

It makes every weight fewer bytes to stream. Halve the bytes per weight and you roughly double the words per second. That is the hidden second payoff.

And here is what makes this elegant. The same shrinkage that makes the model fit also makes it run faster. Fit and speed are the same lever, pulled once. But total size can still fool you.

And the next comparison shows exactly how. Bigger doesn't mean slower. That assumption will cost you. Take a 70 billion dense model.

At 8 bits, every single weight has to be streamed for every word, and on this class of box it crawls at under three words per second. Technically working, but painful.

Now take that 117 billion model from a moment ago. The 4-bit mixture design. On paper, it is larger. In practice, it produces around 40 words per second, more than 10 times faster.

What changed? A mixture model only activates a small slice of itself for any given word. A few billion parameters, not the full stack. So, what sets your speed is not the total parameter count.

It is how many bytes you actually move per word. When you choose a model, read two numbers. Total size decides whether it fits. Active size sets how fast it runs.
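The bandwidth argument gives a quick upper bound on speed. In the sketch below, 250 GB/s stands in for "a few hundred gigabytes per second" and 5 billion stands in for "a few billion" active parameters. Both numbers are assumptions for illustration.

```python
def tokens_per_second_ceiling(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on generation speed when streaming weights is the bottleneck."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 250  # assumed memory bandwidth

dense = tokens_per_second_ceiling(70e9, 8, BANDWIDTH_GB_S)      # every weight is active
dense_4bit = tokens_per_second_ceiling(70e9, 4, BANDWIDTH_GB_S)
moe = tokens_per_second_ceiling(5e9, 4, BANDWIDTH_GB_S)         # assumed ~5B active

print(f"70B dense at 8 bits : at most {dense:.1f} tokens/s")
print(f"70B dense at 4 bits : at most {dense_4bit:.1f} tokens/s")
print(f"mixture, 5B active at 4 bits: at most {moe:.0f} tokens/s")
```

These are ceilings, so real throughput lands below them, but the ordering matches the comparison above: the dense model is stuck in single digits while the mixture model has room for tens of tokens per second.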

So, how do you actually choose? Start with four bits. For the large majority of uses, it gives you the best quality for every byte you spend. It's your default, full stop.

If you have memory to spare, step up to five or six. The quality is essentially perfect, and you're not hurting for room. But if you're right at the edge, resist the urge to crush a giant model down to two bits. A smaller model held at six will almost always beat a larger one mangled to two.

And for a mixture model, check both numbers. Total size for whether it fits. Active size for how fast it runs. On a 128 GB box, the comfortable home is a 30 to 70 billion model quantized to four, five, or six bits.

None of this stays abstract for long. The tools are one command away. With a popular desktop runner, you pull a 4-bit model and start chatting in a single line.

The runner downloads the quantized weights and loads them. No wrangling required. Got a full precision model of your own? One command converts it down to the 4-bit sweet spot.

The tool even reports the exact bits per weight and how much smaller the file became. You can inspect all of that before you ever commit it to disk. And the payoff we've been building toward? It's real. That 100 billion parameter model, in its native 4-bit form, pulls down and runs interactively on a single desktop box.

A model that 2 years ago lived only in a data center now answers you from the machine under your desk. That's what quantization buys. Here's the whole arc in one breath. A weight is just a number.

Quantization shrinks that number with a simple map: divide by a scale, round, and offset. The hard part is outliers. A few enormous values wreck a naive scale, and every method that matters is a different way of taming them. Spend your bits where they count, the way the 4-bit sweet spot does, and you keep almost all the quality at a quarter of the size.

On a 128 GB box, that means a small model is trivial, a 70 billion model is comfortable, and a frontier-scale model sits right at the edge of possible. And the same shrink that makes a model fit is what makes it fast. Fewer bits, bigger models, running on your own desk. That's the lever.