Skip to content
All posts
Video

Headroom: Your AI Agent Is Wasting Money On Tokens

The most expensive part of an AI agent is usually not the answer - it is everything the agent reads on the way there. Headroom compresses that context before the request leaves your machine, keeps the original nearby, and lets the model ask for the full detail only when it needs it.

10 min read

Watch (10:47)


Overview

The most expensive part of an AI agent is usually not the answer - it is everything the agent reads on the way there. Headroom compresses that context before the request leaves your machine, keeps the original nearby, and lets the model ask for the full detail only when it needs it.


Full transcript (from the video)

The most expensive part of an AI agent is often not the answer. It is everything the agent reads on the way to the answer. A search tool returns a wall of results. A build tool returns pages of logs.

A retrieval system hands over chunks that mostly repeat each other. Then the model gets built for reading all of it, even if the useful clue was one line near the bottom. Headroom is interesting because it attacks that build before the request leaves your machine. It compresses what the agent is about to send, keeps the original nearby, and gives the model a way to ask for more, only when it truly needs more.

That is the hook for this video. We are not talking about shorter prompts as a productivity tip. We are talking about changing the economics of agent software. If you build with language models, this is the difference between a demo that feels cheap and a production loop that quietly burns money every hour.

The mistake is thinking the prompt is the whole cost center. In a real agent, the prompt is usually tiny compared with the context that surrounds it. The agent asks a tool for search results. So the model reads every title and snippets.

It asks a shell command what failed. So the model reads the quiet lines before the error and the quiet lines after it. It asks a retrieval system for background. So the model reads five chunks that all explain the same idea in slightly different language.

This is not useless data. The agent needs a broad view at first. The waste appears because the model receives the broad view in its raw form every turn at full price. Headroom sits in that gap.

It says the system should keep the raw data locally, but the model should first see a C compact version that preserves the shape, the outlier, the errors, and the parts most likely to matter. Headroom's vet is simple and aggressive. Do not wait until the context window is full and a torn. Then ask the model to summarize its own mess.

Compress the incoming material before it ever reaches the provider. That sounds obvious until you hit the hard part. Different content fails in different ways. If you crush source code like pros, you lose signatures and imports.

If you summarize a JSON array like an essay, you lose the one row that matters. If you trim logs from the top, you may delete the first real error. Headroom works because it treats compression as routing, not as one universal summary button. It detects the content type, chooses a compressor built for that shape, and then stores the original soaper.

The model can retrieve it later. The best mental model is a local pressure valve. The provider sees a smaller request while your machine keeps the full evidence. The default path is short enough to keep in your head.

First cash aligner tries to make the front of the request stable. Provider prompt caches are picky. If the prefix changes by a little, the cache miss can be expensive. So headroom moves volatile material away from the part that should stay identical.

Next content router looks at each message and decides what kind of content it is dealing with. Structured data and source code take different routes, while plain text can use a modelbased compressor when that option is installed. The reversible layer stores the original content and adds retrieval instructions when compressed markers appear. That last piece is why the design is more credible than a normal summary.

The model is not forced to call trust the compressed view forever. It starts with the cheap U then asks for the full evidence only when the task demands it. This is the part that makes headroom more than a nicer truncation function. The router does not ask how do I make this smaller in the abstract.

It asks what kind of thing am I looking at? A huge JSON response usually has repeated keys, repeated shapes, and a small number of unusual rows. Smart Crusher can factor out constants, sample boring reg, and keep the weird points that are more likely to carry the answer. Source code has a different failure mode.

The model needs the function names, the call shape, the types, and the imports because those are the handles it reasons over. Logs have another pattern. You want the first failure, the rare line, and the surrounding context, not 100 lines of normal startup noise. Good token efficiency is not just fewer tokens.

It is fewer low-v valueue tokens while the loadbearing tokens stay visible. Aggressive compression without recovery is a gamble. It saves tokens right up until the missing detail was the entire answer. Headroom's safety story is CCR, which stands for compress cache retrieve.

When content is compressed, the original is stored locally and the compressed output carries a reference. If the model can answer from the compact view, the system wins immediately. If the model needs more, it can use the headroom retrieval tool to ask for the original contents or to search inside it. The user does not have to manually paste a huge payload back into the chats.

The proxy can handle that loop automatically. This changes the trade-off. You can compress hard because the full data is not gone. The model starts with the cheap map, then opens the expensive drawer only when it has a reason.

That is the difference between token saving as a hack and token saving as an architecture. The fastest way to try Headroom is the proxy path. You install the package, start the local proxy, and point your client at the proxy instead of pointing it directly at the provider. That is the zero code version of the story.

Your app still thinks it is calling a normal model endpoint. Headroom sees the request first, runs the compression pipeline, forwards the optimized request, and keeps track of what it saved. This is especially useful when you have an existing agent that already works and you do not want to thread it in new library through every call site just to test the economics. Run it on a branch.

Send a representative workload through it. Watch the stats end point. The number you care about is not a single heroic demo. You care about repeated savings across the boring calls because those are the calls that become your monthly bill.

If you own the code, the library path is cleaner. You call the compression function right before the model request. Then pass the compressed messages into your existing clients. That gives you a very direct control surface.

For a coding agent, the defaults protect recent turns and avoid compressing the active user message. for a document system or a retrieval pipeline. You can choose to compress more of a user supplied content and keep a larger fraction when accuracy matters more than cost. The key implementation habit is to measure before you celebrate log tokens before tokens after transforms and answer quality on the same workload.

Then split traffic. One group calls the provider normal, one group calls through headroom. If the answers stay stable and the token curve drops, you have a real optimization. If not, you tighten the settings or narrow headroom only the content type where it helps.

For agent workflows, the wrapper commands are the shortcut. Instead of redesigning the agent, you launch the agent through headroom. The project advertises rappers for tools like claw code, codecs, cursor, ader. It also exposes a model context protocol server with tools for compression, retrieval, and stack.

The more interesting idea is shared memory across agents. A common failure in daily work is paying each agent to rediscover the same repository facts. One tool scans the codebase in the morning. Another tool repeats that scan after lunch and the bill counts both.

Headroom's memory features are meant to make that context portable and dduplicated. That matters when you switch between assistants during the same project. You are not just saving tokens inside one request. You are reducing repeated reading across the whole working day.

The money math is boring which is exactly why it matters. Input cost equals tokens sent multiplied by provider price multiplied by how many times the loop runs. A single bloated tool results might not feel painful. Put it inside a daily coding agent, a customer support triage job or a retrieval system that handles many requests and the same waste becomes a line item.

Headroom's readme shows large reductions on real agent style workloads, including code search and incident debuting. Do not copy those numbers into your forecast blindly. Treat them as a reason to measure your own workload. The right question is where do we repeatedly send large context that has structure and duplication?

that is the high value target. Tiny chat turns are not worth optimizing first. Big tool outputs, repeated retrieval chunks, and codebased exploration are where token efficiency turns into actual cost reduction. There is a second kind of savings that is easy to miss.

Many providers discount or speed up repeated prompt prefixes, but only when the bites line up. Agents are bad at that. They insert dates, request identifiers, chin volatile history near the front of the prompt. A tiny difference can turn a cached prefix into a full price read.

Headroom includes cache alignment so the f stable part of the request has a better chance to stay stable. The proxy also has modes that let you choose between maximum immediate token reduction and stronger cash stability across longer conversations. That choice matters. If your workload is one shot analysis of a giant payload, token mode may be the better fit.

If your workload is a long agent session with many repeated turns, cash stability can become the larger win. The point is to optimize the bill you actually have, not the benchmark you wish you had. Compression is the first layer. Memory is where the savings become a habit.

Picture an agent that learns your system. Project keeps migrations in one place. It learns your test runner. It learns the pattern your team keeps rejecting.

That knowledge should not disappear when you open a different assistance. Headroom's shared context tools are built for that handoff. Agents can pass compact context through a shared store instead of reading the same files again and again. The failure learning command goes further.

It studies failed sessions, then writes durable corrections into the fates, guidance files your agents already read. That is not only a cost-win, it is a quality win. You spend fewer tokens rediscovering the obvious, and the next session is less tough, likely to repeat the same mistake. For teams, privacy matters, too.

The shared store stays local to your environment. So, reusable context does not have to become provider memory. But boring warning is important. >> >> Do not run compression everywhere just because the percentage looks good.

Tiny prompts do not need it. Highly sensitive legal review, financial review, or medical review may need exact text reserve. You have a careful retrieval and audit path. A task that asks the model to compare wording line by line is different from a task that asks it to triage a long log.

The right metric is not the compression percentage. The right metric is stable answer quality at lower cost. Retrieval calls are also a signal. If the model constantly has to retrieve the original, your compact view may be too thin or the fake wrong content type may be compressed too aggressively.

Start narrow by compressing tool outputs first, protecting recent messages and logging what changes. Then expand only where the evidence says it helps. Token efficiency is engineering, not a magic coupe. Here's the practical way to implement it without turning your app into an experiment no one trusts.

Pick one boundary first. If you want speed, start with the proxy. If you own the product code, call the library before the model request. If your pain is daily coding agents, start with the wrapper.

Measure the baseline before you change anything because the best story is tokens saved while task success stays flat. Then target the noisy surfaces first. Tool output logs, retrieval chunks, and code search usually have structure, repetition, and enough volume to matter. Keep recovery visible.

If retrieval frequency spikes, that is not failure, but it is feedback that the compact view needs tuning. The final rule is to avoid chasing the smallest prompt and chase the cheapest reliable answer instead. Headroom is compelling because it gives you a local layer for doing exactly that.