
How AI Generates Text vs Images

Text and image generation look like the same trick from the outside, but under the hood they are two completely different machines: a language model that writes one token at a time like a typewriter, and an image model that sculpts a picture out of noise. Different math, different training, different failure modes.

16 min read

Watch (17:28)




Full transcript (from the video)

Type a sentence into a phone chat box and you get an essay back. Type a different sentence and you get a painting. From the outside, it looks like the same intelligence doing two flavors of the same trick. It is not.

Under the hood, generating text and generating images are two completely different machines: different math, different training data, and they even fail in completely different ways. Those are not minor distinctions. In the next 20 minutes, we are going to open both machines. A language model writes one word-piece at a time, like a typewriter that predicts its own next key.

An image model works nothing like that. It starts from a screen full of television static and sculpts a photograph out of the noise. Those are two very different processes. By the end, you will know exactly why image models historically could not spell and why text models cannot draw.

You will also see why the two designs are now borrowing each other's best ideas. Here is the biggest misconception to clear up first. When a chat assistant hands you a picture, the model you were chatting with almost never painted it. What actually happens is a handoff.

The language model takes your request, rewrites it into a detailed caption, and passes that caption to a completely separate image model. The image model has never read your conversation. It only sees the caption. Then the result drops back into the chat as if one mind did everything.

So the right mental model is a director briefing a painter. The director is fluent in language, understands your intent, and writes very precise instructions. The painter does not. The painter only knows how to turn instructions into pictures. Two specialists, one seamless interface.

Keep that split in mind because everything that follows is about how differently those two specialists actually work. Forget letters, forget words. A language model sees tokens and nothing else. A token is a chunk of text on average three or four characters long.

Common words like "the" or "and" get a token of their own. Rarer words get chopped into pieces. The word "strawberry", for example, is often split into several chunks like "straw" and "berry", or even stranger fragments. The model's entire universe is a fixed vocabulary, typically somewhere between 50,000 and 200,000 of these chunks.
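To make the chunking concrete, here is a minimal sketch of a tokenizer. The vocabulary and the greedy longest-match rule are assumptions made for illustration; production tokenizers learn their vocabularies from data and use different splitting algorithms.

```python
# Toy vocabulary (hypothetical): real vocabularies hold 50,000 to 200,000 entries.
VOCAB = {"the": 0, "and": 1, "straw": 2, "berry": 3, " ": 4}
# Fall back to single letters so any lowercase input can still be encoded.
for ch in "abcdefghijklmnopqrstuvwxyz":
    VOCAB.setdefault(ch, len(VOCAB))


def tokenize(text):
    """Split text into vocabulary chunks, always taking the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return tokens


chunks = tokenize("strawberry")
ids = [VOCAB[c] for c in chunks]
print(chunks)  # ['straw', 'berry']
print(ids)     # [2, 3]
```

The model only ever receives the ID list. Nothing in `[2, 3]` says how many times the letter r appears.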

Every prompt you type gets translated into a sequence of vocabulary entries before the model ever touches it. This explains one of the most famous failure modes. When people ask a model how many letter Rs are in strawberry, it often stumbles. It is not stupid.

It literally cannot see the letters. It sees two or three opaque chunks, and it has to remember from training what letters those chunks happen to contain. Asking it to count letters is like asking you to count the threads in a shirt you are only allowed to see from across the room. Watch a model respond in real time and you see something telling.

Words appear one at a time, not in a burst. That is not a loading animation. You are watching the loop itself. Here is how it works.

The model takes everything written so far and runs one pass through the network. That pass scores every token in the vocabulary. Not a sentence, not a paragraph, just a giant scoreboard: the next chunk is probably this, maybe that, almost certainly not those.

One token gets picked from the top of that scoreboard. It gets added to the end. Then the whole thing runs again with the slightly longer text as input.
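The loop can be sketched in a few lines. The `SCORES` table below is a hypothetical stand-in for the network, and for brevity it only looks at the last token, whereas a real model conditions on everything written so far.

```python
# Hypothetical scoreboard: for each previous token, a score per candidate next token.
SCORES = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 2.5, "dog": 2.0},
    "a": {"cat": 1.0, "dog": 1.5},
    "cat": {"sat": 3.0, "<end>": 0.5},
    "dog": {"sat": 2.0, "<end>": 1.0},
    "sat": {"<end>": 4.0},
}


def score_next(tokens):
    """Stand-in for one full network pass: returns a score per candidate token."""
    return SCORES[tokens[-1]]


def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        scores = score_next(tokens)         # one pass, one scoreboard
        best = max(scores, key=scores.get)  # pick the top entry
        if best == "<end>":
            break
        tokens.append(best)                 # commit; earlier tokens are never revisited
    return tokens[1:]


print(generate())  # ['the', 'cat', 'sat']
```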

That is the whole trick. A 500-word answer means roughly 700 separate predictions. Each one is a full pass through the network. Each one sees everything that came before it, and that last point matters.

Every token is a fresh decision. There is no going back to fix an earlier one. That is why a model can start a sentence well and lose the thread by the end. The loop has no undo.

So what actually happens inside that single prediction? The architecture is called a transformer, and its core trick is something called attention. Here is what attention does. Take the unfinished sentence "the keys to the cabinet are ... so I cannot".

Every position in that sentence looks at every other position and decides which words matter most. The word at the end needs to figure out what comes next. Attention lets it lean heavily on "keys" and "cabinet" and lightly on everything else. That is how the model tracks what is relevant.
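As a rough illustration of that weighting, here is scaled dot-product attention for a single position. The two-dimensional vectors are made-up numbers chosen so that "keys" and "cabinet" score highest. A trained model learns its vectors and uses far more dimensions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query position."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # how much each word matters; sums to 1
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, mixed


words = ["the", "keys", "to", "the", "cabinet"]
# Hypothetical vectors, one per word (used as both keys and values here).
vectors = [[0.1, 0.0], [2.0, 0.5], [0.0, 0.1], [0.1, 0.0], [1.5, 0.5]]
query = [2.0, 1.0]  # hypothetical query from the final position

weights, mixed = attend(query, vectors, vectors)
for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.2f}")
```

The printed weights concentrate on "keys" and "cabinet". The function words each get only a few percent.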

This does not happen in one pass. It happens in dozens of stacked layers. Early layers settle grammar and syntax. Middle layers assemble facts and relationships.

Late layers converge on the actual next token. Each layer takes the previous layer's rough guess and sharpens it. Now, here is the part that surprises most people. There is no database inside.

Nothing is ever looked up. Every fact the model appears to know is smeared across billions of numerical weights adjusted during training. Knowledge is not stored the way files are stored on a disk. It is stored the way a habit is stored in your muscles. The pattern is baked in, not filed away.

Deterministic means the same prompt gives you the same answer every time. That sounds useful, but it also means safe, predictable, and dull, because the highest-scoring token is almost always the most boring one. So the model doesn't just take the top pick. It rolls weighted dice, and you get two dials to shape those dice before they fall.

The first is temperature. Turn it down and the model clings to the safe words: precise, repetitive. Turn it up, and the underdogs get a real chance. That's where creativity lives, right up until it tips into nonsense. The second dial is top-p.

Before the roll even happens, top-p cuts the long tail. Out of 200,000 possible tokens, it keeps only the few dozen that together account for most of the probability. Everything else is gone.
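Both dials fit in one short function. This is a minimal sketch with a three-token scoreboard. The function name and the example scores are illustrative assumptions, not any particular library's API.

```python
import math
import random


def sample_token(scores, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token from a scoreboard using temperature and top-p."""
    tokens = list(scores)
    # Temperature: dividing by a small number sharpens the gaps, a large one flattens them.
    scaled = [scores[t] / temperature for t in tokens]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, t) for e, t in zip(exps, tokens)), reverse=True)
    # Top-p: keep the smallest set of top tokens whose probabilities add up to top_p.
    kept, cumulative = [], 0.0
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Roll the weighted dice over whatever survived the cut.
    return rng.choices([t for _, t in kept], weights=[p for p, _ in kept], k=1)[0]


scores = {"the": 3.0, "a": 2.0, "zebra": -1.0}
rng = random.Random(0)
print(sample_token(scores, temperature=0.01, rng=rng))  # effectively always "the"
print(sample_token(scores, temperature=5.0, rng=rng))   # underdogs get a real chance
```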

That's the whole mystery. Why the same question gets a different answer every time. It's not mood. It's not personality.

It's a random number generator pulling from a carefully reshaped deck. Image generation couldn't be more different. A diffusion model doesn't start with a blank canvas. It starts with a full one, full of garbage.

The starting point is a field of pure random noise like a television tuned to a dead channel. Every pixel is random. There is no image hiding in there. No secret structure, just static.

The trick is in how these models were trained. You take millions of real photographs and you ruin them on purpose, adding a little noise, then more, until each photo has dissolved into static. The model watches this destruction at every stage and learns to run it backwards. Show it a slightly noisy image and it learns to estimate exactly what noise was added, so that noise can be subtracted away.
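One common way to write that destruction down is a weighted blend of the clean image and Gaussian noise. The sketch below uses a four-value "photo" and a blend weight named `alpha_bar`, both chosen here for illustration. It shows that if the noise estimate is exact, subtracting it recovers the original.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Blend a clean signal with Gaussian noise.

    alpha_bar close to 1 leaves the image nearly intact; close to 0 is almost pure static.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * n
             for x, n in zip(clean, noise)]
    return noisy, noise


def remove_noise(noisy, noise_estimate, alpha_bar):
    """Invert the blend, given an estimate of the noise that was added."""
    return [(y - math.sqrt(1 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for y, n in zip(noisy, noise_estimate)]


rng = random.Random(0)
photo = [0.2, 0.8, 0.5, 0.1]  # hypothetical four-pixel "photograph"
noisy, true_noise = add_noise(photo, alpha_bar=0.5, rng=rng)

# A trained network would have to predict the noise from `noisy` alone.
# Here the true noise stands in for a perfect prediction.
restored = remove_noise(noisy, true_noise, alpha_bar=0.5)
print([round(v, 6) for v in restored])  # [0.2, 0.8, 0.5, 0.1]
```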

That one skill, learned at every level of noisiness from nearly clean to completely destroyed, is the entire engine. Generation is just destruction played in reverse. You hand the model pure static and ask it to undo damage that was never actually done to any real photo, and it hallucinates a photograph that was never taken. Generation runs that learned skill in a loop.

But listen to how different this loop is from the text loop. The model looks at the static and estimates the noise in it. Subtract a slice of that estimate and the canvas becomes slightly less random. Repeat, typically for somewhere between 20 and 50 passes, each one peeling away another layer of static.

The order in which an image is born is fascinating. The earliest steps settle the big questions: where the visual mass sits and which regions stay dark or bright. Squint at the canvas a third of the way through and the composition is already locked, even though everything is still mush.

The middle steps carve out shapes and objects. The final steps are pure finish work: skin texture, the glint in an eye. And here is the key contrast with text. In every single pass, every pixel on the canvas updates at the same time.
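Here is a toy version of that loop. The `estimate_noise` function is an oracle that already knows the target picture, which a real model does not. It is a placeholder so the estimate-subtract-repeat structure can run on its own.

```python
import random

TARGET = [0.9, 0.1, 0.5, 0.7]  # hypothetical "photograph" the oracle steers toward


def estimate_noise(canvas):
    """Oracle stand-in for the trained network: the part of the canvas that is not image."""
    return [c - t for c, t in zip(canvas, TARGET)]


def denoise(steps=30, seed=0):
    rng = random.Random(seed)
    canvas = [rng.gauss(0.0, 1.0) for _ in TARGET]  # start from pure static
    errors = []
    for step in range(steps):
        noise = estimate_noise(canvas)
        fraction = 1.0 / (steps - step)  # a slice now; the final pass removes what is left
        # Every value on the canvas is updated in the same pass.
        canvas = [c - fraction * n for c, n in zip(canvas, noise)]
        errors.append(max(abs(c - t) for c, t in zip(canvas, TARGET)))
    return canvas, errors


final, errors = denoise()
print(round(errors[0], 3), round(errors[14], 3), round(errors[-1], 3))
```

The three printed errors shrink from the first pass to the last, and the whole canvas moves together at each step.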

There is no left to right, no first word and last word. The image is not written. It condenses everywhere at once, like a photograph developing in a darkroom. Raw pixels are brutally expensive.

A large image has millions of individual values. Running dozens of denoising passes over all of them is brutally slow. So modern image generators sidestep that cost with one key trick. They never work on pixels at all.

Before diffusion, a separate network compresses the image down to a much smaller grid of numbers. It keeps the meaning and throws away redundant detail. That network is called an autoencoder. The compression is dramatic, often around 50 times fewer numbers than the raw pixels.
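The "around 50 times" figure is easy to check with arithmetic. The sketch below assumes an 8x downsample in each direction and 4 latent channels, a configuration used by some widely deployed latent diffusion models. Other systems use different numbers.

```python
def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    """Count the values in raw pixels versus the compressed latent grid."""
    pixel_values = height * width * channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values, latent_values, pixel_values / latent_values


pixels, latents, ratio = compression_ratio(1024, 1024)
print(pixels, latents, ratio)  # 3145728 65536 48.0
```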

All of the sculpting, every denoising pass happens inside that compressed space. That space has a name, latent space. It just means the hidden representation the model works from. Only at the very end does a decoder network inflate the finished sketch back into full resolution pixels.

There is an analogy that sticks. The artist does not paint the mural directly on the wall. The artist paints a small dense study in the studio and a separate craftsman scales it up to the wall. Almost every image generator you have used works this way.

Every word you type becomes a number. A smaller language model, a cousin of the chat models from earlier, reads your prompt and converts it into a sequence of values. Those values get injected into every single denoising step. Here's what that looks like in practice.

Say your prompt is a cat watching the sunset. As each patch of the image forms, the model scans those numbers and asks, "Which words describe me?" The patch resolving into fur leans toward cat, and the patch fading into orange leans toward sunset. Word after word, your caption pulls regions of static toward matching shapes. That mechanism is called cross-attention.

It is the same attention mechanism that powers text generation, just pointed sideways at image regions instead of forward through a sentence. But there's a second lever called guidance scale. The model internally runs two predictions at each step: one that follows your caption and one that ignores it entirely.

The final update amplifies the gap, pushing away from the generic version and harder toward the caption version. Turn guidance up and the model obeys more strictly, but the image starts to look overcooked. It plays the same role as temperature in text generation. Creativity versus control.
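"Amplifying the gap" is a one-line formula: start from the caption-free prediction and move along the difference between the two, scaled by the guidance value. The numbers below are made up for illustration.

```python
def guided_noise(with_caption, without_caption, guidance_scale):
    """Push the prediction away from the generic version, toward the caption version."""
    return [u + guidance_scale * (c - u) for c, u in zip(with_caption, without_caption)]


with_caption = [0.5, -0.2, 0.1]    # hypothetical prediction that follows the caption
without_caption = [0.3, 0.0, 0.1]  # hypothetical prediction that ignores it

for scale in (0.0, 1.0, 7.5):
    print(scale, [round(v, 3) for v in guided_noise(with_caption, without_caption, scale)])
```

A scale of 0 returns the generic prediction and 1 returns the caption prediction. Larger values overshoot past it, which is where the overcooked look comes from.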

The same trade-off in a different costume. Put the two machines side by side. The contrast is stark. The text machine works on discrete symbols. There are only so many tokens, and each step commits, in strict left-to-right order, to exactly one of them: 700 small irreversible decisions, each conditioned on all the previous ones.

Once a token is placed, it is placed. The model cannot reach back and quietly fix the third sentence while writing the tenth. The image machine works on continuous values. No vocabulary of pixels, just numbers that can shift by any amount.

Nothing is committed early. Every pass revises the entire canvas at once. And a region that looks like a tree at step 10 can still dissolve and become a building by step 20. Revision isn't a special feature.

It is the entire process. One machine is a typewriter with no backspace. The other is a sculptor who never stops touching every part of the clay at once. Same chat box, completely different physics. Now the two famous failure modes finally make sense, and they are perfect mirror images of each other.

For years, image models produced gorgeous photos, but the signs read like an alien alphabet: gibberish storefronts and menus from another dimension. Here is why. To the image machine, a letter is not a symbol.

It is just more pixels, a squiggle of light and dark. There is no vocabulary to protect it. No rule that says letters come from a fixed set. And remember, the sculpting happens in the compressed latent space where the fine strokes of small text are exactly the kind of detail that compression smears away.

The model knows lettering texture belongs on a sign. It just freestyles the strokes. The mirror failure is just as funny. Ask a chat model to draw you a picture using keyboard characters, and you usually get something a 5-year-old child would politely decline to put on the fridge.

Spatial layout is a foreign country to a machine that experiences the world as a one-dimensional stream of tokens. Each machine is illiterate in exactly the dimension the other calls home. Picture a dataset engineer writing captions. An old dataset would label a storefront photo as "a coffee shop exterior".

A newer one spells it out: a coffee shop whose sign reads "Daily Grind" in bold black letters. Do that millions of times and the model finally learns to connect character sequences to letter shapes instead of guessing from context. That is fix one, better training data.

Fix two is the encoder, which reads your text prompt before the image is drawn. Early encoders captured the gist of a caption but blurred the exact characters. Swap in a stronger language model as the encoder and the precise spelling survives all the way into cross-attention, where it shapes the actual pixel generation process. Fix three is the most interesting.

Some recent systems drop prose prompts entirely. They accept structured layout instructions, with boxes that say this headline goes here and contains these words in full. Typography stops being something the model freestyles and becomes a layout it executes. The painter, in other words, is learning to take dictation from the director.

Two design traditions that started apart are now meeting in the middle. The first move toward convergence comes from the text side. Take any image and run it through an autoencoder, a network that compresses data down and then reconstructs it. Train that autoencoder with a fixed library of a few thousand small texture patches.

Think of it as a vocabulary for visual textures. The autoencoder learns to describe any image as a sequence of entries from that library. That library is called a visual codebook. Suddenly an image is no longer a continuous field of numbers.

It is a sequence of discrete tokens, exactly like a sentence. Once an image is a token sequence, a transformer can generate it like text, the same way it generates prose: predict the next patch, append it, repeat.

That is the same autoregressive loop from the start of this video, now pointed at pictures. Several recent frontier models use this approach to produce images natively inside the chat model rather than handing off to a separate diffusion painter.
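The codebook step that turns a picture into tokens can be sketched as a nearest-neighbor lookup. The four-entry codebook and two-value patches below are invented for illustration. Real codebooks are learned and hold thousands of higher-dimensional entries.

```python
# Hypothetical codebook: each entry is a tiny "texture" described by two numbers.
CODEBOOK = [
    [0.0, 0.0],  # entry 0
    [1.0, 1.0],  # entry 1
    [0.0, 1.0],  # entry 2
    [1.0, 0.0],  # entry 3
]


def nearest_entry(patch):
    """Return the index of the codebook entry closest to this patch."""
    def squared_distance(entry):
        return sum((p - e) ** 2 for p, e in zip(patch, entry))
    return min(range(len(CODEBOOK)), key=lambda i: squared_distance(CODEBOOK[i]))


def image_to_tokens(patches):
    return [nearest_entry(p) for p in patches]


def tokens_to_image(tokens):
    return [CODEBOOK[t] for t in tokens]


patches = [[0.1, 0.05], [0.9, 0.95], [0.2, 0.8], [0.85, 0.1]]
tokens = image_to_tokens(patches)
print(tokens)  # [0, 1, 2, 3]
```

Once the image is a list of integers like this, the next-token loop can be applied to it unchanged. `tokens_to_image` maps a generated sequence back to patches.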

It is slower than diffusion, but it buys something important. The same network that read your entire conversation is the one placing every patch. That shared context is why these systems follow complicated wordy instructions noticeably better than the separate painter pipelines do. The convergence runs the other way too and this direction might be the more disruptive one.

Researchers have been building diffusion models for language. And here is what that actually means. You ask a question. Instead of generating one word at a time, the model throws down a rough draft of the entire answer all at once.

Every token is blurry, half decided. Then it refines the whole draft in parallel over a handful of rounds. Words sharpen everywhere, the same way an image emerges from noise.
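One simple flavor of this idea is confidence-based parallel unmasking: start with every position blank and fill in the most confident positions a batch at a time. The sketch below is a toy under heavy assumptions. The `DRAFT` table is an oracle standing in for the model, and unlike the systems described here, it never revises a position once it is filled.

```python
MASK = "_"
# Hypothetical oracle: the word for each position and the model's confidence in it.
DRAFT = [("the", 0.9), ("cat", 0.6), ("sat", 0.8), ("on", 0.95), ("the", 0.7), ("mat", 0.5)]


def refine(rounds=3):
    answer = [MASK] * len(DRAFT)
    history = []
    for r in range(1, rounds + 1):
        decided_goal = len(DRAFT) * r // rounds  # positions that should be filled by now
        undecided = [i for i, word in enumerate(answer) if word == MASK]
        undecided.sort(key=lambda i: DRAFT[i][1], reverse=True)  # most confident first
        to_fill = decided_goal - (len(answer) - len(undecided))
        for i in undecided[:to_fill]:  # several positions settle in the same round
            answer[i] = DRAFT[i][0]
        history.append(" ".join(answer))
    return history


for line in refine():
    print(line)
# the _ _ on _ _
# the _ sat on the _
# the cat sat on the mat
```

Six tokens are settled in three rounds instead of six sequential steps, and the positions fill in by confidence rather than left to right.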

Early rounds lock in the structure. Later rounds settle the exact word. Two things make this exciting. The first is raw speed.

A handful of parallel refinement rounds can replace hundreds of sequential predictions. Early systems already show several times faster generation. The second is the missing backspace. A left to right model commits to each word and can never go back.

But a diffusion model revises by nature. Say the structure of the answer is wrong. The early rounds can dissolve it and rebuild it entirely. Think of how a tree in a forming image can still become a building.

The pixels were never locked in. That same flexibility now applies to language. The typewriter is learning to sculpt at the same time as the sculptor is learning to type. Here's the last structural secret, and it hides in the kitchen.

What these machines were fed shaped everything they can and cannot do. A frontier language model trains on trillions of tokens, books, encyclopedias, code repositories, forum arguments, the written exhaust of the internet. The diet is so vast that a human reading around the clock would need tens of thousands of lifetimes to get through it. An image model trains on billions of image and caption pairs scraped from the web and filtered for quality and aesthetics.

Billions sounds close to trillions. It isn't. It's orders of magnitude smaller, and the captions are short, sloppy, written for humans who could already see the picture. That gap explains why data quality matters so much in the image world.

Labs that curate their training images and rewrite captions keep winning. A direct result of the diet. And here's the reveal. A language model has read everything and seen nothing.

An image model has seen everything and read almost nothing. Every weakness we covered today is downstream of those two diets. All of this theory cashes out into practical advice you can use today.

When you prompt a chat model, you are talking to the director. Give it intent, context, and constraints and let it reason. Feel free to be vague, to think out loud, to correct it mid-conversation. Every new token is conditioned on the whole exchange.

That's the director's nature. When you prompt an image model, you are briefing the painter, and the painter only understands captions. The best image prompts read like a description of the finished picture written after the fact. A photograph of an elderly fisherman mending a net at golden hour with a harbor softly out of focus behind him.

Describe what exists, not what you want to happen. Then remember the revision difference. A chat model genuinely iterates, holding the thread of your conversation. Most diffusion models start from fresh static every single time.

So small wording changes can produce a completely different image. If you love the result, keep the seed and change one thing at a time. You are not editing a picture. You are rolling a fresh universe with slightly different physics.
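The seed matters because it fixes the starting static. A minimal sketch using Python's standard random number generator, with `starting_noise` as a hypothetical helper name:

```python
import random


def starting_noise(seed, size=4):
    """Generate the initial static for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = starting_noise(42)
same_b = starting_noise(42)
different = starting_noise(43)

print(same_a == same_b)     # True: same seed, same starting canvas
print(same_a == different)  # False: new seed, new starting canvas
```

With the seed held fixed, the only thing that changes between runs is the caption, which is what makes one-change-at-a-time comparisons meaningful.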

One box, two completely different machines, a typewriter predicting the next token with no backspace, and a sculptor that condenses photographs out of static, revising every pixel at once. One speaks in symbols, the other in light, but the walls between them are coming down. Image models learn to spell by training on honest captions. Chat models learn to paint by treating patches of pixels as words.

Diffusion is coming for text, promising parallel drafts and real revision. The end point is a single model that reads, writes, sees, and draws with one set of weights. Until then, every great result comes from respecting what each machine actually is. Talk to the director.

Caption for the painter. Now that you have seen the machinery, you will never prompt blind again. If this changed how you think about these tools, subscribe. Next, we are opening the machine that turns text into video where both engines have to work together at 24 frames a