How to Get Cited by AI Answer Engines
What actually makes a page show up inside ChatGPT, Perplexity, and Google AI Overviews — and how to write for it.
Full transcript (from the video)
For about 20 years, the front door to the internet was a Google results page.
You typed a question, you got 10 blue links and you picked one. Ranking on that page was the entire game and the whole SEO industry grew up around it.
That front door is changing faster than most creators realize. When someone asks an AI assistant a question now, the assistant reads the web on their behalf and writes a single synthesized answer.
The user almost never clicks through. Your video either gets cited inside that answer or it effectively does not exist for that viewer. Your old SEO instincts still help. Clear titles, strong metadata, and good content still matter.
But the ranking surface shifted underneath you, and the way your content gets discovered is no longer the same mechanism it was even two years ago.

When I say AI search, I mean a specific thing: any tool where a user types a question and a language model produces a written answer grounded in current web content. ChatGPT does this by default now for most factual questions. Perplexity was built around live retrieval from the start, and every answer comes with a list of sources inline. Google AI Overviews, the block that appears above the blue links on a growing share of queries, is the same pattern wearing a different coat. Claude and Gemini both do this too, and their share of overall search volume is climbing quarter over quarter. If you are thinking about where your audience discovers new creators, you have to treat all of these surfaces as distribution channels. They are not toys and they are not niche. They are the new first stop for a meaningful slice of your viewers.

Under the hood, every one of these systems runs the same three-step loop. Step one is retrieval.
When a question arrives, the system searches a huge corpus of web content, looks up a few dozen candidate passages that might answer the question, and pulls them into a short working context.
Step two is ranking. Inside that short list, the model weighs each passage on relevance to the exact question and on trust signals like domain authority and freshness. Step three is composition.
The model writes a clean, human-readable answer, and in almost every modern system it cites a small handful of the sources it actually leaned on. That three-step loop is everything. Understanding it tells you exactly where to push to get your video considered, ranked, and cited.
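As a conceptual sketch only (no engine publishes its real pipeline), the loop might look roughly like this in Python. The keyword-overlap scoring and the hand-set authority and freshness numbers are toy stand-ins for the learned relevance and trust models a real engine uses:

```python
# Conceptual sketch of the retrieve -> rank -> compose loop.
# Keyword overlap and hand-set scores stand in for learned models.

def retrieve(query, corpus, k=30):
    """Step one: pull candidate passages that share words with the query."""
    terms = set(query.lower().split())
    hits = [p for p in corpus if terms & set(p["text"].lower().split())]
    return hits[:k]

def rank(query, passages):
    """Step two: weigh relevance plus trust signals like authority and freshness."""
    terms = set(query.lower().split())
    def score(p):
        relevance = len(terms & set(p["text"].lower().split()))
        return relevance + p.get("domain_authority", 0) + p.get("freshness", 0)
    return sorted(passages, key=score, reverse=True)

def compose(ranked, n_cite=3):
    """Step three: write an answer from the top passages and cite them."""
    top = ranked[:n_cite]
    answer = " ".join(p["text"] for p in top)  # a real model synthesizes prose
    sources = ", ".join(p["url"] for p in top)
    return f"{answer}\n\nSources: {sources}"

corpus = [
    {"text": "Webhook retries use exponential backoff.",
     "url": "https://example.com/watch?v=abc", "domain_authority": 2, "freshness": 1},
    {"text": "Configure each webhook endpoint in the dashboard.",
     "url": "https://example.com/docs", "domain_authority": 3, "freshness": 0},
]

query = "how do webhook retries work"
print(compose(rank(query, retrieve(query, corpus))))
```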
Here is the part that most creators miss: your video is not invisible to these systems. YouTube is one of the largest and richest corpora on the open web, and every major answer engine indexes it aggressively. The video title, the description field, the chapter markers, the pinned comment, and, critically, the full auto-generated or uploaded transcript are all available to the retrieval layer. When the ranker compares a passage from your transcript against a paragraph scraped from a blog, they sit on roughly equal footing, as long as yours is clear, specific, and well structured. When your video wins, the answer engine cites you, and the citation links back to your watch page with a timestamp. That is a direct funnel from an AI answer to a new subscriber, bypassing the traditional YouTube algorithm entirely.

I want to be very direct about this next point, because it is the single highest-leverage change most channels can make. The language model does not watch your video. It does not look at the thumbnail. It does not hear your voice. It does not see your face. It reads the transcript. Whatever YouTube auto-generates is usable, but it is noisy: it misplaces punctuation, it confuses homophones, and it drops speaker-specific jargon exactly in the spots that would have ranked you. Taking 20 minutes to hand-correct the transcript, or to upload one you produced alongside your script, gives the retrieval layer a clean document to pull from. Hand-corrected transcripts outperform auto captions on coverage and on the kinds of literal phrase matches that retrieval loves. It is the single cheapest quality upgrade available to you, and almost nobody does it.

Let me make this concrete. The top block is what auto captions produced for a real creator video I reviewed. It has the right energy, it is how people actually talk, but it is a disaster for retrieval. The tool name is wrong. The sentence has no punctuation.
The key technical claim, that it handles callbacks, is buried in informal phrasing. The bottom block is the same content cleaned up: proper sentences, the correct product name, and the technical claim stated in the words a search query would actually use, webhook retries and exponential backoff. That second version will get pulled into an AI answer. The first one will not. The gap between these two costs you nothing but 15 minutes of editing per video, and it compounds across your entire library.
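A hypothetical reconstruction of that before-and-after contrast (the product name and exact wording here are invented for illustration):

```
Auto captions (before):
so yeah acme hooks basically just like handles all the call backs for you it
does the whole retry thing again if it you know fails

Hand-corrected (after):
AcmeHooks handles callbacks for you. If a delivery fails, it retries the
webhook with exponential backoff.
```

The cleaned version states the claim in the literal, searchable language a query would use, which is exactly what the retrieval layer matches on.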
Chapters are the second huge lever. When you mark chapters on a YouTube video, each chapter becomes an independent unit that retrieval systems can pull in isolation. A chapter called "Part One - Setup" is a dead chapter; nobody is ever typing that into a search box. A chapter called "How do I install Verbageto on macOS?", though, maps almost word for word onto a real query. That is the chapter that gets surfaced inside a ChatGPT answer when someone asks about installation. Rephrase every chapter as the question a user would actually type.
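For illustration, a chapter table in that style (timestamps and topics invented):

```
00:00 What is an AI answer engine?
02:45 How do I hand-correct a YouTube transcript?
08:10 How do I write chapter titles as questions?
13:30 How do I add a pinned comment FAQ?
```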
Do this at edit time, before you publish. It takes an extra two minutes per video, and it turns each chapter into a miniature searchable answer in its own right. This is the single easiest structural win on your checklist.

The full metadata stack has four fields that matter for AI retrieval, and you should treat each one as structured data. The title is where you put the most specific, most searchable phrase, front-loaded in the first few words. Nobody is typing a cute, clever title into a search engine.
So save the clever for the thumbnail and make the title literal. The description's first two lines are critical because they are what shows up in the retrieval snippet view, so put your thesis sentence there. The chapters, as we just covered, should be questions. And the pinned comment is an underused superpower: write it as a short FAQ, three to five of the real questions your video answers, each followed by the answer in one or two sentences. The pinned comment gets crawled like any other text on the page, and a well-written one gives the retriever a compact, high-signal version of your whole video.

Here is what that all looks like as a real video description. Notice the thesis sentence up top: it says exactly what the video covers in a single readable sentence that will show up cleanly in any retrieval snippet. Then a bulleted list of the concrete subjects. Then the literal chapter table, each entry phrased as a question, and one clear outbound link at the end. No marketing copy, no emoji parade, no 37 social links. The goal of this description is to be a machine-readable index of what is in the video. If a language model reads only the description, it should already be able to answer the main question your video addresses. That is the bar.

Here is a surprising finding from how modern answer engines actually rank: a single source, no matter how authoritative, gets discounted when it sits alone. If your video is the only place on the open web that makes a particular claim, the ranker treats the claim as risky and will often cite it with a hedge, or pass on it entirely in favor of a weaker claim made by multiple sources. The fix is corroboration. Your thesis needs to appear, in slightly different words, in at least two other places: a short blog post that links back to the video, a Reddit comment in the relevant subreddit that makes the same point, a newsletter issue that references it. You are not gaming the system. You are giving the retrieval layer the independent confirmation it is built to look for.

Here is the minimum publish triangle I recommend for every video you care about ranking. The YouTube video is the anchor, the thing you spent real time on. Same day or next day, publish a written version on your own domain. It does not have to be a full rewrite: pull the thesis, pull the three strongest bullet points, and link the video. Then post a single social thread summarizing the thesis with the link, on LinkedIn, X, Bluesky, whichever platform your audience lives on. The optional fourth leg is a Reddit comment, but only where the topic naturally fits and only if you have standing to post there. Do not spam. Do not astroturf. The triangle works because it is genuine independent echoes of the same claim across domains with different authority signals, and that is exactly the corroboration pattern the retrieval layer is trained to reward.

Now for the step almost no creator does. On your blog post, the written companion to the video, embed structured data using JSON-LD.
Schema.org defines a VideoObject type that was built for exactly this case, and it tells any crawler, human or machine, that a specific chunk of your page is a video with these specific properties. Inside VideoObject there is a subproperty called speakable, and it is especially relevant to AI answer engines. The speakable block explicitly marks the one or two sentences on the page that are the cleanest, most answer-ready summary of the content. When a voice assistant or an AI answer engine is deciding what to quote, the speakable block is a strong hint about what to pull. This sounds advanced. It is not.
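A minimal sketch of such a block, with placeholder values for the name, description, thumbnail, date, and URL (the speakable targets assume your post marks its key sentences with the IDs thesis and summary, as described below):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "How do I set up webhook retries?",
  "description": "Covers webhook retries, exponential backoff, and debugging failed deliveries.",
  "thumbnailUrl": "https://example.com/thumbnails/webhook-retries.jpg",
  "uploadDate": "2025-01-15",
  "contentUrl": "https://www.youtube.com/watch?v=VIDEO_ID",
  "speakable": {
    "@type": "SpeakableSpecification",
    "xpath": ["//*[@id='thesis']", "//*[@id='summary']"]
  }
}
</script>
```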
It is about 12 lines of JSON. Most creators skip it and leave easy ranking points on the table. You can copy a block like the one above, change five strings, and be done. The name and description should mirror what a human reader sees at the top of your blog post. The thumbnail URL, upload date, and content URL tie it back to the YouTube watch page; let the crawler wire up the identity graph. The speakable section at the bottom is the critical part. It uses XPath to point at specific elements in your page, typically one with the ID thesis and one with the ID summary, and it tells the voice and AI answer layer that those are the sentences you want read aloud or quoted. Put this inside a script tag with type application/ld+json in the head or body of your blog post. That is all it takes to go from generic content to machine-readable content with explicit pull targets.

Here is a pattern that took me a while to accept, and it runs counter to everything the hype economy rewards: one viral video almost never moves your AI search ranking. Retrieval layers do not care about view counts anywhere near as much as they care about topical consistency. If you have 20 videos all covering variations on one narrow topic, the ranker learns your channel is a reliable source on that topic. After that, every new video you publish on that topic gets treated as trustworthy almost by default. This is what SEO people call topical authority, and it applies to AI search even more forcefully than it applies to classic search. The practical implication is uncomfortable for most creators: pick one narrow lane. Stay in it for six months. Resist the temptation to chase whatever is trending this week.
Compounding authority is invisible until it kicks in, and then it pays off on every future video you publish. Once you pick a lane, reinforce it through interlinking. Every video you publish should link to two or three of your own related videos, both in the description and in the pinned comment. Every blog companion post should naturally reference a few prior posts on the same topic. You are drawing a map for the retrieval layer, a map that says: this is a cluster of content, all about one topic, all on one channel and one domain. When a query lands anywhere inside that cluster, the model sees the whole neighborhood and weights the cluster as coherent and trustworthy. Without interlinking, each video is a lonely island. With it, each video is a node in a connected territory. Connected territories rank. Lonely islands do not.
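As a sketch, the related-video block in a description or pinned comment might look like this (titles and URLs are placeholders):

```
More on webhooks from this channel:
- How do I retry failed webhooks? https://www.youtube.com/watch?v=xxxxxxx1
- What is exponential backoff? https://www.youtube.com/watch?v=xxxxxxx2
- How do I debug webhook timeouts? https://www.youtube.com/watch?v=xxxxxxx3
```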
Here is the frustrating part nobody likes to hear. You cannot see any of this working, at least not directly.
Your YouTube Studio dashboard will not show you a row called ChatGPT referrals. Your web analytics will not tell you that a specific view came from a Perplexity answer. The traffic from AI answer engines arrives, most of the time, as direct traffic or as organic noise with no useful breakdown. This is uncomfortable for anyone used to the dashboards that classic SEO provided.
But there is no dashboard yet and there may not be one for a while. You have to accept this blindness as part of the new reality and measure differently. You measure by prompt testing and you measure by watching leading indicators.
Both are covered in the next two slides. Here is the method I use instead of a dashboard. Every week, take the three or four queries you would most like your content to answer and type each one into each of the big four answer engines.
ChatGPT, Perplexity, Google's AI Overview, and Claude. Write down what each engine cites. Note whether your content appears in any of them. Keep a simple spreadsheet: one row per query per engine per week. After four to six weeks, you will see patterns. You will see queries where you are reliably cited.
You will see queries where a competitor is reliably cited. You will see queries where neither of you gets pulled in and the engines are quoting random blogs nobody has heard of. That last bucket is your biggest opportunity: those are open gaps you can move into with focused content and proper metadata.

Here is a real prompt test result, anonymized. Same query, four engines, dramatically different citations. ChatGPT did not cite the video, but cited the official docs and a dev blog. Perplexity cited the video directly, which means the transcript and the chapter structure did their job. Google AI Overview pulled a Stack Overflow answer and the docs, which means there is probably a gap in my coverage of the error messages people actually search for. Claude cited the blog companion post, not the video, which is evidence that the JSON-LD and speakable markup on the blog is paying off. Each of these four results tells me something different and actionable: the video is doing well on Perplexity but not on ChatGPT; the blog is doing well on Claude; the Google path needs better coverage of specific error messages.
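Logged in the spreadsheet, that single anonymized result might land as four rows like these (the query wording and week are stand-ins):

```
week     | query                        | engine             | cited?     | who got cited
2025-W14 | how do webhook retries work  | ChatGPT            | no         | official docs, dev blog
2025-W14 | how do webhook retries work  | Perplexity         | yes, video | our video
2025-W14 | how do webhook retries work  | Google AI Overview | no         | Stack Overflow, docs
2025-W14 | how do webhook retries work  | Claude             | yes, blog  | our blog companion
```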
Without this test, I would know none of that.

Here is the operational cadence I would recommend, if you take only one thing from this entire video. Monday: publish the video, the blog companion, and the social thread at the same time.
Tuesday and Wednesday: go back and hand-correct the transcript, update the chapters to be literal questions, and make sure the pinned comment FAQ is in place. Thursday and Friday: reply to comments and, if the topic fits, drop one substantive Reddit comment with a backlink. Then you wait. Retrieval layers typically take two to three weeks to fully index a new piece of content and reflect it in citations. So roughly two weeks after each publish, come back, run the prompt test on the queries that video was targeting, and use the results to decide what to change in the next video. This loop is simple, but it compounds. Each pass makes your map a little denser, your topic a little more authoritative, and your next video a little more likely to be cited on day one.

If I have to compress all of this down to one sentence, it is this: the AI layer that now sits in front of search reads, it does not watch, and it rewards clarity, structure, and corroboration.
Clarity means clean transcripts and specific titles. Structure means chapters as questions and metadata that reads like an index. Corroboration means your thesis appears in at least three independent places across the web with structured data tying them together.
Measure by prompt testing, because the dashboards do not exist yet. Stay in your narrow lane for six months so topical authority can actually compound. Do those five things consistently, and in three to six months you will start seeing your content cited inside the answer a stranger gets when they ask an AI something in your territory. That is the new distribution. It is weirder, slower,