Web Scraping in the AI Era: From BeautifulSoup to LLM Extraction
Every web page is structured data wearing an HTML costume. The four moves under the hood, the three-layer stack, and what AI changed about all of it.
Watch (22:59)
Full transcript (from the video)
I want to open with a claim that sounds a little strange, but turns out to be true. Every webpage you have ever read is structured data wearing an HTML costume. A product page is a row in a spreadsheet. A news article is a record with a title, a date, an author, and a body. A forum thread is a parent row with a bunch of child rows hanging off it.
Web scraping, at its core, is the discipline of taking that costume off and getting the spreadsheet back. It is not a new skill. People have been doing it since the web had pages worth scraping, which is to say since the mid-1990s. What makes it worth teaching again right now is that AI has made the skill urgent in a way it was not even three years ago. Every large model needs text.
Every answer engine needs fresh pages, and every agent you build wants to read the live web. Scraping is back in the center of the conversation. Let me make the mechanics concrete before we go anywhere clever. Every scraper in the world, from a five-line Python script to a billion-dollar data pipeline, makes the same four moves in the same order. The opening move is to fetch.
You send an HTTP GET request to a URL and wait for the response, exactly the same way your browser does when you click a link. Next comes parsing. The response returns as raw HTML, JSON, or sometimes XML, and the scraper walks through that blob and picks out the fields it cares about. Then you shape the result. Extracted fields get assembled into a clean record, usually a dictionary or a row that matches whatever schema you decided on up-front.
The last move is storage. Your record goes into a CSV, a SQLite file, a Postgres table, or a cloud data warehouse, depending on how serious your project is. Fetch, parse, shape, store. That is the entire discipline. Everything else we talk about today is a wrinkle on top of those four.
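The four moves can be sketched end to end with nothing but the Python standard library. This is a minimal illustration, not a production pipeline: the choice of `<h2>` as the field to extract, the two-column schema, and the function names are all assumptions made for the example.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url: str) -> str:
    """Move 1: send an HTTP GET and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Move 2: walk the HTML and collect the text inside every <h2>."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def parse(html: str) -> list[str]:
    parser = TitleParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


def shape(titles: list[str], source_url: str) -> list[dict]:
    """Move 3: assemble the fields into records that match a fixed schema."""
    return [{"source": source_url, "title": title} for title in titles if title]


def store(records: list[dict], out) -> None:
    """Move 4: persist the records, here as CSV written to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=["source", "title"])
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    # Offline demo: skip fetch() and feed the parser an inline page instead.
    sample = "<h2>First post</h2><p>body</p><h2> Second post </h2>"
    buffer = io.StringIO()
    store(shape(parse(sample), "https://example.com/"), buffer)
    print(buffer.getvalue())
```

Swapping `store` for a SQLite or Postgres writer changes only the last function, which is the point of keeping the four moves separate.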
Scraping always had stakes. Price monitoring, market research, academic projects. But the stakes changed dramatically when AI entered the picture, and they changed in three different directions at once. The first shift is on the training side. Every large language model learns from a huge slice of the open web during training, often measured in the trillions of tokens.
And those tokens come from a crawl that scrapes the internet at planetary scale. Retrieval is the next one. Answer engines like Perplexity, ChatGPT, and Google AI Overviews keep crawling the web constantly, so their answers reflect what is on the internet today, not what it looked like last year. Applications are the third. Developers building agents, research tools, and dashboards want those tools to read the live web, which means a scraper sits somewhere in the stack.
Meanwhile, publishers and site operators are watching all three trends at once and deciding they would like to push back. That push and pull is the environment every new scraper lands in. When you are learning this, the single most useful mental model is to think of scraping in three stacked layers. At the bottom is an HTTP client. All it does is send requests and receive responses.
Python has requests, httpx, and aiohttp. Command-line tools like curl and wget live here, too. The middle layer adds a parser. Once you have the raw HTML, you need a way to walk the tree and pull out the fields you want. Beautiful Soup is the classic choice in Python.
lxml is an alternative, and in JavaScript, you would reach for Cheerio. The top layer is a full headless browser. Instead of reading raw HTML, you launch an invisible copy of Chrome or Firefox and let it execute the page the way a real user's browser would. Playwright and Puppeteer are the modern defaults. The rule I teach every beginner is this.
Always reach for the shallowest layer that works. A plain HTTP call is faster, cheaper, and more reliable than a browser. Only climb the stack when the page forces you to. This is the combination that I tell every beginner to learn first. And honestly, it is the combination I still reach for first myself.
Requests sends an HTTP GET to the URL and hands back a response object. Beautiful Soup takes the response text and parses it into a tree structure. You can navigate using CSS selectors, exactly the same selectors you would use in your browser's dev tools. The example on screen walks through a list of article cards, pulls the title out of the h2, reads the ISO date from a time tag's datetime attribute, and grabs the link. It is 30 seconds to write.
It runs in a teeny tiny fraction of a second, and it will work unchanged on the vast majority of blog-style pages on the open web. For simple directory pages, listings, archives, article feeds, public documentation, and government open data portals, this is all you need. Do not underestimate how far this small toolkit goes. A lot of professional pipelines are just this pattern, run in parallel with retries and a database at the end.
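The on-screen example is not reproduced in the transcript, so here is a sketch of what such a script might look like. The `article.card` selector and the function names are assumptions; a real page will need its own selectors, found with the browser's dev tools.

```python
import requests
from bs4 import BeautifulSoup


def extract_cards(html: str) -> list[dict]:
    """Pull title, ISO date, and link out of each article card."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # "article.card" is a hypothetical selector for one card on the page.
    for card in soup.select("article.card"):
        title = card.select_one("h2")
        stamp = card.select_one("time[datetime]")
        link = card.select_one("a[href]")
        records.append({
            "title": title.get_text(strip=True) if title else None,
            "date": stamp["datetime"] if stamp else None,
            "url": link["href"] if link else None,
        })
    return records


def scrape(url: str) -> list[dict]:
    """Fetch the page with requests, then hand the text to the parser."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_cards(response.text)
```

Keeping `extract_cards` separate from the network call means the parsing logic can be exercised against saved HTML without touching the site.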
Eventually, you will run your first requests plus Beautiful Soup script against a web app and get back an almost empty HTML file. The body contains a single div with an ID of root and basically nothing else. That is your signal that the page is a single-page app. The content you see when you visit the site in a real browser is not in the initial HTML. It is fetched and rendered by JavaScript after the page loads.
A raw HTTP client cannot see that content. At that point, you climb to the third layer. Playwright, which I use most, launches a real Chromium browser under your control, loads the page, lets the JavaScript finish running, and then lets you query the fully rendered DOM. The trade-off is that it is 10 to 100 times slower than a plain HTTP call, uses way more memory, and is much easier to get blocked on. But for a certain class of pages, it is the only thing that works.
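One way to decide when to climb is to measure how much visible text the raw HTML actually contains. The sketch below is a heuristic, not a standard: the 200-character threshold and the function names are assumptions, and the Playwright import is kept inside its function so the cheap check runs without a browser installed.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def looks_like_spa_shell(html: str, min_chars: int = 200) -> bool:
    """True when the raw HTML carries almost no readable text of its own."""
    parser = _VisibleText()
    parser.feed(html)
    visible = " ".join(" ".join(parser.chunks).split())
    return len(visible) < min_chars


def render_with_browser(url: str) -> str:
    """Third layer: load the page in headless Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # optional, heavy dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```

A scraper would call the plain HTTP client first and reach for `render_with_browser` only when `looks_like_spa_shell` returns true for the response.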
Use it when you must, not when you can. This is where the world gets interesting. Today, sites do not just check your user agent string and call it a day. Bot detection now runs at several layers at once. At the network layer, servers can fingerprint your TLS handshake.
A Python script using the requests library has a distinctly different TLS signature than real Chrome, and services like Cloudflare can spot that in milliseconds. At the browser layer, detection scripts measure how your browser renders a hidden canvas, how your WebGL reports its driver, and how quickly your JavaScript engine responds to small asynchronous tasks. Real browsers have giveaways, but so do headless ones, and the defenders know the difference. At the behavior layer, systems watch your mouse path, your scroll velocity, how long you dwell on each field, and whether your clicks have human-like jitter. Any one of these signals can be off.
When enough of them are off, you get a challenge, a CAPTCHA, a slow drip of errors, or a clean block. Beginners are usually surprised by how quickly a naive scraper gets caught. This is why. The defensive stack is not a single product. It is a stack of commercial services and open-source projects, each tuned for different kinds of sites.
Cloudflare is the one you will meet first and most often. Their Turnstile challenge and Bot Fight Mode are now turned on by default for a huge share of the web, and any scraping tutorial written before 2024 is probably out of date specifically because of Cloudflare. DataDome and Kasada sit in front of most big retail, travel, and ticketing sites. PerimeterX now protects a lot of financial and enterprise portals. And on the open-source side, Anubis, a proof-of-work gate that forces your client to burn a little CPU before it serves a page, has taken off specifically because AI crawlers were hammering open forges, wikis, and indie servers.
Anubis and its cousins are the reason many small self-hosted sites suddenly feel slower. The arms race is real. It is funded on both sides, and it is the backdrop for every decision you make as a scraper. Before we talk legality, I want to make sure the three documents you are going to meet are clearly separated. robots.txt is a small plain text file at the root of a site that says which paths are allowed for which crawlers.
It is a social contract, not a technical gate. A rude scraper can ignore it entirely, and the server will still serve the page. Ignoring it is a bad idea for every other reason, though, because the site's terms of service usually reference robots.txt directly, and the TOS is a real legal document. The newer development is AI robots.txt, which is the same file format with new user agents carved out for AI crawlers specifically. Names you will see include GPTBot, ClaudeBot, CCBot, and a dozen others.
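Python ships a robots.txt parser in the standard library, so checking the file takes only a few lines. The rules and the agent name below are made up for illustration; in a real scraper you would fetch the site's own file with `set_url` and `read` rather than parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: AI crawler blocked outright, everyone else throttled.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

AGENT = "my-learning-scraper"  # hypothetical user agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = load_rules(ROBOTS_TXT)

# Ask before every fetch: is this path allowed for this agent?
blog_allowed = rules.can_fetch(AGENT, "https://example.com/blog/")
private_allowed = rules.can_fetch(AGENT, "https://example.com/private/report")
gptbot_allowed = rules.can_fetch("GPTBot", "https://example.com/blog/")

# Crawl-delay is returned in seconds, or None when the file does not set one.
delay_seconds = rules.crawl_delay(AGENT)
```

Nothing here stops a scraper from fetching a disallowed path. The parser only reports what the site asked for, which is exactly the social contract described above.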
Alongside that, llms.txt is a newer and still-evolving spec, independent of robots.txt, where publishers write a clean Markdown index of their content specifically so that language models have a good summary to read. If you are building a scraper today, check all three. I'm going to give you a short summary, which is not legal advice, of where the law sits in the United States, because people get scared off scraping entirely by stories they read online, and the real picture is more nuanced. The headline case is hiQ Labs versus LinkedIn, which worked its way through the courts for years, and ultimately established that scraping publicly available data is not by itself a violation of the Computer Fraud and Abuse Act. That is important.
It means simply reading a public page with a script is not a crime. Copyright, however, is a completely separate body of law, and it very much still applies. The text and images you download are owned by their creators. Using them for personal research is one thing. Republishing them, reselling them, or training a commercial product on them without permission is very much another thing.
The line I give students is this: A personal learning project is almost always fine. A commercial pipeline deserves a conversation with a lawyer. Here is the move that changed everything for teaching scraping. You can hand a full HTML page to a modern language model as a prompt and ask it in plain English for the fields you want. The model reads around broken markup, deeply nested divs, weird class names, and inconsistent formatting, and it hands you back a clean JSON object with exactly the keys you ask for.
This is wonderful for one-off jobs, for the long tail of sites you only scrape occasionally, and for sites whose layout changes often enough that writing a traditional selector is a maintenance trap. The flip side is cost and speed. A single extraction is pennies and a second or two. A thousand extractions a minute is tens of dollars an hour and often hits rate limits. So the LLM extraction path is not a replacement for the classic parser path.
It is a different tool for a different situation, and a mature pipeline will typically use both. The next slide shows how to make LLM extraction reliable. When I teach LLM-based extraction, the first upgrade I show is always schema-first design. Instead of asking the model to produce JSON in free form, and then hoping it matches what your database expects, you declare a Pydantic model up front. Title is a string, price is a float, in stock is a boolean, SKU is an optional string.
Then you pass that model as a structured output format on the API call. The modern Anthropic and OpenAI SDKs both support this directly. The library enforces the schema at parse time, which means any hallucinated field, missing field, or wrong type blows up loudly at the exact moment of extraction before bad data reaches your table. The Pydantic model becomes the single contract between your scraper, your LLM, your validation step, and your storage layer. Beginners love this pattern because it turns a fuzzy task, reading a messy page, into a sharp task, filling in a typed struct. And the model is better at sharp tasks than fuzzy ones.
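A sketch of the schema described above, assuming Pydantic v2. The API call itself is left out because its exact shape depends on the SDK and version you use. The block only shows the validation step applied to whatever JSON string the model returns, and `parse_product` is a name invented for the example.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class Product(BaseModel):
    # Reject keys the schema does not declare, so hallucinated fields fail loudly.
    model_config = ConfigDict(extra="forbid")

    title: str
    price: float
    in_stock: bool
    sku: Optional[str] = None


def parse_product(raw_json: str) -> Product:
    """Validate the model's JSON reply; raises ValidationError on any mismatch."""
    return Product.model_validate_json(raw_json)


if __name__ == "__main__":
    try:
        # A missing price should never reach the database.
        parse_product('{"title": "Kettle", "in_stock": true}')
    except ValidationError as error:
        print(f"rejected: {error.error_count()} problem(s)")
```

Without `extra="forbid"`, Pydantic ignores undeclared keys by default, so an invented field would pass silently. The stricter setting is what makes the hallucinated-field case blow up.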
When you look at real production scraping systems in 2026, almost none of them are pure LLM, and almost none of them are pure selectors. They are hybrid. The cheap deterministic parser runs first. It is fast. It is close to free, and when it works, it works perfectly.
You validate the output against your Pydantic schema. If the validation passes, you are done. Ship it. If the validation fails, typically because the site changed its class names or added a new wrapper div, you fall back to the LLM extraction path. That path is slower and more expensive, but it tends to succeed where the selector failed because the model is reading the content, not the markup.
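The control flow of that hybrid can be sketched in a few lines. Everything named here is an assumption for illustration: `is_valid` is a hand-rolled stand-in for the Pydantic check, and the two extractors are passed in as plain callables so the LLM path can be any client you like.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("scraper.fallback")

Record = dict
Extractor = Callable[[str], Optional[Record]]


def is_valid(record: Optional[Record]) -> bool:
    """Stand-in for schema validation: required keys present with the right types."""
    return (
        isinstance(record, dict)
        and isinstance(record.get("title"), str)
        and record["title"] != ""
        and isinstance(record.get("price"), float)
    )


def extract_hybrid(url: str, html: str, cheap: Extractor, expensive: Extractor) -> Optional[Record]:
    """Run the selector path first; pay for the LLM path only when it fails."""
    try:
        record = cheap(html)
    except Exception:
        # A selector that raises counts as a failed cheap path, not a crash.
        record = None
    if is_valid(record):
        return record

    # Every fallback is logged: it marks a selector that needs repair.
    logger.warning("selector path failed, falling back to LLM: %s", url)
    record = expensive(html)
    return record if is_valid(record) else None
```

Counting the warnings per site over a week gives a rough version of the prioritized maintenance list.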
And critically, you log every fallback because a fallback is a signal that your selector needs attention. Over time, the log becomes a prioritized maintenance list. You pay for intelligence only when the cheap path breaks, and you get a free feedback loop that tells you which sites to repair next. If you have tried to run a headless browser at scale yourself, you know how quickly the ugly parts pile up. Chrome updates, proxies fail, CAPTCHAs arrive, fingerprints drift.
A whole category of managed services has grown up specifically to absorb that pain, and they are worth knowing even if you do not use them yet. Firecrawl takes a URL, renders the page, parses out the content, and hands you back clean Markdown ready for an LLM. It is my default for any research-style ingestion. Browserbase rents out stealth headless browsers in the cloud with good fingerprints, good proxies, and a Playwright-compatible API. You keep your scraper code, they keep you unblocked.
Apify hosts full scraper workers with scheduling, queuing, and data storage baked in, and has a marketplace of pre-built scrapers for common sites. None of these are free, but they all trade money for your time, and for many projects that is an excellent trade. Know they exist so you can reach for them when your own stack is becoming a second job. There is a newer pattern I want to be honest about because it is getting hyped hard and oversold. The LLM browser agent.
You give a model a task in plain English plus browser tools, and it clicks around a site the way a person would, reading pages, deciding where to go next, and extracting information as it finds it. Browser Use, Stagehand, and the various agent frameworks all land in this category. These agents are genuinely impressive, and they are the right tool for a specific shape of problem. Research workflows where the task is open-ended and you cannot write a selector in advance are a great fit. Cross-site comparisons where the model needs to judge what counts as equivalent information are another good fit.
But for the kinds of jobs that scraping pipelines have traditionally served, pulling a thousand product rows a day from a catalog, an agent is almost always the wrong tool. It is 10 times the cost, a hundred times the latency, and much harder to make reliable. Match the tool to the shape of the work. Here is the habit I wish more people learned before they wrote their first scraper. Check whether the site offers an official pipe before you build an unofficial one.
Almost every real site on the web publishes a sitemap at /sitemap.xml, which lists every page the site wants indexed, usually with last modified timestamps. For a news site or a blog, a sitemap plus a few HTTP calls gets you the same content as a scraper at 1/10 the effort. Most sites also publish an RSS or Atom feed, which is a purpose-built update stream specifically designed for this use case. A surprising number of sites have a public JSON API sitting right next to the user-facing pages, sometimes documented, sometimes discoverable by watching your browser's network tab.
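Reading a sitemap needs nothing beyond the standard library's XML parser. This sketch handles a plain `urlset` document only; sitemap index files that point to other sitemaps are out of scope, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return each listed URL with its last-modified stamp, if one is given."""
    root = ET.fromstring(xml_text)
    pages = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            pages.append({"url": loc, "lastmod": lastmod.strip() if lastmod else None})
    return pages
```

Sorting the result by `lastmod` and fetching only what changed since the last run is often all a blog or news ingestion job needs.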
When any of these official pipes exists, use it. It is faster, more stable, and politically safer. Scraping is the fallback, not the default. Train yourself to check the pipes first, then write the scraper only when nothing else works. The longest-running scrapers I have ever seen were not the cleverest ones.
They were the politest ones. Pace your requests. If the site's crawl delay is 10 seconds, obey it. If there is no crawl delay, default to one request per second per host, which is the historical standard. Identify yourself honestly.
Put a real project name, a version number, and a contact email in your user agent. Site operators overwhelmingly choose not to block scrapers they can reach out to because a scraper with a name is less likely to surprise them. Cache aggressively. When you fetch a page, store its ETag header and its Last-Modified header. And on the next pass, send those back as If-None-Match and If-Modified-Since.
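The three habits, pacing, honest identification, and conditional requests, fit in one small helper. This is a sketch under assumptions: the user agent string is a placeholder, the cache is a plain dict, and `session` is expected to behave like a `requests.Session`. When the server answers 304 Not Modified, the helper serves the cached body instead of downloading the page again.

```python
import time
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical identity: project name, version, and a way to reach you.
USER_AGENT = "my-learning-scraper/0.1 (+mailto:you@example.com)"


class HostThrottle:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: dict[str, float] = {}

    def wait(self, url: str) -> float:
        host = urlsplit(url).netloc
        now = self._clock()
        delay = 0.0
        if host in self._last:
            delay = max(0.0, self._last[host] + self.min_interval - now)
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


def conditional_headers(cached: Optional[dict]) -> dict:
    """Send back the validators the server gave us on the previous pass."""
    headers = {"User-Agent": USER_AGENT}
    if cached:
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def polite_get(session, url: str, cache: dict, throttle: HostThrottle) -> str:
    throttle.wait(url)
    cached = cache.get(url)
    response = session.get(url, headers=conditional_headers(cached), timeout=10)
    if response.status_code == 304 and cached:
        return cached["body"]  # unchanged since last pass, nothing downloaded
    response.raise_for_status()
    cache[url] = {
        "etag": response.headers.get("ETag"),
        "last_modified": response.headers.get("Last-Modified"),
        "body": response.text,
    }
    return response.text
```

To honor a site's crawl delay, construct the throttle with that value, for example `HostThrottle(min_interval=10)`, and keep the one-second default otherwise.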
If the page has not changed, the server returns a 304 with no body, which costs almost nothing and signals respect to the operator. Clever tricks get you blocked. Politeness gets you into a long-term relationship with a site, which is much more valuable if you plan to scrape it for months. The thing nobody warns you about when you are starting is that scrapers decay. You write a beautiful extractor.
You ship it. It runs perfectly for 2 weeks. And then, on a quiet Tuesday, the site's front-end team refactors the product card component, renames a class from product_price to p_current, and your scraper silently starts returning None for every price. The scraper does not crash. It just gets quieter and wronger until someone notices.
The only defense is to measure. Pick a small set of known good pages, freeze them as fixtures, and run your extractor against them on a schedule. Track coverage, which means the percentage of expected fields that came back non-empty. When coverage drops below a threshold, say 95%, alert yourself. Separately, budget about 1 hour per site per month for selector drift repair, because that is roughly the real rate.
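Coverage as defined here is a simple ratio, so the check is short. The field names, the fixture layout, and the function names below are assumptions; the 95% threshold is the one suggested above.

```python
from typing import Callable

EXPECTED_FIELDS = ("title", "date", "price")  # hypothetical schema
ALERT_THRESHOLD = 0.95


def coverage(records: list[dict], expected=EXPECTED_FIELDS) -> float:
    """Fraction of expected fields that came back non-empty across all records."""
    total = len(records) * len(expected)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in expected
        if record.get(field) not in (None, "", [])
    )
    return filled / total


def check_fixtures(fixtures: dict[str, str], extract: Callable[[str], dict]) -> tuple[float, bool]:
    """Run the extractor over frozen pages; return (coverage, should_alert)."""
    records = [extract(html) for html in fixtures.values()]
    score = coverage(records)
    return score, score < ALERT_THRESHOLD
```

Running `check_fixtures` on a schedule against saved pages catches a broken extractor. Running it against freshly fetched copies of the same URLs is what catches the site changing underneath you.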
Scraping is not a one-time project. It is a garden. If you do not weed it, it disappears. If you are brand new to all of this and you want a concrete next step, here is the starter project I give to every learner. Pick a single site you already care about, a blog, a forum, a release notes page, a Hacker News-style aggregator, anything you look at every week anyway.
For your first week of practice, write a requests plus Beautiful Soup script that pulls the titles and dates into a CSV. Get it working once, then the following week put that script on a daily cron and commit the CSV output to a Git repo. Watch the diffs. The first time your scraper breaks because the site changed, debug it. That is the real lesson of scraping and you cannot skip it.
Midway through the month, upgrade to Playwright and run the script again against a JavaScript-heavy site of your choice. By week four, add an LLM extraction fallback using Pydantic and log every time the selector path fails. After 30 days of this, you will know more practical scraping than 90% of the people who claim the skill on a resume and you will have the scars to prove it. If I had to compress the whole video into a single card, this is what would be on it. Scraping is structured data recovery, not hacking.
You are turning pages back into the spreadsheet they already are and that framing will keep you out of both legal and ethical trouble 99 times out of a hundred. AI changed the stakes on both sides at once. Offenders got better tools, LLM extraction and agent browsers. Defenders got better detection, fingerprinting and proof-of-work gates. You are standing in that push and pull every time you write a scraper and knowing which side is winning on your target site is part of the job.
Match the tool to the page. Shallow stack first. Browser only when you must. LLM only when the deterministic path fails. And finally, be the scraper the operators do not feel the need to block.
Identify yourself, cache your requests, respect the rate limit, and read the robots.txt.