Teach AI To Read UFO Files (LangGraph + RAG)

Episode 1: the architecture of a six-stage RAG pipeline plus a GraphRAG layer in LangChain + LangGraph, built to read 4 GB of declassified UFO files on a single workstation.

9 min read

Watch (13:39)


Full transcript (from the video)

Wait, you can just read these now? The Department of War just published 4 GB of declassified UFO files on a public website. No paywall, no FOIA request, just sitting online. FBI flying disc reports, Navy sensor stills, a composite sketch from 2024, and Apollo-era material: 115 PDFs, 138 images, and 28 videos. Nobody is going to read all of it.

I am not going to read all of it. So, instead, I built an AI to read it for me, and today I want to show you how that AI is wired together. Here we go. The US Department of War just dropped 4 GB of declassified UFO files onto a public website: 115 PDFs, 138 images, 28 videos.

Reading all of that cover to cover would take a month, and I would still forget what was in the first PDF by the time I got to the last. So, I did the obvious thing. I built a LangGraph agent to read it for me. And today I'm going to walk you through exactly how it's wired together. This is episode one of the series, the architecture dive.

Every box, every wire, every design choice that goes into turning a pile of typewriter-era scans into something you can ask questions of. Episode two is the hands-on build where we actually run the thing end to end. Let's start with what's in the box. Before we can talk architecture, I want you to see the corpus the way the pipeline sees it. The release sits behind a Vue front end and an Akamai gate, but underneath is a flat manifest CSV pointing at 115 PDFs, 138 images, and 28 videos.

About half of the PDFs are typewriter-era scans with no text layer. The other half are clean digital documents. The images are a mix of FBI sensor stills and slideshow grabs. The videos are short clips, most under 30 seconds. The whole pipeline lives under the pipelines folder, organized as six numbered stages, plus an optional GraphRAG stack in its own subdirectory.

Every stage walks every release directory automatically, so when the next tranche drops, it is a rerun, not a rewire. The stages are numbered in the order they run, and each one writes a JSONL file the next stage can pick up. Stage 01 extracts the text layer with PyMuPDF and flags any page that came up empty as needing OCR. Stage 02 runs Tesseract on those flagged pages. Stage 03 takes everything and chunks it, embeds the chunks with the BGE small model, and writes the result into a Chroma vector store.
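The stage handoff described above can be sketched in a few lines. This is a minimal, standard-library illustration, not the repo's code: the page text is passed in as plain strings instead of being read with PyMuPDF, and the record fields (`file`, `page`, `text`, `needs_ocr`) and function names are assumptions made for the example.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def extract_records(doc_name: str, page_texts: list[str]) -> list[dict]:
    """Turn per-page text into stage-01-style records, flagging empty pages."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        cleaned = text.strip()
        records.append(
            {
                "file": doc_name,
                "page": page_number,
                "text": cleaned,
                # A page with no extractable text is handed to the OCR stage.
                "needs_ocr": cleaned == "",
            }
        )
    return records


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line so the next stage can stream it."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    with TemporaryDirectory() as tmp:
        out = Path(tmp) / "01_extract.jsonl"  # hypothetical file name
        write_jsonl(out, extract_records("memo.pdf", ["Typed text", "", "  "]))
        # The OCR stage would only look at the flagged pages.
        print([r["page"] for r in read_jsonl(out) if r["needs_ocr"]])
```

Because every stage reads and writes the same line-per-record shape, a later stage only needs the previous stage's file, which is what makes a rerun on a new tranche cheap.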

Stage 04 wires up hybrid retrieval. Stage 05 wraps the retriever in a LangGraph agent that produces cited answers, and stage 06 is the optional multimodal stage: Whisper transcribes the videos, Claude Vision captions the images, and both outputs feed back into stage 03 so they get indexed alongside the PDFs. Hybrid retrieval is the single most asked production RAG question, so it gets its own slide. A pure dense retriever is bad at exact strings.

Ask for FBI file 62-HQ-83894, and the embeddings shrug because that string was never in any pre-training set. BM25 catches that case immediately because case IDs and proper nouns are exactly what term frequency was invented for. Pure BM25, on the other hand, can't paraphrase. Ask, "What did the FBI say about disk-shaped craft?" and you'll miss every page that wrote "saucer." So you ensemble them 50/50 by default, and then a cross-encoder re-ranker reorders the top results so the actually relevant chunks come first.
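One common way to ensemble a keyword ranking and a dense ranking is weighted reciprocal rank fusion, which is also what LangChain's EnsembleRetriever documents. Whether the repo uses that class is an assumption; the sketch below only shows the fusion step on hypothetical chunk IDs, with the 50/50 weights from the description above and the conventional constant k = 60.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of chunk IDs with weighted reciprocal rank fusion.

    Each chunk earns weight / (k + rank) for every list it appears in,
    so a chunk found by both retrievers outranks one found by only one.
    """
    if len(rankings) != len(weights):
        raise ValueError("need one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for one query (IDs are made up for illustration).
bm25_hits = ["fbi-62-p3", "fbi-62-p1", "navy-07-p2"]   # exact-string matches
dense_hits = ["fbi-19-p4", "fbi-62-p3", "navy-07-p2"]  # paraphrase matches

fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5])
print(fused)
```

Chunks that both retrievers agree on float to the top, and a cross-encoder re-ranker would then rescore only this short fused list.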

On top of all that, every chunk carries metadata (agency, incident date, file name), so you can filter before retrieval ever runs. That's the whole production stack in one paragraph. Three details on stage three matter more than they look. First, a chunk size of 1,000 characters with 150 overlap is the safe default for typed documents: large enough to carry one coherent paragraph, small enough that retrieval stays precise.

Second, BGE small is the right embedding for a project that has to run anywhere: about 120 MB, no GPU required, and competitive with much larger models on this kind of structured retrieval. Third, the chunk IDs are deterministic, which means rerunning stage three doesn't duplicate anything in Chroma, and you can ingest a new tranche without a full rebuild. The metadata fields (agency, incident date, file) are the secret weapon because they let stage four do filtered retrieval instead of brute-force search. A naive RAG chain is one-shot: prompt, retrieve, synthesize.
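The chunking and deterministic-ID details from stage three can be sketched together. This is an illustration under stated assumptions, not the repo's implementation: the splitter is a plain sliding character window (the real stage may split on paragraph boundaries), the ID recipe (a SHA-256 of file, page, index, and text) is a guess at one workable scheme, and a dict stands in for the Chroma collection.

```python
import hashlib


def chunk_text(text, size=1000, overlap=150):
    """Split text into character windows that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def chunk_id(file_name, page, index, text):
    """Derive a stable ID from where the chunk came from and what it says."""
    key = f"{file_name}|{page}|{index}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def build_chunks(file_name, page, text, metadata):
    """Chunk one page and attach a stable ID plus filterable metadata."""
    return [
        {
            "id": chunk_id(file_name, page, index, piece),
            "text": piece,
            "metadata": {**metadata, "file": file_name, "page": page},
        }
        for index, piece in enumerate(chunk_text(text))
    ]


def upsert(store, chunks):
    """Insert-or-overwrite by ID; a dict stands in for the vector store."""
    for chunk in chunks:
        store[chunk["id"]] = chunk
```

Running `upsert` twice with the same page leaves the store the same size, which is the property that lets a rerun of the stage skip a full rebuild.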

A one-shot chain falls over the moment a query asks for two different filters, like "compare the 1947 FBI sightings to the 2024 Navy encounters." LangGraph lets you model the answer flow as a state machine. There are three nodes. The classify node looks at the question and produces a query plan: an intent label and any metadata filters it could extract. The retrieve node calls the hybrid retriever with that filter applied.

The synthesize node writes the final answer and inserts inline citations. The state dict is just a TypedDict that every node updates by returning a partial update. The reason to model it this way isn't the current three nodes; it's that adding a critique step that loops back to retrieve, or a multi-hop step that does two retrievals in sequence, becomes a one-line edge change. The classify node is the cleanest demonstration of why structured output exists. The schema has three fields: an intent label, an optional file filter, and an optional agency filter.
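The three-node flow and the "return a partial update" convention can be shown without the library. In LangGraph itself these functions would be registered on a StateGraph with add_node and add_edge; the loop below is only a plain-Python stand-in, and the state fields, the two-item corpus, and the keyword-based classify step (a placeholder for the LLM call) are all assumptions for illustration.

```python
from typing import TypedDict

# Tiny stand-in corpus; real chunks would come from the vector store.
CORPUS = [
    {"agency": "FBI", "text": "1947 flying disc report"},
    {"agency": "Navy", "text": "2024 sensor still"},
]


class AgentState(TypedDict, total=False):
    question: str
    intent: str
    filters: dict
    chunks: list
    answer: str


def classify(state):
    """Placeholder for the LLM: filter only if exactly one agency is named."""
    agencies = [a for a in ("FBI", "Navy") if a in state["question"]]
    filters = {"agency": agencies[0]} if len(agencies) == 1 else {}
    return {"intent": "lookup", "filters": filters}


def retrieve(state):
    """Apply the metadata filter before returning chunks."""
    wanted = state["filters"]
    hits = [c for c in CORPUS if all(c.get(k) == v for k, v in wanted.items())]
    return {"chunks": hits}


def synthesize(state):
    """Write an answer with numbered inline citations, or admit a miss."""
    if not state["chunks"]:
        return {"answer": "The sources do not answer this question."}
    cited = "; ".join(
        f"{c['text']} [{i}]" for i, c in enumerate(state["chunks"], start=1)
    )
    return {"answer": cited}


PIPELINE = [classify, retrieve, synthesize]


def run(question):
    """Merge each node's partial update into the shared state, in order."""
    state: AgentState = {"question": question}
    for node in PIPELINE:
        state = {**state, **node(state)}
    return state
```

Since each node only returns the keys it changed, adding a critique or second-retrieval node means adding one function and changing the order of execution, not rewriting the others.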

The LLM is told to return JSON matching the classify schema and to leave the filters null unless the question explicitly names one. So, if you ask, "What does file 62-HQ-83894 say about propulsion?" the model fills in the file filter and the retriever pre-filters the corpus down to that one document. If you ask a vague question like, "Summarize the 1947 incidents," it leaves the filters null and the retriever searches the whole corpus. The intent label is currently used only for logging, but it's the obvious hook for a router that picks different retrieval strategies per intent later. Citation handling is one of those things that looks trivial and is actually the difference between a useful RAG and a confidently wrong one.
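A three-field plan like the one the classify node returns might be validated along these lines. The field names (`intent`, `file_filter`, `agency_filter`), the metadata keys, and the sample file name are hypothetical; the real node presumably gets this through LangChain's structured output support instead of hand-parsing JSON.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_FIELDS = {"intent", "file_filter", "agency_filter"}


@dataclass(frozen=True)
class QueryPlan:
    intent: str
    file_filter: Optional[str] = None
    agency_filter: Optional[str] = None

    def to_filter(self) -> dict:
        """Keep only the filters the model actually filled in."""
        pairs = {"file": self.file_filter, "agency": self.agency_filter}
        return {key: value for key, value in pairs.items() if value is not None}


def parse_plan(raw_json: str) -> QueryPlan:
    """Validate the model's JSON against the three-field schema."""
    data = json.loads(raw_json)
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    if not isinstance(data.get("intent"), str):
        raise ValueError("intent must be a string")
    return QueryPlan(
        intent=data["intent"],
        file_filter=data.get("file_filter"),
        agency_filter=data.get("agency_filter"),
    )


# A question that names a file fills one filter; a vague one fills none.
named = parse_plan(
    '{"intent": "lookup", "file_filter": "example_file.pdf", "agency_filter": null}'
)
vague = parse_plan('{"intent": "summary", "file_filter": null, "agency_filter": null}')
print(named.to_filter(), vague.to_filter())
```

An empty filter dict is the signal to search the whole corpus, so null filters never have to be special-cased by the retriever.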

The citation pattern here is: number every retrieved chunk in the prompt itself, force the model to cite by number, and post-process the output to map those numbers back to file and page. The model never invents a citation because it never sees the original file path. It sees a number, and the driver knows which document that number belongs to. The synthesize prompt is also explicitly defensive: use only the provided sources, and if the sources don't answer the question, say so.
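The number-then-resolve citation pattern is small enough to sketch end to end. The chunk fields and function names are assumptions for the example, and the model's answer is a hard-coded string so the code runs without an LLM.

```python
import re


def build_context(chunks):
    """Number each chunk; the model sees only the number and the text."""
    return "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))


def resolve_citations(answer, chunks):
    """Map every [n] in the answer back to (file, page); ignore unknown numbers."""
    cited = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if 1 <= number <= len(chunks):
            source = (chunks[number - 1]["file"], chunks[number - 1]["page"])
            if source not in cited:
                cited.append(source)
    return cited


chunks = [
    {"file": "a.pdf", "page": 3, "text": "Disc seen over base."},
    {"file": "b.pdf", "page": 1, "text": "No radar contact."},
]
context = build_context(chunks)

# Stand-in for the model's reply; [7] is a number that was never offered.
answer = "A disc was seen [1], without radar contact [2][1]. Unverified [7]."
print(resolve_citations(answer, chunks))
```

The file paths never enter the prompt, so the worst the model can do is cite a number that does not exist, and the resolver simply drops it.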

That "say so" clause is what stops the model from quietly making things up when retrieval misses. It's the smallest prompt change with the largest reliability payoff. Plain RAG handles "what does file X say" perfectly. It chokes on questions like "compare 1947 sightings to 2024 sightings" or "show all FBI cases with named witnesses" because those are inherently structured. The GraphRAG layer is the answer.

Stage three of the GraphRAG subdirectory uses Claude to extract entities and incident records from every chunk. Stage four runs the Leiden community detection algorithm: it clusters related entities into themes, then writes a one-paragraph summary per cluster. Stage five loads everything into FalkorDB: chunks, frames, entities, incidents.

FalkorDB is a Redis-protocol database that holds both graph structure and vector indexes in a single process, so one Cypher query can hop entities, filter by date, and rank by vector similarity. The graph agent is itself a LangGraph state machine, but its router has three retrieval branches instead of one. Structured questions, meaning anything that names entities, dates, or relations, get routed to text-to-Cypher. Claude writes a Cypher query.

FalkorDB runs it. The rows come back as the answer's source list. Semantic questions go through the vector index, which now spans both text chunks and CLIP-embedded video frames. So "show me grainy black-and-white footage of a disk near a hangar" actually returns video frames that look like that. Global questions, like "what are the themes," route to the community summaries because those are precomputed paragraphs about clusters of related entities. And the hybrid mode runs all three and merges, which is what you want for cross-cutting questions that span multiple retrieval strategies.
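The routing just described, three branches plus a hybrid mode that merges them, can be sketched as a small dispatcher. The mode labels, the stub retrievers, and their made-up result IDs are assumptions; how the real router picks a mode (presumably an LLM classification step) is not shown.

```python
BRANCH_ORDER = ("structured", "semantic", "global")


def answer_sources(mode, question, branches):
    """Run one retrieval branch, or all three merged when mode is 'hybrid'."""
    if mode == "hybrid":
        merged = []
        for name in BRANCH_ORDER:
            for item in branches[name](question):
                if item not in merged:  # drop duplicates, keep first-seen order
                    merged.append(item)
        return merged
    if mode not in branches:
        raise ValueError(f"unknown mode: {mode}")
    return branches[mode](question)


# Stub branches standing in for text-to-Cypher, vector search, and summaries.
branches = {
    "structured": lambda q: ["row:incident-1947"],
    "semantic": lambda q: ["frame:clip-12", "chunk:fbi-3"],
    "global": lambda q: ["summary:theme-2", "chunk:fbi-3"],
}

print(answer_sources("hybrid", "What links the 1947 cases?", branches))
```

Keeping each branch behind the same call signature is what lets the hybrid mode be a loop over the other three instead of a fourth retrieval implementation.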

The piece that ties this whole pipeline together is that every LLM call goes through one wrapper, and the wrapper's default is the local Claude Code CLI running as a subprocess. There's no API key in the default config. The wrapper exposes the CLI as a LangChain BaseChatModel, including the structured output method, which is what makes the classify node work without ever [...] your machine. To switch backends, you set one environment variable. Set UFO_LLM_MODEL to Codex GPT-5 and you're talking to the local Codex CLI instead.

Set it to a Claude model name and you're hitting the hosted Anthropic or OpenAI SDK. The retrievers, the embeddings, and the rerank model all run locally, too. The whole pipeline is laptop cold-bootable, and the GPU on the DGX is just a nice-to-have for faster embedding. The last design decision is about how the repo travels. There are three layers, separated by what it costs to rebuild them.
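Before getting to those layers, the one-variable backend switch described above can be sketched as follows. Nearly everything here is an assumption: the variable is spelled `UFO_LLM_MODEL` as a guess at how the spoken name is written, the value formats and the "starts with codex" rule are invented for illustration, and the backend labels are placeholders, not names from the repo.

```python
import os

DEFAULT_BACKEND = "claude-code-cli"  # local CLI subprocess, no API key needed


def pick_backend(env=None):
    """Choose an LLM backend label from one environment variable.

    Returns (backend_label, model_name); model_name is None for the default.
    """
    env = os.environ if env is None else env
    model = env.get("UFO_LLM_MODEL", "").strip()  # assumed variable spelling
    if not model:
        return DEFAULT_BACKEND, None
    if model.lower().startswith("codex"):  # assumed naming convention
        return "codex-cli", model
    # Any other model name is treated as a hosted SDK model in this sketch.
    return "hosted-sdk", model


if __name__ == "__main__":
    print(pick_backend())
```

Routing every call through one function like this is what keeps the rest of the pipeline unaware of which backend is active.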

The first layer is the source files: the actual PDFs, images, and videos. They are public domain but heavy, about 3.8 GB for the first release alone, so they're gitignored and re-mirrored on demand from the government archive at war.gov. The second layer is the pipeline JSONLs, which are about 30 MB total and are committed. They cover the OCR output, the captions, entity extraction, and community summaries. That's the real value layer because it encodes every expensive LLM call, and the cloner doesn't pay for any of it again. The third layer, the vector store and the graph database, is gitignored because it rebuilds deterministically from the JSONLs in about 5 minutes.

But there is also a fourth option, which is the shortcut. I published a tarball release on GitHub that ships Chroma and FalkorDB pre-built, so you can clone the repo, extract the tarball, mirror the source files, and start asking questions in under 10 minutes. No LLM cost, no rebuild wait. The download link is in the release notes.

That was the architecture: every box on the diagram, every design choice. If you came in skeptical that a pile of declassified UFO PDFs could be turned into something a language model can answer questions from with citations, I hope you leave at least curious. The repo is public at github.com/michaeljameson10/ufo_ and there is a release tarball there with the Chroma vectors and the FalkorDB graph already built. So, if you just want to talk to the corpus without waiting through any rebuild, you clone the repo, extract the tarball, and start asking it questions today.

Use it for your own research, your own studies, whatever. Episode 2 is the hands-on build. We start from a fresh clone, watch each pipeline stage actually go, and end with the agent answering a real question with inline citations back to specific FBI files. Bring a question you genuinely want answered from the corpus and we will run it on camera. Star the