
Running A RAG Pipeline On The Pentagon UFO Files — Real Cypher, Real Citations

Episode 2: open the actual repo and run all six stages on 115 declassified PDFs — Chroma retriever returns cited answers and a FalkorDB graph agent writes Cypher from plain English.

10 min read

Watch (15:25)




Full transcript (from the video)

Last episode, I drew the architecture: every box on the diagram, every wire between the stages, and every design choice that goes into reading 4 GB of declassified UFO files. Today, we put the keyboard on it. Same corpus as last time: the same 115 PDFs, and the same 138 images and 28 videos to go with them. But this time we open the real repository and run every stage on the real files. By the end of the video, you are watching the Chroma retriever and the FalkorDB graph agent answer real questions. The answers include citations back to specific FBI files.

Find the repo on GitHub at Michael Jameson 10 UFO. Let's run it. Before we open a single source file, I want to show you the fast path because it answers the question every viewer has. Can I clone this and use it right now? Yes.

The repo splits into three layers by cost to rebuild. The source PDFs and videos are public domain but heavy, so they are Git-ignored and re-mirrored on demand from the war.gov website. The pipeline JSONL files (every OCR pass, every entity extraction, and every community summary) are about 30 megabytes total, and they live in the Git repository. That is the real value layer because it encodes every expensive language model call.

So cloners do not pay for any of it twice. The vector store and the graph database can be rebuilt from those JSONL files in about five minutes, or you skip the rebuild entirely and pull the tarball from the GitHub release page, which ships Chroma and FalkorDB pre-built. That is under 10 minutes from a fresh clone to a cited answer. Open the repo. Everything important is under one folder called pipelines. Two stacks sit side by side, numbered 01 through 06.

The top-level pipeline runs the Chroma flow. Steps are ingest, OCR, chunk and embed, then retrieve and agent, with an optional multimodal pass. The graph RAG subdirectory runs the FalkorDB stack. First is video frame extraction with CLIP embeddings. Next is image captioning.

Then entity extraction with the Claude CLI. Then Leiden community detection. One load step pushes everything into FalkorDB. Last, a graph agent generates Cypher queries from natural language. Each stage writes a JSONL file.

The next stage reads it. No hidden state, no caching layer, no orchestration framework, just numbered Python files and JSONL files on disk. Stage 01. The job is to extract whatever text lives in the PDF and flag everything else for OCR. We use PyMuPDF because it is faster than pypdf, has no Java dependency like Tika, and returns layout-preserving text rather than raw bytes.

The heuristic is the bottom three lines. If a page has 50 characters or more of real text, we keep it. If it has less, we mark it needs_ocr equals true, and stage 02 will run Tesseract on it later. False positives are cheap: Tesseract on a near-empty page returns near-empty text, and the chunker skips it. The contract is one PDF in, one record out, with a list of pages each carrying its text and a boolean for whether stage 02 needs to do work.
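The stage 01 heuristic can be sketched in a few lines. This is a minimal illustration, not the repo's code: the record and field names (PageRecord, DocRecord, needs_ocr) are assumptions, the page texts are passed in as plain strings so the sketch runs without any PDF library, and counting characters after stripping whitespace is my guess at what "real text" means.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MIN_TEXT_CHARS = 50  # threshold from the transcript: 50+ characters counts as real text


@dataclass
class PageRecord:
    page: int
    text: str
    needs_ocr: bool  # True means stage 02 should OCR this page


@dataclass
class DocRecord:
    file: str
    pages: list[PageRecord] = field(default_factory=list)


def classify_pages(file: str, page_texts: list[str]) -> DocRecord:
    """One PDF in, one record out: each page keeps its text plus an OCR flag."""
    doc = DocRecord(file=file)
    for number, raw in enumerate(page_texts, start=1):
        text = raw.strip()  # assumption: whitespace does not count as real text
        doc.pages.append(
            PageRecord(page=number, text=text, needs_ocr=len(text) < MIN_TEXT_CHARS)
        )
    return doc


# Hypothetical inputs: one digital page, one blank scan, one page with a stray stamp.
record = classify_pages("example.pdf", ["A" * 200, "   \n", "stamp 12"])
print([p.needs_ocr for p in record.pages])  # [False, True, True]
```

In a real run the strings would come from the PDF library's per-page text extraction rather than a hand-written list.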

I want to stop on this number for one slide because it is the kind of number you only learn by actually running the code on the actual files. Going in, my mental model said about half the PDFs would be digital and about half would be scans. The reality is much worse. Of the 4,147 pages across 115 documents, 3,624 are scans. That is 87%.

The 2024 imagery is high resolution, but it lives in PDFs as embedded JPEGs with no text layer. The mid-century FBI files are typewriter scans, so almost everything goes through Tesseract. That single number is what controls how long stage 02 takes, which is the longest stage in the entire pipeline. Stage 03 is where the corpus actually becomes searchable. The recursive character splitter respects paragraph breaks first and falls back through new lines, sentences, and spaces before a hard character cut.

1,000 characters per chunk with an overlap of 150 is the safe default for typed government documents. The embedding model is BAAI bge-small version 1.5 at 384 dimensions, around 120 megabytes on disk. It runs entirely on CPU at a few hundred chunks per second, and it stays competitive with the closed cloud embeddings on this kind of structured retrieval. The piece that earns its keep at runtime is the metadata block.

Every chunk knows the file it came from and the page number, plus the agency, the incident date, and the location. The chunk ID is deterministic, file:page:chunk, which means rerunning stage 03 upserts instead of duplicating. After the run, Chroma holds 8,318 chunks indexed against the embedding column. Stage 04 is the production answer to the most-asked RAG interview question.
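Before moving on, the deterministic chunk IDs from stage 03 are worth a tiny sketch, because they are what makes a rerun safe. A plain dict stands in for the vector store here, and the function names and the example file name are made up for illustration.

```python
def chunk_id(file: str, page: int, chunk: int) -> str:
    """Deterministic ID in the file:page:chunk shape described in the transcript."""
    return f"{file}:{page}:{chunk}"


def upsert(store: dict, file: str, page: int, chunk: int, text: str, metadata: dict) -> None:
    """Write-or-overwrite keyed by the deterministic ID, so reruns never duplicate."""
    store[chunk_id(file, page, chunk)] = {
        "text": text,
        "metadata": {"file": file, "page": page, **metadata},
    }


store: dict = {}
for _ in range(2):  # simulate running stage 03 twice over the same page
    upsert(store, "fbi_file.pdf", 3, 0, "first chunk", {"agency": "FBI"})
    upsert(store, "fbi_file.pdf", 3, 1, "second chunk", {"agency": "FBI"})

print(len(store))     # 2, not 4
print(sorted(store))  # ['fbi_file.pdf:3:0', 'fbi_file.pdf:3:1']
```

With random IDs the second pass would have doubled the collection. With IDs derived from file, page, and chunk index, the second pass simply overwrites the first.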

Pure dense retrieval is bad at exact strings. Ask for a specific FBI case file by number and the embeddings shrug, because that exact string was never in any pre-training set. BM25 catches it immediately because it just counts words. Pure BM25, in the other direction, cannot paraphrase. Ask what the FBI said about disc-shaped craft and it misses every page that wrote saucer.

So we ensemble them 50/50, pull the top 20 candidates, and then a cross-encoder reranker reorders the survivors by reading the question and each chunk together rather than separately. Watch the scores in the output. The top hit is just under eight. The next two are slightly lower.

Those are real rerank scores from a small cross-encoder model. The top three results are all from the same FBI section, all the right pages. That is the retriever working. Stage 05 wraps the retriever in a LangGraph state machine.
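Before the state machine, here is a rough sketch of the stage 04 idea: fuse a dense ranking and a BM25 ranking with equal weight, keep the top candidates, then rerank. The fusion rule below is weighted reciprocal rank fusion, which is one common way to ensemble rankings and is an assumption on my part, not something the transcript confirms. The word-overlap scorer is a toy stand-in for a cross-encoder, and all page IDs and texts are made up.

```python
def weighted_rrf(rankings, weights, k=60, top_n=20):
    """Fuse several ranked ID lists; each list contributes weight / (k + rank)."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


def rerank(question, candidates, score_fn, top_k=3):
    """Score each (question, chunk) pair together and keep the best top_k IDs."""
    scored = [(score_fn(question, text), doc_id) for doc_id, text in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


def overlap(question, text):
    """Toy pairwise scorer: number of shared words (a cross-encoder goes here)."""
    return len(set(question.lower().split()) & set(text.lower().split()))


dense = ["p12", "p07", "p33"]  # hypothetical dense-retriever order
bm25 = ["p07", "p90", "p12"]   # hypothetical BM25 order
fused = weighted_rrf([dense, bm25], [0.5, 0.5])
print(fused)  # ['p07', 'p12', 'p90', 'p33']

texts = {
    "p07": "disc shaped craft over base",
    "p12": "saucer report",
    "p90": "craft sighting",
}
print(rerank("disc shaped craft", texts, overlap))  # ['p07', 'p90', 'p12']
```

Notice that the toy scorer ranks the saucer page last because it only counts shared words. That is exactly the paraphrase gap a real cross-encoder is there to close.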

Three nodes, two edges, one typed state dict that every node mutates by returning a partial update. The classify node looks at the question and produces a query plan: an intent label and any metadata filters the question implies. The retrieve node calls the hybrid retriever with that filter applied as a prefilter, so we never search outside the relevant agency or file when one is named. The synthesize node takes the retrieved chunks and writes the final answer with inline numbered citations back to specific PDF pages.
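To make the shape concrete without pulling in any framework, here is a dependency-free sketch of the same pattern: three nodes, two edges, and a typed state dict that each node updates by returning a partial dict. It is not the repo's graph. The node bodies are stubs, the corpus rows are invented, and the tiny run loop only mimics what a graph runtime would do.

```python
from typing import Callable, Dict, List, Optional, TypedDict


class AgentState(TypedDict, total=False):
    question: str
    file_filter: Optional[str]
    chunks: List[dict]
    answer: str


# Toy corpus standing in for the retriever's index (made-up rows).
CORPUS = [
    {"file": "a.pdf", "page": 2, "text": "Witness reported a disc."},
    {"file": "b.pdf", "page": 7, "text": "Pilot saw nine objects."},
]


def classify(state: AgentState) -> dict:
    # Stub: a real classify node would ask a language model for the plan.
    named = [c["file"] for c in CORPUS if c["file"] in state["question"]]
    return {"file_filter": named[0] if named else None}


def retrieve(state: AgentState) -> dict:
    wanted = state.get("file_filter")
    # Prefilter: when a file is named, nothing outside it is considered.
    hits = [c for c in CORPUS if wanted is None or c["file"] == wanted]
    return {"chunks": hits}


def synthesize(state: AgentState) -> dict:
    numbered = list(enumerate(state["chunks"], start=1))
    body = " ".join(f"{c['text']} [{i}]" for i, c in numbered)
    sources = "\n".join(f"[{i}] {c['file']} p.{c['page']}" for i, c in numbered)
    return {"answer": body + "\n" + sources}


NODES: Dict[str, Callable[[AgentState], dict]] = {
    "classify": classify,
    "retrieve": retrieve,
    "synthesize": synthesize,
}
EDGES = {"classify": "retrieve", "retrieve": "synthesize"}  # two edges


def run(question: str) -> AgentState:
    state: AgentState = {"question": question}
    node: Optional[str] = "classify"
    while node is not None:
        state.update(NODES[node](state))  # merge the node's partial update
        node = EDGES.get(node)
    return state


result = run("What does a.pdf say?")
print(result["answer"])
```

Adding a critique step in this sketch would mean one new entry in NODES and one changed entry in EDGES, which is the property that makes the graph framing attractive.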

The reason to model this as a graph rather than a chain is the obvious next move. Adding a critique step that loops back to retrieve, or a multihop step that does two retrievals in sequence, is a single edge change in this graph. With a plain chain, it would be a refactor. The classify node is the cleanest demonstration of why structured output exists in LangChain. The schema is a Pydantic class with three fields.

Intent, file filter, agency filter. The language model is told to return JSON matching that schema and to leave the filters as null unless the question explicitly names one. The interesting trick here is that the same with_structured_output method works against the local Claude Code subprocess, not just the cloud API. So when you ask what does file 62-HQ-83894 say about propulsion, the model fills in a file filter and the retriever filters down to that one document before search runs. When you ask a vague question like summarize the FBI files, the filters stay null and the retriever sweeps the whole corpus.
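Here is a small sketch of that three-field plan and how it could turn into a retrieval filter. To keep it runnable with the standard library only, a dataclass stands in for the Pydantic class, and a hand-written JSON string stands in for the model's reply. The field names, the metadata keys, and the example file ID are illustrative assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryPlan:
    intent: str
    file_filter: Optional[str] = None    # null unless the question names a file
    agency_filter: Optional[str] = None  # null unless the question names an agency


def parse_plan(raw_json: str) -> QueryPlan:
    """Validate the model's JSON reply into the three-field plan."""
    data = json.loads(raw_json)
    return QueryPlan(
        intent=str(data["intent"]),
        file_filter=data.get("file_filter") or None,
        agency_filter=data.get("agency_filter") or None,
    )


def to_metadata_filter(plan: QueryPlan) -> dict:
    """Only non-null fields become filters; an empty dict means search everything."""
    out = {}
    if plan.file_filter:
        out["file"] = plan.file_filter
    if plan.agency_filter:
        out["agency"] = plan.agency_filter
    return out


# Hypothetical replies for a specific question and a vague one.
specific = parse_plan('{"intent": "lookup", "file_filter": "example-file-001", "agency_filter": null}')
vague = parse_plan('{"intent": "summary", "file_filter": null, "agency_filter": null}')

print(to_metadata_filter(specific))  # {'file': 'example-file-001'}
print(to_metadata_filter(vague))     # {}
```

The empty filter is the important case: a vague question should widen the search, not fail validation.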

The intent label is currently used for logging, but it is the obvious hook for a router that picks different retrieval strategies per intent later. Okay, live run, but instead of one question, let's do four. Each one is a real query against the real agent. First, were the witnesses drinking? Real answer.

The Atlanta Constitution reporter specifically noted that nobody on staff had been drinking. One exception is on file: a priest who had been drinking quite heavily, which is a sentence I did not expect to read in an FBI cable. Second, did Navy pilots ever see UFOs? A 1952 Air Force status report quotes one pilot saying, "If a spaceship flew wingtip-to-wingtip formation with me, I would not report it." That single sentence captures the entire reporting culture.

Third, were sightings made by police officers? Yes, Constable Cameron filed an official report and noted he expected ribbing from the lads. Four officers in Sinclair witnessed a triangular craft at 1,000 ft. Fourth, did the FBI actually take this seriously? An Army Air Forces letter says the services of the FBI were enlisted to relieve the air forces of the task of tracking down ash can covers, toilet seats, and whatnot.

Every one of these answers has a citation back to a specific FBI page. Plain RAG handles what does file X say beautifully. It chokes on relational questions like list all incidents in 1947 with named witnesses, because those are inherently relational, so we layer in a graph. FalkorDB is a Redis-protocol database that holds graph structure and vector indexes in a single process. One Docker container, fast Cypher, hot right now.

Look at the actual numbers from the loaded database in the panel on the right. Several thousand chunks. A few thousand named people. A similar order for incidents, extracted by Claude from the chunks, and many more edges connecting text to entities. This is what stage 03 of the graph RAG subdirectory builds and what stage 05 loads in under a minute.

This is the climax. Ask a relational question: incidents in 1947 with named witnesses. The graph agent routes it to structured mode. Claude writes Cypher: match an incident witnessed by a person, where the incident date starts with 1947.

Then return the summary along with the date, location, agency, and object kind, plus the witness. FalkorDB runs the query against the live graph and returns 50 rows. The agent summarizes those rows into the answer on the right. Kenneth Arnold over Mount Rainier on June 24th. Civilian pilot reports nine craft in formation.

The original flying saucer incident. The Portland police sightings. Four different officers, all named, all logged on July 4th. 30-plus named witnesses across 1947 alone. Every one of them grounded in a specific FBI document and traceable back through the graph.
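The query the agent wrote for that question has roughly the following shape. This is a reconstruction from the spoken description, not a copy from the screen: the node labels, the relationship name, and every property name are my assumptions about the schema, and the year is passed as a query parameter rather than pasted into the string.

```python
from typing import Dict, Tuple


def incidents_with_witnesses_query(year: int) -> Tuple[str, Dict[str, str]]:
    """Build a Cypher query plus parameters for incidents in a year with named witnesses."""
    query = (
        "MATCH (i:Incident)-[:WITNESSED_BY]->(p:Person)\n"  # labels are assumed
        "WHERE i.date STARTS WITH $year\n"                  # dates assumed to be strings
        "RETURN i.summary, i.date, i.location, i.agency, i.object_kind, p.name"
    )
    return query, {"year": str(year)}


query, params = incidents_with_witnesses_query(1947)
print(query)
print(params)  # {'year': '1947'}
```

The prefix match only works if incident dates are stored as strings that begin with the year, which is what "starts with 1947" suggests. A vector index has no equivalent of that MATCH pattern, which is the whole argument for the graph layer.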

This is what graph buys you that plain vector search cannot. The piece that ties the whole pipeline together is that every language model call in this repo goes through one wrapper, and the wrapper's default is the local Claude Code running as a subprocess. There is no API key in the default config. The wrapper exposes the CLI as a LangChain BaseChatModel, including the with_structured_output method, which is what makes the classify node work without ever leaving your laptop. To switch backends, you set one environment variable: ufo_lm model equals codex, and you are talking to the local Codex CLI. Set it to claude sonnet 46 and you are hitting the cloud Anthropic SDK.
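The backend switch can be pictured as a small routing function. Treat everything named here as a placeholder: the variable name UFO_LLM_MODEL is my guess at what was said in the video, the backend labels are invented, and the rule that any other value is a cloud model name is an assumption drawn from the description.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "UFO_LLM_MODEL"  # hypothetical name; check the repo for the real one


def pick_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Route to a backend label based on a single environment variable."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR, "").strip().lower()
    if not value:
        return "claude-code-subprocess"  # default: local CLI, no API key needed
    if value == "codex":
        return "codex-cli-subprocess"    # still local, different CLI
    return "anthropic-sdk:" + value      # assumed: anything else names a cloud model


print(pick_backend({}))                  # claude-code-subprocess
print(pick_backend({ENV_VAR: "codex"}))  # codex-cli-subprocess
```

Keeping the decision in one function means the retriever, the classify node, and the graph agent never need to know which backend is answering.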

The retrievers, the embeddings, the cross-encoder, and the entity extraction all run locally too. The whole pipeline is laptop cold-bootable, and the GPU on the DGX is just a nice-to-have for faster embedding. The last design decision is about how the repository travels. There are three layers separated by what it costs to rebuild them. Layer one is the source files, the actual PDFs and images and videos.

They are public domain but heavy, about 3.8 GB for release one alone, so they are Git-ignored and re-mirrored on demand from war.gov. Layer two is the pipeline JSONL files: the OCR output, the entity extractions, and the community summaries. They are about 30 megabytes total, and they are committed to Git. That is the real value layer because it encodes every expensive language model call, and a cloner does not pay for any of it again. Layer three is Chroma and FalkorDB themselves, Git-ignored because they rebuild deterministically from the JSONL files in about five minutes.

And then there is the shortcut. The tarball release on GitHub ships the pre-built databases, so a cloner can skip layer three entirely. That is how this repository ships. That is the build.

Every stage of the pipeline running on the real corpus, every number coming from the actual loaded database, every citation pointing back at a specific FBI page. If you came in skeptical that a pile of declassified UFO PDFs could be turned into a question answering system with cited paragraphs and a Cypher-backed layer, I hope you leave at least convinced. The repository is public at github.com/micheljamson 10/ufo. There is a release tarball there with Chroma and FalkorDB already built, so you can clone, extract, and start asking it your own questions today. Episode 3 is the failure mode episode. The eval harness, the entity resolution beyond exact match.

The drift problem.