Every Case Ever Decided
On building a search engine over all of American case law
March 31, 2026
I built a search engine over every American court opinion ever published. 10.7 million opinions, 1685 to present, all fifty states plus federal. 77 million searchable passages. It runs on a Mac Mini in a closet.
This is a project post. I want to talk about what it is, how it works, what it costs, and why I think the interesting version is the one where I give it away.
What it is
CourtListener, run by the Free Law Project, maintains the largest open repository of American case law. When Harvard’s Caselaw Access Project shut down in March 2024, all their data migrated there. The corpus is CC0, public domain, free to download.
I downloaded all of it. 52 gigabytes compressed. Opinions, citations, dockets, parentheticals, judge records, financial disclosures. I loaded it into PostgreSQL on an M1 Max with an 8TB drive, then spent a week embedding every opinion into a 384-dimensional vector space using sentence-transformers. The result: 77 million passages in a pgvector index, searchable by meaning rather than keywords.
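The post does not say how opinions were split into the 77 million passages, so here is a plausible sketch of that step: a sliding word-window chunker with overlap, so no holding gets cut in half at a passage boundary. The window and overlap sizes are illustrative assumptions, not the values used in the actual pipeline.

```python
# Hypothetical chunker: split one opinion into overlapping word-window
# passages before embedding. Window/overlap sizes are assumptions.
def chunk_opinion(text: str, window: int = 200, overlap: int = 40) -> list[str]:
    """Split an opinion into overlapping passages of ~`window` words."""
    words = text.split()
    step = window - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last window reached the end of the opinion
    return passages

opinion = " ".join(f"word{i}" for i in range(500))
print(len(chunk_opinion(opinion)))  # 3 overlapping passages
```

Each passage then gets embedded independently, which is why a 10.7-million-opinion corpus turns into 77 million vectors.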
On top of that, I built a Shepardizer. If you are not a lawyer: Shepard’s Citations is a service that tells you whether a case you are relying on has been overruled, distinguished, or affirmed by later courts. LexisNexis charges serious money for this. The open citation graph in the CourtListener data has 76 million links between opinions. From that graph plus the parenthetical records, I built a table of 332,000 treatment classifications. You can ask “is this case still good law?” and get a real answer backed by the citation chain.
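The good-law check reduces to aggregating treatment labels over the citation rows that point at a case. A toy sketch, with a hypothetical label taxonomy and a two-row table standing in for the real 332,000 classifications:

```python
# Toy Shepardizer: classify a case by the worst treatment any later
# court gave it. The label sets are illustrative, not the real taxonomy.
NEGATIVE = {"overruled", "superseded", "abrogated"}
CAUTION = {"distinguished", "criticized", "limited"}

def shepardize(case_id: str, treatments: list[tuple[str, str, str]]) -> str:
    """treatments: (cited_case, citing_case, treatment) rows from the graph."""
    labels = {t for cited, _citing, t in treatments if cited == case_id}
    if labels & NEGATIVE:
        return "negative"  # a later court overruled or superseded it
    if labels & CAUTION:
        return "caution"   # later courts narrowed or questioned it
    return "good"          # only neutral or positive treatment found

rows = [
    ("roe-v-wade", "dobbs-v-jackson", "overruled"),
    ("miranda-v-arizona", "dickerson-v-us", "affirmed"),
]
print(shepardize("roe-v-wade", rows))         # negative
print(shepardize("miranda-v-arizona", rows))  # good
```

The real table is built once, offline, so the query-time check is a single indexed lookup rather than a graph traversal.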
How it works
You type a question in natural language. The system converts your question into the same 384-dimensional vector space as the passages, finds the nearest matches by cosine similarity, and returns them ranked by relevance. Each result comes with the case name, date, court, citation count, and Shepardizer signals.
The embedding model is all-MiniLM-L6-v2, a general-purpose sentence transformer. 90 megabytes. It runs on CPU in about 50 milliseconds per query. The vector search uses an IVFFlat index with 8,000 clusters over the 77 million passages, which brings query time down from minutes (brute-force over 218 gigabytes of vectors) to seconds.
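The ranking step itself is just cosine similarity between the query vector and passage vectors. A toy sketch, with 4-dimensional vectors standing in for the 384-dimensional embeddings and a brute-force scan standing in for the IVFFlat index:

```python
# Toy ranking step: score every passage by cosine similarity to the
# query vector, return the top matches. Production delegates the scan
# to pgvector's IVFFlat index; this is the brute-force equivalent.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, passages, top_k=2):
    """passages: list of (passage_id, vector) pairs."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passages]
    return [pid for _score, pid in sorted(scored, reverse=True)[:top_k]]

passages = [
    ("qualified-immunity-passage", [0.9, 0.1, 0.0, 0.1]),
    ("contract-law-passage",       [0.0, 0.2, 0.9, 0.1]),
    ("excessive-force-passage",    [0.8, 0.2, 0.1, 0.0]),
]
query = [1.0, 0.1, 0.0, 0.1]
print(search(query, passages))  # the two police-misconduct passages rank first
```

The whole point of the embedding space is that “qualified immunity” and “excessive force” passages land near each other, so a single query vector surfaces both.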
Here is an actual query and its results:
Query: “qualified immunity police excessive force”
- Greene v. Feaster (2019): “providing standards for excessive force and qualified immunity”
- White v. Traino (2000): “qualified immunity defense inapplicable when officer uses excessive force or engages in brutality”
- Modacure v. Short (2023): “overcoming qualified immunity is especially difficult in excessive-force cases”
Real cases. Real holdings. Spanning decades. No hallucinations.
The hallucination problem this solves
Over 700 court cases now involve AI-fabricated citations. A lawyer asks ChatGPT for supporting case law, gets a confident-sounding citation, files it, and opposing counsel or the judge discovers the case does not exist. This is the single most embarrassing failure mode for AI in the legal profession.
The fix is obvious: do not generate citations from a language model’s memory. Look them up. A RAG system over the actual corpus returns cases that exist, with holdings that are correctly characterized. The parenthetical records are especially good for this because they are court-written summaries, already concise and accurate.
What it costs to run
The Mac Mini it runs on was already in the closet. The index build took about 20 hours of compute. The embedding model is free. PostgreSQL is free. The total API spend for the project was about $50 in Claude API calls for the autonomous build loop that loaded the data and ran the indexing.
A query costs nothing. No API calls, no per-request charges. Just a database lookup on local hardware. At the traffic levels a public MCP tool would realistically see, I could serve this from my house and not notice.
The distribution idea
Here is the part I have been thinking about. The IVFFlat index organizes the 77 million passages into 8,000 clusters of similar vectors. Each cluster is about 15 megabytes of embedding data. To search the corpus, you only need to check the clusters nearest your query, maybe 5-15 of them depending on how broad your question is.
This maps onto IPFS. Put each cluster on IPFS as a content-addressed block. Publish the cluster centroids (12 megabytes, trivial). Now anyone can run a local MCP tool that:
- Downloads the centroids once (12 MB)
- Embeds their query locally (the model is 90 MB)
- Identifies which clusters to check
- Pulls just those clusters from IPFS (75-225 MB per query)
- Searches locally
- Pulls the full opinion text for the top results (trivial, kilobytes)
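The client loop above can be sketched in a few lines. Everything here is plain nearest-centroid math plus an in-memory cache; `fake_fetch` is a stand-in for a real IPFS block fetch (e.g. via a local daemon), and the 2-d centroids stand in for the real 384-dimensional ones:

```python
# Sketch of the proposed client loop: pick the nearest clusters by
# centroid, serve them from a local cache, fall back to the network
# once per cluster. `fake_fetch` simulates the IPFS retrieval.
import math

def nearest_clusters(query_vec, centroids, n_probe=2):
    """Return the ids of the n_probe centroids closest to the query."""
    ranked = sorted(centroids, key=lambda c: math.dist(query_vec, c[1]))
    return [cid for cid, _vec in ranked[:n_probe]]

cache = {}

def get_cluster(cluster_id, fetch):
    """Serve from the local cache; hit the network only on a miss."""
    if cluster_id not in cache:
        cache[cluster_id] = fetch(cluster_id)
    return cache[cluster_id]

centroids = [("c0", [0.0, 0.0]), ("c1", [1.0, 1.0]), ("c2", [5.0, 5.0])]
fetched = []

def fake_fetch(cluster_id):
    fetched.append(cluster_id)          # record the simulated network hit
    return f"vectors-for-{cluster_id}"

probe = nearest_clusters([0.9, 0.9], centroids)
cold = [get_cluster(cid, fake_fetch) for cid in probe]  # cold: pulls from IPFS
warm = [get_cluster(cid, fake_fetch) for cid in probe]  # warm: cache only
print(probe, fetched)  # each cluster was fetched exactly once
```

Content addressing is what makes this safe to cache forever: a cluster block never changes under the same hash, so a cached copy can never go stale.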
After the first few queries in a topic area, the relevant clusters are cached locally. A practicing lawyer who works in employment law would build up maybe 1-2 gigabytes of cached clusters and almost never hit the network again for their domain.
The minimum local footprint is about 100 megabytes. Cold queries pull a few hundred megabytes. Warm queries are instant. No server required. No subscription. No API key. Just peer-to-peer legal research.
The thing that does not exist yet
There is a bigger gap than case law. The State Decoded project, funded by the Knight Foundation in 2011, tried to make all fifty state statutory codes freely available in structured, machine-readable form. It died around 2021. The creator, Waldo Jaquith, moved on to other government technology work.
The reason it died was that normalizing fifty different states’ publishing formats into a common schema was a brutal manual engineering problem. Every state publishes differently. Some only offer PDFs. Some have proprietary web apps. Some contract with LexisNexis to be their “official” publisher, which creates quiet institutional resistance to free alternatives.
In 2026, that normalization problem is trivially solvable. You point an adaptive scraper at a state legislature website, it figures out the structure, extracts the text, and handles the edge cases. The kind of bespoke parser maintenance that burned out one developer working alone is a routine task for an agentic system. The estimated cost to scrape and normalize all fifty states: $50-200 in API spend. The estimated size of all state statutory codes combined: 10-25 gigabytes of plain text.
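What would the normalized output look like? One possible target schema, sketched below; the fields are my guess at a minimal common denominator across fifty publishing formats, not a schema from the (unbuilt) project:

```python
# Hypothetical target schema for normalized state statutory codes.
# Field names and the example section are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StatuteSection:
    state: str                 # two-letter code, e.g. "CA"
    code: str                  # which code within the state, e.g. "Penal Code"
    section: str               # section identifier, e.g. "187"
    heading: str
    text: str
    effective_date: Optional[str] = None
    cited_cases: list[str] = field(default_factory=list)  # cross-links into the case law corpus

s = StatuteSection(state="CA", code="Penal Code", section="187",
                   heading="Murder defined",
                   text="Murder is the unlawful killing of a human being...")
print(s.state, s.section)
```

The `cited_cases` field is where the cross-linking described below would live: populate it from the citation extraction pass over the opinions, and each statute section becomes an entry point into the case law.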
If you put the case law corpus on one side and the statutory codes on the other, and you cross-link them (this statute was interpreted in these cases, this case discusses these statutes), you have built the core of what LexisNexis and Westlaw sell for thousands of dollars a year per seat. Not the editorial enhancements, not the practice guides. But the primary legal materials, structured and searchable.
Building codes are harder because of the copyright fight over privately authored standards incorporated into law. But the trend in the courts is clear. Georgia v. Public.Resource.Org (2020, SCOTUS): you cannot copyright the law. ASTM v. Public.Resource.Org (2023, DC Circuit): publishing incorporated standards is fair use. The legal ground is being won. The infrastructure to distribute the results does not exist yet.
Who this is for
The primary user is an AI. Specifically: any language model that needs to answer a legal question without hallucinating. The MCP tool sits in the model’s toolkit. When a legal question comes up, the model queries the corpus instead of generating from memory. The human never sees the tool. They just get better answers.
The secondary user is the human who wants their AI to not be terrible at law. A paralegal asking Claude to research precedent. A journalist checking whether a cited case is real. A legal aid attorney who cannot afford Westlaw.
If someone takes this work, builds something better on top of it, and charges money for it, I consider that a win. The corpus is public domain. The embeddings are a commodity operation. The Shepardizer is built from public data. The goal is not a business. The goal is that the law is searchable by anyone, including machines, for free.
Current status
The case law RAG is built and queryable. The IVFFlat index is complete. An MCP server exposes four tools: caselaw_search, caselaw_case, caselaw_shepardize, and caselaw_stats. The IPFS distribution layer is designed but not built. The state statutory code project is a plan, not code.
If you work on access-to-law projects, or if you want an MCP tool that gives your AI actual case law instead of hallucinated citations, I want to hear from you. I am at [email protected].