Roadmap

What we're processing, what's live, and what's coming. We build in the open.

Datasets

DOJ Full Dump In Progress
3.5 million pages across 12 data sets. Data Set 4 (176 documents) is fully ingested; the remaining 11 sets are queued. Contents include FBI records, estate documents, travel logs, and bank reports, among others.
U.S. House Oversight Materials Ingested
Court filings and oversight committee releases. Fully indexed with entity extraction.
FBI Records In Progress
FOIA-released FBI materials related to the Epstein investigation.
U.S. Customs & Border Protection Planned
Travel records and border crossing data from FOIA requests.
Estate Document Dumps In Progress
Emails, spreadsheets, videos, and images from the Epstein estate productions.

Features — Live

Hybrid Search Live
BM25 keyword search combined with dense vector retrieval, so results can match both exact terms and semantically related passages.
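
How the keyword and vector rankings are merged is not specified on this page. One common option is reciprocal rank fusion (RRF), shown here as a minimal sketch under that assumption. The document IDs and the constant k=60 are illustrative, not taken from the archive.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document IDs, best match first in each list.
bm25_hits = ["doc-17", "doc-03", "doc-42"]
dense_hits = ["doc-03", "doc-99", "doc-17"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still appears lower in the fused ranking.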
Knowledge Graph Live
6,100+ entities (people, organizations, locations, dates) extracted and resolved across all documents.
Citation-Backed AI Responses Live
Every answer links back to the source documents it was derived from. No hallucinated claims.
Entity Extraction & Resolution Live
Automatic identification and deduplication of people, organizations, and locations across documents.
Document Browsing & Download Live
Browse indexed documents, view metadata, and get presigned S3 download links for originals.
Entity Search Live
Search 6,100+ entities by name with relevance scoring. Browse people, organizations, and locations across all documents.
API Access Live
Public REST API at api.epsteinfilesarchive.com for programmatic access to search, documents, and entities.
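
A rough sketch of calling the API from Python using only the standard library. Only the host name comes from this page. The `https` scheme, the `/search` path, the `q` and `limit` parameters, and the JSON response format are assumptions for illustration, so check the API documentation for the real routes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host from the roadmap; the https scheme is assumed.
BASE_URL = "https://api.epsteinfilesarchive.com"


def build_search_url(query, limit=10):
    # "/search", "q", and "limit" are hypothetical names.
    return f"{BASE_URL}/search?{urlencode({'q': query, 'limit': limit})}"


def search(query, limit=10, timeout=10):
    # Performs the HTTP GET and decodes a JSON body (assumed response format).
    with urlopen(build_search_url(query, limit), timeout=timeout) as resp:
        return json.load(resp)


# Building the URL needs no network access.
print(build_search_url("flight logs", limit=5))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test before any traffic is sent.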
Chat Interface Live
Ask questions in natural language and get citation-backed answers from the full archive.
Chat Query Logging Live
All queries are logged to DynamoDB for analytics and search quality improvement.
Cohere Reranking Live
Search results are reranked using Cohere's reranking model for improved relevance ordering.

Features — In Progress

Country / Region Search In Progress
Filter documents by the countries they mention. Built for international researchers, prompted by demand from the Indonesian press.
Person Search with Disambiguation In Progress
Search by person name with entity resolution to handle aliases and name variants.
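
To illustrate the idea, here is a minimal sketch of alias handling: each name variant is reduced to a normalized key, and every key maps to one canonical entity ID. The normalization rules, the entity ID, and the sample aliases are all hypothetical. The production resolver is not described on this page and is likely more involved.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a person name to a comparison key."""
    # Strip accents, then lowercase.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    # Turn "last, first" into "first last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    # Drop punctuation and collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def build_alias_index(entities):
    """Map every normalized alias to its canonical entity ID."""
    index = {}
    for entity_id, aliases in entities.items():
        for alias in aliases:
            index[normalize_name(alias)] = entity_id
    return index


# Hypothetical entity with illustrative aliases.
entities = {"person-001": ["Jane Q. Doe", "Doe, Jane Q", "JANE Q DOE"]}
index = build_alias_index(entities)
print(index.get(normalize_name("doe, jane q.")))
```

Exact-key matching like this only covers spelling and ordering variants. Nicknames or genuinely different people with the same name would need additional signals.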

Features — Planned

Press / Journalist Tools Planned
Embeddable citations, API access, and export tools for journalists writing stories from the archive.
Alert System Planned
Email alerts when new documents are ingested that mention a specific country, person, or topic.
Mobile PWA Planned
Progressive web app for mobile-first access to the archive and chat.
Curated Collections Planned
Hand-curated document sets organized by theme, person, or investigation thread.

Known Issues

Entity Resolution: Suspicious Merges Under Review
~658 entity merge clusters are flagged for manual review because some entities may be incorrectly grouped together. They are being audited before the next data ingestion round.
Entity Resolution: Location Merge Errors Under Review
Some location entities were incorrectly merged (e.g., "Poland" grouped under the "Miami" cluster). The fix requires Neo4j graph surgery and is scheduled for the next entity resolution pass.
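
As an illustration of how a merge like this could be screened for, the sketch below flags cluster members whose names look nothing like the cluster's canonical label. This is an assumed heuristic, not the audit method actually in use. The threshold of 0.4 and the sample cluster are hypothetical.

```python
from difflib import SequenceMatcher


def flag_outliers(canonical, members, threshold=0.4):
    """Return members whose name looks unlike the cluster's canonical label."""
    flagged = []
    for name in members:
        # ratio() is 1.0 for identical strings and near 0.0 for unrelated ones.
        ratio = SequenceMatcher(None, canonical.lower(), name.lower()).ratio()
        if ratio < threshold:
            flagged.append(name)
    return flagged


# Hypothetical cluster modeled on the "Poland" under "Miami" example.
print(flag_outliers("Miami", ["Miami Beach", "Miami, FL", "Poland"]))
```

String similarity is only a crude first pass: a legitimate alias with a very different spelling would also be flagged, so the output is a list for human review rather than an automatic split.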
Entity Resolution: Date Chain Collapse Under Review
September date entities (9 distinct dates) were collapsed into a single entity. They will be split back into individual date entities during the next resolution run.

Scale-Up: Remaining DOJ Data Sets

The 11 remaining data sets represent the bulk of the archive. Ingesting these is the highest-impact next step.

Data Sets 1–3 Queued
Early estate productions — emails, financial records, and contact lists.
Data Sets 5–8 Queued
Includes videos, images, spreadsheets, and additional correspondence.
Data Sets 9–12 Queued
Later productions including FBI materials, travel records, and bank statements.
