Roadmap
What we're processing, what's live, and what's coming. We build in the open.
Datasets
DOJ Full Dump
In Progress
3.5 million pages across 12 data sets. Data Set 4 (176 documents) is fully ingested; the remaining 11 sets are queued. Includes FBI records, estate documents, travel logs, bank reports, and more.
U.S. House Oversight Materials
Ingested
Court filings and oversight committee releases. Fully indexed with entity extraction.
FBI Records
In Progress
FOIA-released FBI materials related to the Epstein investigation.
U.S. Customs & Border Protection
Planned
Travel records and border crossing data from FOIA requests.
Estate Document Dumps
In Progress
Emails, spreadsheets, videos, and images from the Epstein estate productions.
Features — Live
Hybrid Search
Live
BM25 keyword search combined with dense vector retrieval, so results match both exact terms and semantically related passages.
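The roadmap does not say how the keyword and vector results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an illustration only. The function name, the document IDs, and the constant `k=60` are assumptions and are not taken from the archive's code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the two retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear near the top of both lists rise above documents that only one retriever found, which is the behavior a hybrid search is after.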
Knowledge Graph
Live
6,100+ entities (people, organizations, locations, dates) extracted and resolved across all documents.
Citation-Backed AI Responses
Live
Every answer links back to the source documents it was derived from. No hallucinated claims.
Entity Extraction & Resolution
Live
Automatic identification and deduplication of people, organizations, and locations across documents.
Document Browsing & Download
Live
Browse indexed documents, view metadata, and get presigned S3 download links for originals.
Entity Search
Live
Search 6,100+ entities by name with relevance scoring. Browse people, organizations, and locations across all documents.
API Access
Live
Public REST API at api.epsteinfilesarchive.com for programmatic access to search, documents, and entities.
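As a rough illustration of programmatic access, the sketch below builds a search URL against the host named above. Only the host comes from this roadmap. The `https` scheme, the `/search` path, and the `q` and `limit` parameters are hypothetical placeholders, so check the API's own documentation for the real routes before using it.

```python
from urllib.parse import urlencode, urlunsplit

API_HOST = "api.epsteinfilesarchive.com"  # host named in the roadmap

def build_search_url(query, limit=10):
    # "/search", "q", and "limit" are assumed names, not documented routes.
    params = urlencode({"q": query, "limit": limit})
    return urlunsplit(("https", API_HOST, "/search", params, ""))

url = build_search_url("flight logs", limit=5)
print(url)
```

The URL can then be fetched with any HTTP client. The sketch deliberately stops short of sending a request.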
Chat Interface
Live
Ask questions in natural language and get citation-backed answers from the full archive.
Chat Query Logging
Live
All queries are logged to DynamoDB for analytics and search quality improvement.
Cohere Reranking
Live
Search results are reranked using Cohere's reranking model for improved relevance ordering.
Features — In Progress
Country / Region Search
In Progress
Filter documents by country mentions. Built for international researchers (driven by Indonesian press demand).
Person Search with Disambiguation
In Progress
Search by person name with entity resolution to handle aliases and name variants.
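A minimal sketch of alias handling, assuming a simple lookup table from normalized name variants to entity IDs. The names, IDs, and normalization rules are invented for illustration and are not the archive's resolution logic.

```python
import re
import unicodedata

def normalize_name(name):
    # Strip accents, reorder "Last, First", lowercase, and drop punctuation.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())

def build_alias_index(entities):
    # Map each normalized variant to every entity ID that uses it.
    index = {}
    for entity_id, names in entities.items():
        for name in names:
            index.setdefault(normalize_name(name), set()).add(entity_id)
    return index

def resolve(query, index):
    # More than one ID in the result means the query is ambiguous.
    return sorted(index.get(normalize_name(query), set()))

# Hypothetical entities with overlapping aliases.
entities = {
    "person:001": ["Jane Q. Doe", "Doe, Jane Q.", "J. Doe"],
    "person:002": ["John Doe", "J. Doe"],
}
index = build_alias_index(entities)
print(resolve("DOE, JANE Q", index))
print(resolve("J. Doe", index))
```

Returning every matching ID, rather than picking one, leaves the disambiguation choice to the user or to a later ranking step.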
Features — Planned
Press / Journalist Tools
Planned
Embeddable citations, API access, and export tools for journalists writing stories from the archive.
Alert System
Planned
Email alerts when new documents are ingested that mention a specific country, person, or topic.
Mobile PWA
Planned
Progressive web app for mobile-first access to the archive and chat.
Curated Collections
Planned
Hand-curated document sets organized by theme, person, or investigation thread.
Known Issues
Entity Resolution: Suspicious Merges
Under Review
~658 entity merge clusters flagged for manual review. Some entities may be incorrectly grouped together. Being audited before the next data ingestion round.
Entity Resolution: Location Merge Errors
Under Review
Some location entities were incorrectly merged (e.g., "Poland" grouped under the "Miami" cluster). The fix requires Neo4j graph surgery and is scheduled for the next entity resolution pass.
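One way to surface merges like this for review is to flag clusters whose member names look nothing alike. The sketch below uses string similarity from Python's standard library. The cluster IDs, member names, and the 0.5 threshold are assumptions, and this is not the audit procedure the project uses.

```python
from difflib import SequenceMatcher
from itertools import combinations

def min_pairwise_similarity(names):
    # A single-member cluster cannot contain a bad merge.
    if len(names) < 2:
        return 1.0
    return min(
        SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        for a, b in combinations(names, 2)
    )

def flag_suspicious(clusters, threshold=0.5):
    # Flag clusters where at least one pair of names is very dissimilar.
    return sorted(
        cid for cid, names in clusters.items()
        if min_pairwise_similarity(names) < threshold
    )

# Hypothetical merge clusters.
clusters = {
    "loc:1": ["Miami", "Miami, FL", "Poland"],
    "loc:2": ["New York", "New York City"],
}
flagged = flag_suspicious(clusters)
print(flagged)
```

A check like this also flags legitimate aliases that share few characters (an abbreviation next to a full name, for example), so it suits building a manual review queue better than splitting clusters automatically.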
Entity Resolution: Date Chain Collapse
Under Review
Nine distinct September date entities were collapsed into a single entity. They will be split back into individual date entities during the next resolution run.
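Splitting a collapsed date entity amounts to regrouping its mentions by the exact date each one names. The sketch below assumes every mention carries its original text in a "Month day, year" form. The field names, document IDs, and dates are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

def split_by_date(mentions, fmt="%B %d, %Y"):
    # Group document IDs under the ISO date parsed from each mention's text.
    groups = defaultdict(list)
    for mention in mentions:
        day = datetime.strptime(mention["text"], fmt).date()
        groups[day.isoformat()].append(mention["doc_id"])
    return dict(groups)

# Hypothetical mentions attached to one wrongly merged date entity.
merged_mentions = [
    {"doc_id": "doc-1", "text": "September 3, 2005"},
    {"doc_id": "doc-2", "text": "September 14, 2005"},
    {"doc_id": "doc-3", "text": "September 3, 2005"},
]
split = split_by_date(merged_mentions)
print(split)
```

Each key in the result would become its own date entity. Mentions in other formats would need extra parsing rules before they could be regrouped this way.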
Scale-Up: Remaining DOJ Data Sets
The 11 remaining data sets represent the bulk of the archive. Ingesting these is the highest-impact next step.
Data Sets 1–3
Queued
Early estate productions — emails, financial records, and contact lists.
Data Sets 5–8
Queued
Includes videos, images, spreadsheets, and additional correspondence.
Data Sets 9–12
Queued
Later productions including FBI materials, travel records, and bank statements.