Pre-labelled synthetic document libraries for document AI teams
Real-looking PDFs at scale, with ground truth, bounding boxes and scanned variants shipped alongside every document. Built by Root Cause Analytics in Sydney.
The product lines
Two libraries from the same generator stack, plus benchmark packs and custom builds.
RCA Insurance Library
Commercial P&C submission extraction QA and training
Scale: 25, 100, 500, 5,000+ packs
RCA Medical Library
Healthcare document extraction QA and training
Scale: 200, 500, 5,000+ documents
RCA Benchmark Packs
Procurement evaluation, vendor bake-off, pre-rollout QA
Scale: Smaller curated subsets
RCA Custom Libraries
Your document types, your schema, your style profiles
Scale: Scope-dependent
What every document ships with
- A PDF in pdfs/ (clean, born-digital)
- A scanned variant in pdfs_scanned/ (rotation, noise, JPEG artefacts)
- A ground truth row in CSV (ground_truth.csv) and JSONL (ground_truth.jsonl)
- A bounding box record in bboxes.jsonl with page index and field coordinates
- A visible synthetic disclaimer rendered on every page
Insurance documents additionally ship per-claim row bboxes (from claim_rows_json) and per-location row bboxes (from location_rows_json). Reviewers can click through to individual rows on the loss run and statement of values, not just the document-level bbox.
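The per-document structure is easiest to see in code. Below is a minimal sketch of pairing a document's ground truth row with its bounding box records; the key names (document_id, page, field, bbox) are illustrative assumptions, not the documented schema, so check the pack README for the real keys.

```python
import json
from pathlib import Path

library = Path("rca_insurance_library")  # hypothetical unpacked library root

# Index ground truth rows by document_id (assumed key name).
truth = {}
with open(library / "ground_truth.jsonl") as f:
    for line in f:
        row = json.loads(line)
        truth[row["document_id"]] = row

# Walk the bounding box records and pair each with its ground truth row.
with open(library / "bboxes.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        doc = truth.get(rec["document_id"])
        if doc is None:
            continue
        # Each record carries a page index and field coordinates; crop
        # pdfs/<document_id>.pdf at these coordinates to review a field,
        # or a single claim or location row in the insurance libraries.
        print(rec["document_id"], rec.get("page"), rec.get("field"), rec.get("bbox"))
```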
What every library ships with
- A library-level manifest.json documenting document type distribution, pack composition (insurance), case mix (medical), red flag inventory (insurance), and per-document metadata
- A splits.json with train / val / test allocation by document_id (ships with libraries above the standard QA scale; see the loading sketch after this list)
- A README.md explaining schema, regeneration commands, and the synthetic safety statement
- A validation_summary.md confirming PDF / ground truth / bbox / scan integrity
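A split file is only useful if it is honoured end to end. The following is a minimal sketch of loading splits.json and resolving each split to PDF paths; the exact JSON shape (a mapping from split name to lists of document_id values) and the <document_id>.pdf filename convention are assumptions rather than the documented format.

```python
import json
from pathlib import Path

library = Path("rca_medical_library")  # hypothetical unpacked library root
splits = json.loads((library / "splits.json").read_text())

train_ids, val_ids, test_ids = (set(splits[k]) for k in ("train", "val", "test"))

# Sanity check: a document_id must not leak across splits.
assert not (train_ids & val_ids | train_ids & test_ids | val_ids & test_ids)

# Resolve each split to born-digital PDFs (swap pdfs/ for pdfs_scanned/
# to train or evaluate on the degraded variants instead).
train_pdfs = [library / "pdfs" / f"{doc_id}.pdf" for doc_id in sorted(train_ids)]
```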
How libraries are built
A deterministic Python generator. Cases are curated by hand, not LLM-generated. Phrase banks supply narrative variety. Style profiles and template families control visual variety so models trained on the library cannot memorise a single layout. Seeds are reproducible: the same seed produces the same PDFs every time.
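The generator itself is not published, but the reproducibility claim maps onto a standard pattern: thread a single seeded RNG through every random choice and keep global state out of it. A toy sketch, with hypothetical names throughout (build_library is not the actual RCA entry point):

```python
import random

STYLE_PROFILES = ["broker_a", "broker_b", "carrier_portal"]   # illustrative
PHRASE_BANK = ["Water damage to stock.", "Third-party vehicle impact."]

def build_library(seed: int, n_docs: int) -> list[dict]:
    rng = random.Random(seed)  # one seeded RNG, no module-level random calls
    return [
        {
            "document_id": f"doc_{seed}_{i:05d}",
            "style_profile": rng.choice(STYLE_PROFILES),
            "narrative": rng.choice(PHRASE_BANK),
        }
        for i in range(n_docs)
    ]

# Same seed, same documents, every time.
assert build_library(42, 10) == build_library(42, 10)
```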
What you can do
- Train, fine-tune, evaluate or QA document AI models
- Stress-test extraction pipelines against varied layouts and scanned input
- Demonstrate an internal extraction system to stakeholders using safe data
- Run a procurement evaluation: shortlist vendors against the same ground truth
- Build a regression suite for an existing extraction pipeline that is currently untested (see the scoring sketch below)
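For the procurement and regression items above, the harness has the same shape either way: run every PDF through a candidate pipeline and score its output against the shipped ground truth. A minimal sketch, where extract_fields is a stand-in for your own pipeline and the ground-truth key names are assumptions:

```python
import json
from pathlib import Path

def extract_fields(pdf_path: Path) -> dict:
    """Placeholder: call your extraction pipeline or a vendor API here."""
    raise NotImplementedError

def field_accuracy(library: Path) -> float:
    correct = total = 0
    with open(library / "ground_truth.jsonl") as f:
        for line in f:
            row = json.loads(line)
            pdf = library / "pdfs" / f"{row['document_id']}.pdf"  # assumed naming
            predicted = extract_fields(pdf)
            for field, expected in row.items():
                if field == "document_id":
                    continue
                total += 1
                correct += int(predicted.get(field) == expected)
    return correct / total if total else 0.0
```

Because every vendor is scored against the same ground_truth.jsonl, the resulting numbers are directly comparable, and re-running the harness after a pipeline change gives the regression signal the last item asks for.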
What libraries are not
- They are not real patient or claimant data. They are not de-identified records. Nothing here is real.
- They are not validated for clinical care, claims handling, underwriting, accounting, regulatory or legal use.
- They are not statistically representative of any specific hospital, broker, insurer book or jurisdiction beyond the conventions documented in the README.
Get a free preview pack
A two-pack insurance preview, or a 25 to 35 document medical review pack, with a five-minute review path documented in the pack README.