Pre-labelled. Pre-bbox'd. Pre-scanned.

Pre-labelled synthetic document libraries for document AI teams

Real-looking PDFs at scale, with ground truth, bounding boxes and scanned variants shipped alongside every document. Built by Root Cause Analytics in Sydney.

Get a free preview pack

See the Insurance Library

The product lines

Libraries built on the same generator stack, plus benchmark packs and custom builds.

RCA Insurance Library

Commercial P&C submission extraction QA and training

Scale: 25, 100, 500, 5,000+ packs

Learn more

RCA Medical Library

Healthcare document extraction QA and training

Scale: 200, 500, 5,000+ documents

Learn more

RCA Benchmark Packs

Procurement evaluation, vendor bake-off, pre-rollout QA

Scale: Smaller curated subsets

Learn more

RCA Custom Libraries

Your document types, your schema, your style profiles

Scale: Scope-dependent

Learn more

What every document ships with

A PDF in pdfs/ (clean, born-digital)
A scanned variant in pdfs_scanned/ (rotation, noise, JPEG artefacts)
A ground truth row in CSV (ground_truth.csv) and JSONL (ground_truth.jsonl)
A bounding box record in bboxes.jsonl with page index and field coordinates
A visible synthetic disclaimer rendered on every page

Insurance documents also ship with per-claim row bboxes (from claim_rows_json) and per-location row bboxes (from location_rows_json). Reviewers can click through to individual rows on the loss run and the statement of values, not just the document-level bbox.
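As a rough illustration of how the ground truth and bbox files might be used together, the sketch below joins each bbox record to its ground truth value by document_id. Every column and key name other than document_id (insured_name, policy_number, page, field, bbox) and the [x0, y0, x1, y1] coordinate order are assumptions made for the example; the pack README defines the actual schema.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample content. Real column and key names come from the pack README.
GROUND_TRUTH_CSV = """document_id,insured_name,policy_number
doc-0001,Example Pty Ltd,POL-123
doc-0002,Sample Holdings,POL-456
"""

BBOXES_JSONL = """{"document_id": "doc-0001", "page": 0, "field": "insured_name", "bbox": [72, 100, 300, 118]}
{"document_id": "doc-0001", "page": 0, "field": "policy_number", "bbox": [72, 130, 200, 148]}
{"document_id": "doc-0002", "page": 1, "field": "insured_name", "bbox": [60, 90, 280, 108]}
"""


def read_ground_truth(csv_text):
    """Return {document_id: row dict} from ground truth CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["document_id"]: row for row in reader}


def read_bboxes(jsonl_text):
    """Return {document_id: [bbox records]} from JSONL text, one record per line."""
    by_doc = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            by_doc[record["document_id"]].append(record)
    return dict(by_doc)


def join_fields(truth, boxes):
    """Pair every bbox record with the ground truth value for the same field."""
    joined = []
    for doc_id, row in truth.items():
        for record in boxes.get(doc_id, []):
            joined.append({
                "document_id": doc_id,
                "page": record["page"],
                "field": record["field"],
                "value": row.get(record["field"]),
                "bbox": record["bbox"],
            })
    return joined


truth = read_ground_truth(GROUND_TRUTH_CSV)
boxes = read_bboxes(BBOXES_JSONL)
for item in join_fields(truth, boxes):
    print(item["document_id"], item["page"], item["field"], item["value"], item["bbox"])
```

With real files, the same functions would take the text of ground_truth.csv and bboxes.jsonl in place of the sample strings.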

What every library ships with

A library-level manifest.json documenting document type distribution, pack composition (insurance), case mix (medical), red flag inventory (insurance), and per-document metadata
A splits.json with train / val / test allocation by document_id (ships with libraries above the standard QA scale)
A README.md explaining schema, regeneration commands, and the synthetic safety statement
A validation_summary.md confirming PDF / ground truth / bbox / scan integrity
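Before training on a library, it can be worth confirming that the split allocation has no leakage. The sketch below assumes splits.json maps each split name to a list of document_id values; that layout is a guess for illustration, and the README documents the real one.

```python
import json
from itertools import combinations

# Hypothetical splits.json content: split name -> list of document_id values.
SPLITS_JSON = '{"train": ["doc-0001", "doc-0002"], "val": ["doc-0003"], "test": ["doc-0004"]}'


def check_splits(splits, all_ids):
    """Return a list of problems: IDs shared between splits, or IDs left unassigned."""
    problems = []
    for left, right in combinations(sorted(splits), 2):
        overlap = set(splits[left]) & set(splits[right])
        if overlap:
            problems.append(f"{left}/{right} share {sorted(overlap)}")
    assigned = set().union(*splits.values())
    missing = set(all_ids) - assigned
    if missing:
        problems.append(f"unassigned: {sorted(missing)}")
    return problems


splits = json.loads(SPLITS_JSON)
document_ids = ["doc-0001", "doc-0002", "doc-0003", "doc-0004"]
print(check_splits(splits, document_ids) or "splits are disjoint and complete")
```

An empty result means every document_id sits in exactly one of train, val or test.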

How libraries are built

Libraries are built by a deterministic Python generator. Cases are curated by hand, not LLM-generated. Phrase banks supply narrative variety. Style profiles and template families control visual variety so models trained on the library cannot memorise a single layout. Builds are reproducible: the same seed produces the same PDFs every time.
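The idea of seeded variety can be sketched in a few lines. This is not the RCA generator: the style profile names, the phrase bank entries and the render_plan function are all invented for the example. It only shows how a per-case random stream derived from a seed gives the same choices on every run.

```python
import random

# Invented examples of a style profile list and a phrase bank.
STYLE_PROFILES = ["serif_letterhead", "sans_compact", "form_grid"]
PHRASE_BANK = [
    "Water damage reported at premises.",
    "Loss notified following storm event.",
    "Claim lodged after equipment failure.",
]


def render_plan(case_id, seed):
    """Pick a style profile and a narrative phrase for one hand-curated case.

    A dedicated Random instance seeded from the seed and case_id keeps each
    case's choices independent of generation order and of any other random use.
    """
    rng = random.Random(f"{seed}:{case_id}")
    return {
        "case_id": case_id,
        "style_profile": rng.choice(STYLE_PROFILES),
        "narrative": rng.choice(PHRASE_BANK),
    }


first = render_plan("case-01", seed=42)
second = render_plan("case-01", seed=42)
print(first == second)  # the same seed and case give the same plan
```

Seeding per case rather than once globally is a design choice in this sketch: adding or removing a case then leaves every other case's output unchanged.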

What you can do

Train, fine-tune, evaluate or QA document AI models
Stress-test extraction pipelines against varied layouts and scanned input
Demonstrate an internal extraction system to stakeholders using safe data
Run a procurement evaluation: shortlist vendors against the same ground truth
Build a regression suite for an existing extraction pipeline that is currently untested
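For evaluation or regression use, the core step is scoring pipeline output against the shipped ground truth. The sketch below computes per-field exact-match accuracy after light normalisation. The field names and sample values are made up, and exact match is only one possible metric; a real suite might add fuzzy matching or bbox overlap.

```python
def normalise(value):
    """Collapse whitespace and ignore case so trivial differences do not count as errors."""
    return " ".join(str(value).split()).casefold()


def field_accuracy(truth, predicted, fields):
    """Return {field: share of documents where the prediction matches ground truth}.

    A document missing from the predictions counts as wrong for every field.
    """
    scores = {}
    for field in fields:
        total = 0
        correct = 0
        for doc_id, row in truth.items():
            total += 1
            guess = predicted.get(doc_id, {}).get(field, "")
            if normalise(guess) == normalise(row[field]):
                correct += 1
        scores[field] = correct / total if total else 0.0
    return scores


# Made-up ground truth rows and pipeline output, keyed by document_id.
TRUTH = {
    "doc-0001": {"insured_name": "Example Pty Ltd", "policy_number": "POL-123"},
    "doc-0002": {"insured_name": "Sample Holdings", "policy_number": "POL-456"},
}
PREDICTED = {
    "doc-0001": {"insured_name": "example pty  ltd", "policy_number": "POL-123"},
    "doc-0002": {"insured_name": "Sample Holdings", "policy_number": "POL-466"},
}

print(field_accuracy(TRUTH, PREDICTED, ["insured_name", "policy_number"]))
```

Running the same scoring on the clean and the scanned variants separately would show how much accuracy a pipeline loses on degraded input.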

What libraries are not

They are not real patient or claimant data. They are not de-identified records. Nothing here is real.
They are not validated for clinical care, claims handling, underwriting, accounting, regulatory or legal use.
They are not statistically representative of any specific hospital, broker, insurer book or jurisdiction beyond the conventions documented in the README.

Get a free preview pack

Choose a two-pack insurance preview or a medical review pack of 25 to 35 documents. A five-minute review path is documented in the pack README.

Request a preview pack

See Benchmark Packs