Pre-labelled. Pre-bbox'd. Pre-scanned.

Pre-labelled synthetic document libraries for document AI teams

Real-looking PDFs at scale, with ground truth, bounding boxes and scanned variants shipped alongside every document. Built by Root Cause Analytics in Sydney.

What every document ships with

  • A PDF in pdfs/ (clean, born-digital)
  • A scanned variant in pdfs_scanned/ (rotation, noise, JPEG artefacts)
  • A ground truth row in CSV (ground_truth.csv) and JSONL (ground_truth.jsonl)
  • A bounding box record in bboxes.jsonl with page index and field coordinates
  • A visible synthetic disclaimer rendered on every page

Insurance documents additionally ship per-claim row bboxes (from claim_rows_json) and per-location row bboxes (from location_rows_json). Reviewers can click through to individual rows on the loss run and statement of values, not just the document-level bbox.
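
As a rough illustration of how the per-document files fit together, the sketch below groups ground truth rows and bounding box records under a shared document key. It assumes each record carries a document_id field; that key name and the helper names read_jsonl and join_truth_and_bboxes are illustrative, so check the pack README for the actual schema.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def join_truth_and_bboxes(truth_records, bbox_records, key="document_id"):
    """Group ground truth and bbox records under a shared document key.

    The key name is an assumption; use whatever the pack README documents.
    """
    joined = {}
    for record in truth_records:
        joined[record[key]] = {"truth": record, "bboxes": []}
    for record in bbox_records:
        # A bbox record without a ground truth row still gets an entry,
        # with truth left as None so the gap is easy to spot.
        entry = joined.setdefault(record[key], {"truth": None, "bboxes": []})
        entry["bboxes"].append(record)
    return joined
```

Against a real pack this might be called as join_truth_and_bboxes(read_jsonl("ground_truth.jsonl"), read_jsonl("bboxes.jsonl")). Entries whose truth is None point to a bbox record with no matching ground truth row.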

What every library ships with

  • A library-level manifest.json documenting document type distribution, pack composition (insurance), case mix (medical), red flag inventory (insurance), and per-document metadata
  • A splits.json with train / val / test allocation by document_id (ships with libraries above the standard QA scale)
  • A README.md explaining schema, regeneration commands, and the synthetic safety statement
  • A validation_summary.md confirming PDF / ground truth / bbox / scan integrity
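
Because splits.json allocates documents by document_id, a quick leakage check is easy to write. The sketch below assumes the file maps each split name to a list of document IDs, for example {"train": [...], "val": [...], "test": [...]}. That shape is an assumption, not the documented schema, and the function names are illustrative.

```python
import json
from itertools import combinations


def load_splits(path="splits.json"):
    """Load the split allocation; assumed shape is {split_name: [document_id, ...]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_split_overlaps(splits):
    """Return document IDs that appear in more than one split.

    The result maps a pair of split names to the sorted IDs they share.
    An empty dict means no document leaks across splits.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = sorted(shared)
    return overlaps
```

An empty result from find_split_overlaps(load_splits()) indicates that no document_id is shared between train, val and test.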

How libraries are built

Libraries are produced by a deterministic Python generator. Cases are curated by hand, not LLM-generated. Phrase banks supply narrative variety. Style profiles and template families control visual variety, so models trained on the library cannot memorise a single layout. Generation is reproducible: the same seed produces the same PDFs every time.
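
The general idea of seeded, reproducible variety can be sketched in a few lines. This is not the actual generator: the phrase bank, style profile names and functions below are hypothetical stand-ins, and the sketch only plans documents rather than rendering PDFs.

```python
import random

# Hypothetical stand-ins for a phrase bank and a set of style profiles.
PHRASE_BANK = [
    "Water damage to ground floor",
    "Vehicle collision at intersection",
    "Theft from storage unit",
]
STYLE_PROFILES = ["serif_two_column", "sans_single_column", "compact_table"]


def plan_document(seed, index):
    """Pick a narrative phrase and a style profile for one document."""
    # A private Random instance per document avoids shared global state,
    # so the plan for document N does not depend on generation order.
    rng = random.Random(f"{seed}:{index}")
    return {
        "document_id": f"doc-{index:05d}",
        "narrative": rng.choice(PHRASE_BANK),
        "style": rng.choice(STYLE_PROFILES),
    }


def plan_library(seed, count):
    """Plan a whole library; the same seed and count give the same plan."""
    return [plan_document(seed, index) for index in range(count)]
```

Seeding each document from the library seed plus its index is one design choice among several. It means a single document can be regenerated without replaying the whole run.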

What you can do

  • Train, fine-tune, evaluate or QA document AI models
  • Stress-test extraction pipelines against varied layouts and scanned input
  • Demonstrate an internal extraction system to stakeholders using safe data
  • Run a procurement evaluation: shortlist vendors against the same ground truth
  • Build a regression suite for an existing extraction pipeline that currently has no tests
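
For evaluation, vendor comparison or regression testing, the core step is scoring extracted fields against the ground truth. The sketch below computes per-field exact-match accuracy after light normalisation. The field names in the usage note, the document_id key and the normalisation rule are all assumptions; a real suite would likely need field-specific comparison for dates and amounts.

```python
def normalise(value):
    """Compare values as trimmed, case-insensitive strings; None becomes ''."""
    return "" if value is None else str(value).strip().casefold()


def field_accuracy(truth_rows, predicted_rows, fields, key="document_id"):
    """Return per-field exact-match accuracy over the ground truth documents.

    Documents with no prediction count as wrong for every field.
    """
    predictions = {row[key]: row for row in predicted_rows}
    correct = {field: 0 for field in fields}
    for truth in truth_rows:
        predicted = predictions.get(truth[key])
        if predicted is None:
            continue  # missing document: no field is credited
        for field in fields:
            if normalise(predicted.get(field)) == normalise(truth.get(field)):
                correct[field] += 1
    total = len(truth_rows)
    return {field: (correct[field] / total if total else 0.0) for field in fields}
```

In a regression suite, the returned scores could be asserted against a stored baseline, for example failing the build if accuracy on a hypothetical insured_name field drops below its previous value.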

What libraries are not

  • They are not real patient or claimant data. They are not de-identified records. Nothing here is real.
  • They are not validated for clinical care, claims handling, underwriting, accounting, regulatory or legal use.
  • They are not statistically representative of any specific hospital, broker, insurer book or jurisdiction beyond the conventions documented in the README.

Get a free preview pack

Choose a two-pack insurance preview or a medical review pack of 25 to 35 documents. The pack README documents a five-minute review path.