Ground truth, bounding boxes, scanned variants.

Pre-labelled synthetic document libraries

Real-looking PDFs at scale. Ground truth, bounding boxes and scanned variants shipped alongside every document. Built by Root Cause Analytics in Sydney.

Real samples

Real pages from the libraries

Browse representative documents from the RCA Insurance and Medical libraries. Same generator stack, different document types. Every page ships with ground truth, bounding boxes, a scanned variant, and a visible synthetic disclaimer.

RCA Insurance Library samples: broker submission email, loss run report, statement of values, policy schedule, first notice of loss.
RCA Medical Library samples: discharge summary, ED assessment, referral letter, imaging report, pathology report.

Pricing

Each library has the same four tiers: a free sample, a paid pack (QA Sprint Pack for Insurance, Pilot pack for Medical), a production library and a training library. Every tier ships same-day. The generator is a deterministic Python pipeline that produces a complete library in minutes, not weeks.

RCA Insurance Library

Tier | Size | Best for | Price
Free sample | 2 submission packs | First look. Review the schema and disclaimer. | Free
QA Sprint Pack | 10 submission packs + red flag summary + 30-min handover | Pipeline QA. Vendor evaluation. | AUD $2,500
Production library | 100+ submission packs | Production regression suite. Internal QA at scale. | Contact for quote
Training library | 1,000+ submission packs with train / val / test splits | ML model fine-tuning at scale. | Contact for quote

RCA Medical Library

Tier | Size | Best for | Price
Free sample | 25 to 35 documents | First look. Review the schema, AU conventions and disclaimer. | Free
Pilot pack | 100 to 200 documents scoped to your specialty | Internal pilot. Specialty-focused review. | Contact for quote
Production library | 500 to 1,000 documents across 40+ types | Production regression suite. Internal QA at scale. | Contact for quote
Training library | 5,000+ documents with train / val / test splits | ML model fine-tuning at scale. | Contact for quote

Every order ships same-day. In one pass, the generator produces ground truth (CSV + JSONL), bounding box records, scanned variants, a manifest, and train / val / test splits where applicable. Custom document types or schemas are quoted via RCA Custom Libraries.

What ships with every order

Every PDF lands with its labels. Every library lands with its manifest.

Per document
  • Clean PDF in pdfs/
  • Scanned variant in pdfs_scanned/ (rotation, noise, JPEG)
  • Ground truth row in CSV and JSONL
  • Bounding box record in bboxes.jsonl
  • Visible synthetic disclaimer on every page
Per library
  • manifest.json with document-type distribution and metadata
  • splits.json with train / val / test allocation
  • README.md with schema and regeneration commands
  • validation_summary.md confirming integrity checks
  • license_summary.md confirming the synthetic-only restriction
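
The file layout above lends itself to a quick consistency check on the receiving side. The following is a minimal sketch, not RCA's own validation code. It assumes that bboxes.jsonl and splits.json sit at the library root, that each bounding box record carries a hypothetical doc_id field, that files are named after that id, and that splits.json maps a split name to a list of ids. The shipped README.md is the authority on the real schema.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def check_library(root, id_field="doc_id"):
    """Cross-check a delivered library folder and return a list of problems.

    Assumptions (not confirmed by the source): `id_field` exists in every
    bboxes.jsonl record, file stems equal that id, and splits.json has the
    shape {"train": [...], "val": [...], "test": [...]}.
    """
    root = Path(root)
    clean = {p.stem for p in (root / "pdfs").glob("*.pdf")}
    # File extension is ignored here, since the scanned format is not specified.
    scanned = {p.stem for p in (root / "pdfs_scanned").iterdir()}
    boxed = {rec[id_field] for rec in read_jsonl(root / "bboxes.jsonl")}

    problems = []
    for doc in sorted(clean - scanned):
        problems.append(f"{doc}: no scanned variant")
    for doc in sorted(clean - boxed):
        problems.append(f"{doc}: no bounding box record")

    # splits.json only ships "where applicable", so treat it as optional.
    splits_path = root / "splits.json"
    if splits_path.exists():
        splits = json.loads(splits_path.read_text(encoding="utf-8"))
        assigned = [doc for ids in splits.values() for doc in ids]
        for doc in sorted(clean - set(assigned)):
            problems.append(f"{doc}: not in any split")
        if len(assigned) != len(set(assigned)):
            problems.append("a document appears in more than one split")
    return problems
```

Calling check_library("path/to/library") returns an empty list when every clean PDF has a scanned variant, a bounding box record and, if splits are present, exactly one split assignment.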

Try a free preview pack

Two-pack insurance preview, or a 25 to 35 document medical review pack.

Same-day delivery. Direct from Sydney, Australia.