Synthetic. Not real broker, insurer or claimant data.

RCA Insurance Library

Synthetic commercial P&C submission packs

Pre-labelled broker submissions for QA, evaluation and training of document extraction pipelines. Built and shipped by Root Cause Analytics.

What is in the library

Each submission pack is a complete broker submission as you would receive it in a real underwriting inbox: cover note, attachments, supporting forms. Pack composition varies by submission type (new business, renewal with claims, FNOL).

Broker submission email

Cover note, named attachments, broker signature block

Loss run report

Last 5 years of claims, per-claim rows, displayed totals, status

Statement of values

Per-location rows, building, contents and BI values, displayed totals

Policy schedule

Insurer schedule with limits, deductibles, endorsements

Certificate of currency

Broker-issued confirmation of cover

Insurance application

New business questionnaire

FNOL form

First notice of loss form

Claim report

Incumbent renewal claim narrative

Engineered red flags

A subset of packs are deliberately broken: cross-document inconsistencies we have seen in real submissions, engineered in at known positions so your extraction or validation pipeline has a controlled target to flag.

Loss run total mismatch

Displayed total disagrees with the sum of the claim rows

Statement of values total mismatch

Displayed total disagrees with the sum of the location rows

Missing attachment

The broker email lists a doc that is not in the pack

ABN formatting inconsistency

Same ABN formatted differently across documents in the same pack

Policy number mismatch

Certificate of currency disagrees with the policy schedule

Location address mismatch

Statement of values address disagrees with the policy schedule

Claim after policy end

A loss date is outside the policy period

Currency mismatch

A non-AUD currency on a single location row inside an otherwise AUD submission
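Several of these flags reduce to simple cross-document checks. The following is a minimal sketch, not the library's own validator: the function names, the `incurred` key and the inclusive treatment of the policy period are assumptions made for illustration, and the real ground truth schema may differ.

```python
from datetime import date
from decimal import Decimal
import re


def total_mismatch(rows, displayed_total, key="incurred"):
    """Return True when the displayed total disagrees with the sum of the rows.

    `key` is a hypothetical column name; use whichever column the total covers.
    """
    row_sum = sum(Decimal(str(row[key])) for row in rows)
    return row_sum != Decimal(str(displayed_total))


def normalise_abn(raw):
    """Strip everything except digits, so '12 345 678 901' equals '12345678901'."""
    return re.sub(r"\D", "", raw)


def abn_format_inconsistent(abns):
    """True when one ABN appears with more than one formatting in a pack."""
    same_digits = len({normalise_abn(a) for a in abns}) == 1
    return same_digits and len(set(abns)) > 1


def loss_outside_period(loss_date, period_start, period_end):
    """True when a loss date falls outside the policy period (bounds inclusive)."""
    return not (period_start <= loss_date <= period_end)


# Illustrative values only, not taken from any pack.
claim_rows = [{"incurred": "1200.50"}, {"incurred": "800.00"}]
print(total_mismatch(claim_rows, "2100.50"))
print(abn_format_inconsistent(["12 345 678 901", "12345678901"]))
print(loss_outside_period(date(2025, 7, 1), date(2024, 7, 1), date(2025, 6, 30)))
```

Using `Decimal` rather than `float` avoids spurious mismatches caused by binary rounding when summing dollar amounts.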

The red flag inventory ships as red_flags_summary.csv with each pack. The CSV includes a where_to_review column that points to the two documents to compare, which makes it the most useful artefact for QA workflows.
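As a rough sketch of how a QA script might consume this file, the snippet below groups flags by their where_to_review value. Only the where_to_review column name comes from the description above; the other column names (`pack_id`, `red_flag`), the sample rows and the value format are hypothetical stand-ins.

```python
import csv
import io
from collections import defaultdict


def group_flags_by_review_target(csv_file):
    """Group red flag rows by the documents a reviewer should compare."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["where_to_review"]].append(row)
    return dict(grouped)


# Hypothetical sample content; a real red_flags_summary.csv may look different.
sample = io.StringIO(
    "pack_id,red_flag,where_to_review\n"
    "pack_001,policy_number_mismatch,certificate_of_currency vs policy_schedule\n"
    "pack_002,policy_number_mismatch,certificate_of_currency vs policy_schedule\n"
    "pack_002,location_address_mismatch,statement_of_values vs policy_schedule\n"
)

for target, rows in group_flags_by_review_target(sample).items():
    print(target, len(rows))
```

For a file on disk, pass `open("red_flags_summary.csv", newline="")` in place of the in-memory sample.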

Bbox structure: per-row, not just per-document

Most synthetic libraries return one bounding box per document. The RCA Insurance Library returns a bbox for every labelled field in the document, plus a per-row bbox for every claim in claim_rows_json and every location in location_rows_json.

As a result, a LayoutLMv3 or Donut fine-tune gets per-claim and per-location supervision rather than a single document-level target, and a reviewer can click any row in the structured ground truth to highlight the exact pixels on the rendered PDF.
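To make the per-row idea concrete, here is a small sketch that flattens a claim_rows_json value into one record per labelled cell. The JSON layout used here (a list of rows, each with `row_index` and a `fields` map of `value` and `bbox`) is an assumption for illustration; the shipped schema is not documented on this page and may differ.

```python
import json


def iter_row_bboxes(rows_json):
    """Yield (row_index, sub_key, bbox) for every labelled cell in every row."""
    for row in json.loads(rows_json):
        for sub_key, cell in row["fields"].items():
            yield row["row_index"], sub_key, tuple(cell["bbox"])


# Hypothetical payload with two claim rows and two sub-keys each.
claim_rows_json = json.dumps([
    {"row_index": 0, "fields": {
        "status": {"value": "Closed", "bbox": [300, 210, 340, 222]},
        "incurred": {"value": "1200.50", "bbox": [480, 210, 530, 222]},
    }},
    {"row_index": 1, "fields": {
        "status": {"value": "Open", "bbox": [300, 228, 330, 240]},
        "incurred": {"value": "800.00", "bbox": [480, 228, 525, 240]},
    }},
])

for row_index, sub_key, bbox in iter_row_bboxes(claim_rows_json):
    print(row_index, sub_key, bbox)
```

Records in this flat form map directly onto token-level labels for a layout model, or onto highlight rectangles in a review tool.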

Example: loss_run_report

Clean PDF: Clean synthetic loss run report PDF showing the insured business, ABN, policy number, period dates, displayed totals, and four claim rows with category, description, status, paid, reserve and incurred columns. Visible synthetic disclaimer in the header and footer.

Labelled fields overlay (66 bboxes): The same loss run report with every labelled field outlined. Red outlines mark document-level scalars. Teal outlines mark per-row entries from claim_rows_json. Each of the four claim rows has its own eight sub-key bboxes.

Same shape on statements of values

Per-location rows from location_rows_json get the same treatment. Each address, occupancy, building value, contents value, stock value and BI value lands as its own bbox keyed by row index. A statement of values with four sites ships roughly 41 labelled-field bboxes.

Example: statement_of_values

Clean PDF: Clean synthetic statement of values PDF showing the insured business and four location rows with address, occupancy, building value, contents value, stock value, BI value and declared total per site.

Labelled fields overlay (41 bboxes): The same statement of values PDF with every labelled field outlined in red and teal. Per-location entries from location_rows_json each have their own bbox per sub-key (address, occupancy, building_value, contents_value, stock_value, business_interruption_value).
Red outlines: document-level scalar fields (identity, policy, period dates, displayed totals).

Teal outlines: per-row entries from claim_rows_json and location_rows_json. Each row has its sub-keys preserved with row_index.

Footnote: bbox coordinates are PDF points. When a value wraps across two lines in a narrow table cell, the bbox spans both lines using a word-grouped fallback.
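Because a PDF point is 1/72 inch, mapping a bbox onto a rasterised page is a single scale factor. The sketch below shows that conversion, plus one plausible reading of the word-grouped fallback as the union of the per-line boxes. The `(x0, y0, x1, y1)` ordering, the optional y-axis flip and both function names are assumptions; check the pack's own conventions before relying on them.

```python
def points_to_pixels(bbox_pt, dpi, page_height_pt=None):
    """Convert a PDF-point bbox (x0, y0, x1, y1) to pixel coordinates at `dpi`.

    If `page_height_pt` is given, the bbox is assumed to use a bottom-left
    origin and is flipped to the top-left origin used by most image libraries.
    """
    x0, y0, x1, y1 = bbox_pt
    if page_height_pt is not None:
        y0, y1 = page_height_pt - y1, page_height_pt - y0
    scale = dpi / 72.0  # 72 PDF points per inch
    return tuple(round(v * scale) for v in (x0, y0, x1, y1))


def union_bbox(boxes):
    """Smallest box covering all input boxes, e.g. the two lines of a wrapped value."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


# A value wrapped over two lines in a narrow cell (illustrative coordinates).
wrapped = union_bbox([(10, 10, 50, 20), (10, 22, 30, 32)])
print(wrapped)
print(points_to_pixels(wrapped, dpi=144))
```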

Diversity controls

Each PDF is rendered with a deterministically chosen style profile. Every profile is modelled on a real underwriting-inbox archetype:

broker_formal
broker_modern
broker_email_printout
insurer_legacy_system
underwriting_agency_clean
bordereaux_like
spreadsheet_export
claims_system_export

Each document type has three named template families that vary header / footer / section ordering without changing field labels or ground truth values. The chosen profile and family are recorded per row in the ground truth.
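One common way to get a deterministic choice like this is to hash a stable identifier and index into the profile list. The sketch below illustrates that general technique only; it is not a description of how the library actually selects profiles, and the `document_id` argument and function name are hypothetical.

```python
import hashlib

# Profile names as listed above.
STYLE_PROFILES = [
    "broker_formal",
    "broker_modern",
    "broker_email_printout",
    "insurer_legacy_system",
    "underwriting_agency_clean",
    "bordereaux_like",
    "spreadsheet_export",
    "claims_system_export",
]


def choose_style_profile(document_id):
    """Map a stable identifier to a profile; the same id always gives the same profile."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(STYLE_PROFILES)
    return STYLE_PROFILES[index]


# Repeated calls with the same (hypothetical) id agree with each other.
print(choose_style_profile("pack_001/loss_run_report"))
print(choose_style_profile("pack_001/loss_run_report"))
```

A hash is used instead of Python's built-in `hash()` because the latter is salted per process for strings and would not be reproducible across runs.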

Pricing

Tier | Scale | Price | Delivery
Free preview | 2 packs | Free | Same day on request
QA Sprint Pack | 10 packs + red flag summary + 30-min handover | AUD $2,500 | 48 to 72 hours
QA library | 25, 100, 500 packs | On request | Scoped per order
Bulk training library | 5,000+ packs | On request | Scoped per order
Custom variants | Your document types or red flag set | On request | Scoped per order

Synthetic safety

Every PDF carries a visible synthetic disclaimer on every page. All broker names, insurer names, insured business names, ABNs, addresses, phone numbers, policy numbers, claim numbers and dollar values are computer-generated and do not refer to any real organisation, broker, insurer or claim.

Not for underwriting, claims handling, accounting, or regulatory use.

Try the free 2-pack preview

Two complete submission packs, ground truth, bboxes and scanned variants. The pack ships with a five-minute review path.