Synthetic CMS-1500 Health Insurance Claim Form Data

Synthetic training data — no real PII, fully coherent identities

healthcare · 2024

Generate synthetic CMS-1500 health insurance claim forms with realistic patient demographics, diagnosis codes, and procedure data. The CMS-1500 is the standard healthcare claim form used across the U.S., which makes it core training data for healthcare document AI and claims processing pipelines.

Fields per document: 68
Pages: 1
Credits per identity: 1
Category: healthcare

What this document is

The CMS-1500 is the standard health insurance claim form used by physicians, clinics, and other non-institutional healthcare providers to bill Medicare, Medicaid, and private insurers across the United States. Its single-page layout packs 33 numbered boxes, several of which hold more than one sub-field, into a dense red-dropout grid designed for optical scanning. The boxes cover patient information, insured party details, diagnosis codes (ICD-10), procedure codes (CPT/HCPCS), and billing amounts.

Why generate synthetically

Healthcare claims processing is one of the largest document automation markets, and the CMS-1500 is its foundational form. Synthetic CMS-1500s make it possible to train OCR, field extraction, and claims validation models without the HIPAA compliance burden of handling real patient health information, which makes them a practical starting point for a healthcare intelligent document processing (IDP) pipeline.

What makes synthetic data useful

Each synthetic CMS-1500 describes a coherent patient encounter: diagnosis codes (Box 21) relate to the procedure codes (Box 24D), dates of service are consistent, and billing amounts reflect realistic charge patterns for the given procedures. Provider NPIs follow valid formats, and patient demographics, including insurance ID numbers, are fabricated but structurally valid.
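As an illustration of what a "valid format" means for a provider NPI, here is a minimal Python sketch of the standard check. An NPI is 10 digits, and its last digit is a Luhn check digit computed after prepending the constant prefix 80840. The function names are illustrative and are not part of any generator API; the sketch checks structure only and says nothing about whether an NPI is actually assigned to a provider.

```python
NPI_PREFIX = "80840"  # constant prefix prepended before the Luhn check


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def is_valid_npi(npi: str) -> bool:
    """Structural check only: 10 ASCII digits with a correct check digit."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    return luhn_valid(NPI_PREFIX + npi)
```

For example, `is_valid_npi("1234567893")` returns `True`, while changing the final digit to `0` makes the check fail.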

Training challenges

The CMS-1500's red dropout ink is designed for OCR scanners, yet it causes problems for modern ML models trained on standard black-and-white documents: the red grid lines are visible in color scans but vanish in black-and-white ones, so the same form produces two very different visual inputs. Box 24, the service line table, packs up to 6 rows of procedure codes, modifiers, diagnosis pointers, and charges into a grid whose column widths vary and whose cell boundaries are defined only by the red dropout pattern. Boxes 1-13 (patient and insured information) use a split-column layout in which the left and right halves of the page collect parallel data for the patient and the insured party, so models must assign each field to the correct entity. Finally, the diagnosis pointer field (Box 24E) uses single-letter references (A-L) to the diagnosis codes in Box 21, which requires relational extraction across non-adjacent form regions.
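The Box 24E cross-reference can be sketched in a few lines of Python. The sketch assumes the Box 21 codes have already been extracted into a list ordered A through L and that the pointer for a service line arrives as a string of letters such as "AC". Both the data layout and the function name are hypothetical, chosen only to show the lookup.

```python
import string

POINTER_LETTERS = string.ascii_uppercase[:12]  # "ABCDEFGHIJKL", one per Box 21 slot


def resolve_pointers(box21_codes: list[str], pointer: str) -> list[str]:
    """Map a Box 24E pointer string to the Box 21 diagnosis codes it references."""
    resolved = []
    for letter in pointer.strip().upper():
        index = POINTER_LETTERS.find(letter)
        if index == -1:
            raise ValueError(f"{letter!r} is not a valid pointer letter (A-L)")
        if index >= len(box21_codes):
            raise ValueError(f"pointer {letter!r} refers to an empty Box 21 slot")
        resolved.append(box21_codes[index])
    return resolved


# Illustrative values only: three diagnosis codes in slots A, B, and C.
codes = ["E11.9", "I10", "J06.9"]
print(resolve_pointers(codes, "AC"))
```

Raising an error on a pointer to an empty slot is one possible design choice; a validation model could instead flag such a line as an extraction error and continue.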

Generate synthetic CMS-1500 Health Insurance Claim Form data

Start with 250 free credits. No credit card required.


Frequently asked questions

What file formats do the synthetic CMS-1500 documents come in?
Each generated identity produces a filled PDF and a structured JSON annotation file containing bounding boxes and field values for all 68 fields on the single-page form.
Can I use this data commercially?
Yes. All synthetic data is generated from statistical models, contains no real PHI or PII, and is licensed for commercial use including ML model training and benchmarking.
How does the synthetic data differ from real CMS-1500 forms?
Synthetic CMS-1500s use fabricated patient identities, provider NPIs, and diagnosis/procedure code combinations. The data is clinically plausible but does not come from real patient encounters.
Does the synthetic data comply with HIPAA?
Yes. HIPAA protects health information about real, identifiable individuals. Because every record is synthetically generated and contains no real patient information, it can be stored, shared, and used in model training without HIPAA restrictions.
Are the diagnosis and procedure codes realistic?
Yes. The generator uses valid ICD-10 diagnosis codes and CPT/HCPCS procedure codes in clinically plausible combinations, so models learn code patterns that occur in real claims rather than artifacts specific to the synthetic data.