The difference between a good extraction schema and a bad one can be the difference between 95% accuracy and 70%. A well-designed schema guides the AI to extract exactly what you need, nothing more, nothing less.
The Golden Rule: Specificity > Generality
The most common mistake is creating overly generic schemas. Instead of asking for "date", specify "invoice_date", "due_date", or "service_date". The more specific you are, the better the extraction.
Bad:
{
  "date": "string",
  "amount": "number",
  "name": "string"
}
Good:
{
  "invoice_date": "date",
  "total_amount": "float",
  "vendor_name": "string",
  "customer_name": "string"
}
Use Enums for Categorical Fields
When a field has a limited set of possible values, always use an enum. This dramatically improves accuracy and makes validation easier.
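To illustrate the validation side, here is a minimal Python sketch. It assumes a field definition shaped like the enum entries used in this article (a "type" of "enum" plus a "values" list); `PAYMENT_METHOD` and `validate_enum` are hypothetical names for illustration, not part of any particular API.

```python
from typing import Any

# Hypothetical field definition, shaped like the enum entries in this article.
PAYMENT_METHOD = {
    "type": "enum",
    "values": ["ACH", "Wire", "Check", "Credit Card"],
    "description": "How payment was made",
}


def validate_enum(field_def: dict[str, Any], extracted: Any) -> bool:
    """Return True if the extracted value is one of the allowed enum values."""
    if field_def.get("type") != "enum":
        raise ValueError("field definition is not an enum")
    return extracted in field_def["values"]


print(validate_enum(PAYMENT_METHOD, "Wire"))  # True
print(validate_enum(PAYMENT_METHOD, "wire"))  # False: the comparison is case-sensitive
```

Because the allowed values are listed up front, the check is a single membership test. A free-text field would need fuzzy matching or manual review instead. The enum fields themselves are declared as in the following schema: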
{
  "document_type": {
    "type": "enum",
    "values": ["W-2", "1099-NEC", "1099-MISC", "W-9"],
    "description": "The IRS form type"
  },
  "payment_method": {
    "type": "enum",
    "values": ["ACH", "Wire", "Check", "Credit Card"],
    "description": "How payment was made"
  },
  "insurance_type": {
    "type": "enum",
    "values": ["Health", "Dental", "Vision", "Life", "Disability"]
  }
}
Add Descriptions (Seriously)
Never skip field descriptions. They're not just documentation — they're instructions for the AI. A good description can be the difference between correct and incorrect extraction.
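One way to enforce this rule is a small lint pass over the schema before it ships. The Python sketch below is an illustration under stated assumptions, not a feature of any specific tool: it treats a dict whose "type" is a string as a full field definition, a bare string such as "string" as shorthand with no description, and any other dict as a nested object. The function names are hypothetical.

```python
from typing import Any, Iterator


def is_field_def(node: Any) -> bool:
    """Assumed convention: a full field definition is a dict whose "type" is a string."""
    return isinstance(node, dict) and isinstance(node.get("type"), str)


def fields_missing_description(schema: Any, path: str = "") -> Iterator[str]:
    """Yield dotted paths of fields that carry no description."""
    if isinstance(schema, str):
        # Shorthand such as "name": "string" has no room for a description.
        yield path
    elif is_field_def(schema):
        if not schema.get("description", "").strip():
            yield path
    elif isinstance(schema, dict):
        # Nested object: recurse into each child field.
        for key, child in schema.items():
            yield from fields_missing_description(child, f"{path}.{key}" if path else key)
    elif isinstance(schema, list):
        # Array of items: recurse into the item template.
        for child in schema:
            yield from fields_missing_description(child, f"{path}[]")


example = {
    "gross_income": {"type": "float", "description": "Total income before taxes"},
    "employer_ein": {"type": "string"},
    "vendor": {"name": "string"},
}
print(list(fields_missing_description(example)))  # ['employer_ein', 'vendor.name']
```

A nested object that itself contains a shorthand field literally named "type" would be misread as a field definition under this convention, so treat the check as a heuristic. Well-described fields look like this: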
{
  "gross_income": {
    "type": "float",
    "description": "Total income before taxes and deductions (Box 1 on W-2)"
  },
  "employer_ein": {
    "type": "string",
    "description": "Employer Identification Number, 9-digit format XX-XXXXXXX"
  },
  "pay_period_end": {
    "type": "date",
    "description": "Last day of the pay period, not the payment date"
  }
}
Nested Objects for Complex Data
Don't flatten everything to top-level fields. Use nested objects to represent real-world relationships.
{
  "invoice_number": "string",
  "invoice_date": "date",
  "vendor": {
    "name": "string",
    "address": "string",
    "tax_id": "string",
    "contact": {
      "name": "string",
      "email": "string",
      "phone": "string"
    }
  },
  "line_items": [
    {
      "description": "string",
      "quantity": "float",
      "unit_price": "float",
      "total": "float"
    }
  ],
  "totals": {
    "subtotal": "float",
    "tax": "float",
    "shipping": "float",
    "total": "float"
  }
}
Real-World Examples
Financial Services: Pay Stub Schema
{
  "employee": {
    "name": "string",
    "employee_id": "string",
    "ssn_last_4": "string"
  },
  "employer": {
    "name": "string",
    "address": "string",
    "ein": "string"
  },
  "pay_period": {
    "start_date": "date",
    "end_date": "date",
    "pay_date": "date"
  },
  "earnings": {
    "regular_hours": "float",
    "overtime_hours": "float",
    "regular_rate": "float",
    "overtime_rate": "float",
    "gross_pay": "float"
  },
  "deductions": {
    "federal_tax": "float",
    "state_tax": "float",
    "social_security": "float",
    "medicare": "float",
    "health_insurance": "float",
    "retirement_401k": "float"
  },
  "ytd": {
    "gross_earnings": "float",
    "federal_tax": "float",
    "state_tax": "float",
    "net_pay": "float"
  },
  "net_pay": "float"
}
Healthcare: Medical Record Schema
{
  "patient": {
    "name": "string",
    "dob": "date",
    "mrn": "string",
    "insurance_id": "string"
  },
  "visit": {
    "date": "date",
    "provider": "string",
    "facility": "string",
    "visit_type": {
      "type": "enum",
      "values": ["Inpatient", "Outpatient", "Emergency", "Telehealth"]
    }
  },
  "vitals": {
    "blood_pressure_systolic": "integer",
    "blood_pressure_diastolic": "integer",
    "heart_rate": "integer",
    "temperature": "float",
    "respiratory_rate": "integer"
  },
  "diagnoses": [
    {
      "icd_10_code": "string",
      "description": "string",
      "type": {
        "type": "enum",
        "values": ["Primary", "Secondary", "Comorbidity"]
      }
    }
  ],
  "medications": [
    {
      "name": "string",
      "dosage": "string",
      "frequency": "string",
      "route": "string"
    }
  ]
}
Common Pitfalls to Avoid
🚨 Don't Mix Semantic Levels
Bad: { "name": "...", "address_line_1": "..." }
Good: { "name": "...", "address": { "line_1": "..." } }
🚨 Don't Use Ambiguous Names
Bad: "value", "amount", "total"
Good: "invoice_total", "line_item_amount", "tax_value"
🚨 Don't Over-Nest
If you're 5+ levels deep, you're probably doing it wrong. Aim for 2-3 levels max.
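The naming and nesting pitfalls can be checked mechanically. The following Python sketch is one possible approach, assuming the schema conventions used in this article (a dict with a string "type" is a field definition, any other dict is a nested object, and arrays do not add a level). `AMBIGUOUS_NAMES`, `max_depth`, and `ambiguous_top_level_names` are hypothetical helpers.

```python
from typing import Any

# Generic names called out above; extend the set to suit your own documents.
AMBIGUOUS_NAMES = {"value", "amount", "total"}


def max_depth(schema: Any) -> int:
    """Count nested object levels; a flat object counts as 1."""
    if isinstance(schema, list):
        # Arrays are transparent: measure the item template instead.
        return max((max_depth(item) for item in schema), default=0)
    if isinstance(schema, dict) and not isinstance(schema.get("type"), str):
        return 1 + max((max_depth(child) for child in schema.values()), default=0)
    return 0  # shorthand type string or full field definition


def ambiguous_top_level_names(schema: dict[str, Any]) -> list[str]:
    """Return top-level field names that are too generic to guide extraction."""
    return sorted(set(schema) & AMBIGUOUS_NAMES)


invoice = {
    "total": "float",
    "vendor": {"name": "string", "contact": {"email": "string"}},
    "line_items": [{"description": "string", "quantity": "float"}],
}
print(max_depth(invoice))                  # 3
print(ambiguous_top_level_names(invoice))  # ['total']
```

Only top-level names are flagged here, on the assumption that a generic name inside a nested object (such as a vendor's "name") already gets its context from the parent.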
Testing Your Schema
Before deploying to production:
- Test with 20-50 real documents
- Check for null/missing values — adjust descriptions if needed
- Look for fields that are consistently wrong — they need better descriptions
- Validate against edge cases (handwritten, poor scans, unusual formats)
- Use Retriv's confidence scores to identify problematic fields
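The first three checks can be tallied with a short script. This Python sketch assumes each extraction result and each hand-labeled expected record is a flat dict keyed by field name. That layout is an assumption for illustration and is not taken from Retriv's output format, and confidence scores are left out because their structure is not described here.

```python
from collections import defaultdict
from typing import Any


def field_report(
    results: list[dict[str, Any]], expected: list[dict[str, Any]]
) -> dict[str, dict[str, float]]:
    """Per-field missing rate and accuracy across a labeled test set."""
    counts = defaultdict(lambda: {"total": 0, "missing": 0, "correct": 0})
    for got, want in zip(results, expected):
        for field, want_value in want.items():
            stats = counts[field]
            stats["total"] += 1
            got_value = got.get(field)
            if got_value is None:
                stats["missing"] += 1  # null or absent in the extraction
            elif got_value == want_value:
                stats["correct"] += 1
    return {
        field: {
            "missing_rate": s["missing"] / s["total"],
            "accuracy": s["correct"] / s["total"],
        }
        for field, s in counts.items()
    }


# Two hypothetical documents with hand-labeled expected values.
results = [
    {"invoice_date": "2024-01-05", "total_amount": 100.0},
    {"invoice_date": None, "total_amount": 250.0},
]
expected = [
    {"invoice_date": "2024-01-05", "total_amount": 100.0},
    {"invoice_date": "2024-02-01", "total_amount": 205.0},
]
for field, stats in field_report(results, expected).items():
    print(field, stats)
```

Fields with a high missing rate or low accuracy are the first candidates for a more precise description or a more specific name.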
Need Help with Your Schema?
Our solutions engineers can review your schema and provide recommendations.