TUTORIAL

Schema Design Best Practices for Document Extraction

By Alex Rivera · October 28, 2024 · 10 min read

The difference between a good extraction schema and a bad one can be the difference between 95% accuracy and 70%. A well-designed schema guides the AI to extract exactly the fields you need and nothing else.

The Golden Rule: Specificity > Generality

The most common mistake is creating overly generic schemas. Instead of asking for "date", specify "invoice_date", "due_date", or "service_date". The more specific you are, the better the extraction.

❌ Bad Schema
{
  "date": "string",
  "amount": "number",
  "name": "string"
}
Too vague. Which date? Whose name? What amount?
✅ Good Schema
{
  "invoice_date": "date",
  "total_amount": "float",
  "vendor_name": "string",
  "customer_name": "string"
}
Clear, specific, unambiguous.
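
One way to catch vague names before they reach production is a small lint pass over the schema. The sketch below is illustrative only: `GENERIC_NAMES` and `find_generic_fields` are hypothetical names, not part of any Retriv API, and the word list is an assumption you would tune for your own documents.

```python
# Hypothetical linter: the word list is an assumption, adjust it to your domain.
GENERIC_NAMES = {"date", "amount", "name", "value", "total", "number", "id"}


def find_generic_fields(schema: dict) -> list[str]:
    """Return the top-level field names that are too generic to be unambiguous."""
    return sorted(name for name in schema if name.lower() in GENERIC_NAMES)


bad_schema = {"date": "string", "amount": "number", "name": "string"}
good_schema = {
    "invoice_date": "date",
    "total_amount": "float",
    "vendor_name": "string",
    "customer_name": "string",
}

print(find_generic_fields(bad_schema))   # ['amount', 'date', 'name']
print(find_generic_fields(good_schema))  # []
```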

Use Enums for Categorical Fields

When a field has a limited set of possible values, use an enum. Restricting the output to a known list improves accuracy and makes validation easier.

{
  "document_type": {
    "type": "enum",
    "values": ["W-2", "1099-NEC", "1099-MISC", "W-9"],
    "description": "The IRS form type"
  },
  "payment_method": {
    "type": "enum",
    "values": ["ACH", "Wire", "Check", "Credit Card"],
    "description": "How payment was made"
  },
  "insurance_type": {
    "type": "enum",
    "values": ["Health", "Dental", "Vision", "Life", "Disability"],
    "description": "The type of insurance coverage"
  }
}
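
The validation benefit can be sketched in a few lines. This assumes the enum fields use the `type` / `values` layout shown above; `validate_enum` is a hypothetical helper, not a Retriv function, and it does an exact, case-sensitive comparison.

```python
def validate_enum(field_spec: dict, value: str) -> bool:
    """Return True if value is one of the enum's allowed values (exact match)."""
    if field_spec.get("type") != "enum":
        raise ValueError("field is not an enum")
    return value in field_spec["values"]


payment_method = {
    "type": "enum",
    "values": ["ACH", "Wire", "Check", "Credit Card"],
    "description": "How payment was made",
}

print(validate_enum(payment_method, "Wire"))           # True
print(validate_enum(payment_method, "wire transfer"))  # False
```

A value that fails this check is a good candidate for manual review rather than silent acceptance.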

Add Descriptions (Seriously)

Never skip field descriptions. They're not just documentation — they're instructions for the AI. A good description can be the difference between correct and incorrect extraction.

{
  "taxable_wages": {
    "type": "float",
    "description": "Wages, tips, and other compensation (Box 1 on W-2); excludes pre-tax deductions such as 401(k) contributions"
  },
  "employer_ein": {
    "type": "string",
    "description": "Employer Identification Number, 9-digit format XX-XXXXXXX"
  },
  "pay_period_end": {
    "type": "date",
    "description": "Last day of the pay period, not the payment date"
  }
}
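
Since descriptions are easy to forget, it can help to check for them automatically. The following is a minimal sketch, assuming a field is either a shorthand type string or an object with a string `type` key, as in the examples in this article; `missing_descriptions` is a hypothetical helper.

```python
def is_field_spec(node) -> bool:
    """A field spec is a dict whose "type" key holds a type name string."""
    return isinstance(node, dict) and isinstance(node.get("type"), str)


def missing_descriptions(schema: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of fields that have no non-empty description."""
    missing = []
    for name, node in schema.items():
        path = f"{prefix}{name}"
        if isinstance(node, str):
            missing.append(path)  # shorthand form cannot carry a description
        elif is_field_spec(node):
            if not node.get("description", "").strip():
                missing.append(path)
        elif isinstance(node, dict):
            missing.extend(missing_descriptions(node, path + "."))
        elif isinstance(node, list) and node and isinstance(node[0], dict):
            missing.extend(missing_descriptions(node[0], path + "[]."))
    return missing


schema = {
    "pay_period_end": {
        "type": "date",
        "description": "Last day of the pay period, not the payment date",
    },
    "employer_ein": {"type": "string"},
    "vendor": {"name": "string"},
}

print(missing_descriptions(schema))  # ['employer_ein', 'vendor.name']
```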

Nested Objects for Complex Data

Don't flatten everything to top-level fields. Use nested objects to represent real-world relationships.

{
  "invoice_number": "string",
  "invoice_date": "date",
  "vendor": {
    "name": "string",
    "address": "string",
    "tax_id": "string",
    "contact": {
      "name": "string",
      "email": "string",
      "phone": "string"
    }
  },
  "line_items": [
    {
      "description": "string",
      "quantity": "float",
      "unit_price": "float",
      "total": "float"
    }
  ],
  "totals": {
    "subtotal": "float",
    "tax": "float",
    "shipping": "float",
    "total": "float"
  }
}
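
A side benefit of keeping line items and totals as separate nested structures is that the extracted numbers can be cross-checked against each other. The sketch below assumes an extraction result shaped like the schema above, with no discounts or other adjustments; `check_invoice_totals` and the 0.01 tolerance are illustrative choices, not part of any Retriv API.

```python
import math


def check_invoice_totals(invoice: dict, tol: float = 0.01) -> list[str]:
    """Return the paths of totals that do not add up, within a small tolerance."""
    problems = []
    for i, item in enumerate(invoice["line_items"]):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tol):
            problems.append(f"line_items[{i}].total")

    totals = invoice["totals"]
    line_sum = sum(item["total"] for item in invoice["line_items"])
    if not math.isclose(line_sum, totals["subtotal"], abs_tol=tol):
        problems.append("totals.subtotal")

    grand_total = totals["subtotal"] + totals["tax"] + totals["shipping"]
    if not math.isclose(grand_total, totals["total"], abs_tol=tol):
        problems.append("totals.total")
    return problems


invoice = {
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 10.00, "total": 20.00},
        {"description": "Gasket", "quantity": 1, "unit_price": 5.50, "total": 5.50},
    ],
    "totals": {"subtotal": 25.50, "tax": 2.04, "shipping": 5.00, "total": 32.54},
}

print(check_invoice_totals(invoice))  # []
```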

Real-World Examples

Financial Services: Pay Stub Schema

{
  "employee": {
    "name": "string",
    "employee_id": "string",
    "ssn_last_4": "string"
  },
  "employer": {
    "name": "string",
    "address": "string",
    "ein": "string"
  },
  "pay_period": {
    "start_date": "date",
    "end_date": "date",
    "pay_date": "date"
  },
  "earnings": {
    "regular_hours": "float",
    "overtime_hours": "float",
    "regular_rate": "float",
    "overtime_rate": "float",
    "gross_pay": "float"
  },
  "deductions": {
    "federal_tax": "float",
    "state_tax": "float",
    "social_security": "float",
    "medicare": "float",
    "health_insurance": "float",
    "retirement_401k": "float"
  },
  "ytd": {
    "gross_earnings": "float",
    "federal_tax": "float",
    "state_tax": "float",
    "net_pay": "float"
  },
  "net_pay": "float"
}
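
Identifier fields such as `ein` and `ssn_last_4` are plain strings in the schema, so a format check after extraction is a cheap safeguard. This is a sketch under assumptions: the EIN pattern follows the XX-XXXXXXX format mentioned earlier, `ssn_last_4` is assumed to be exactly four digits, and `check_identifiers` is a hypothetical helper.

```python
import re

# Assumed formats: EIN as XX-XXXXXXX, ssn_last_4 as exactly four ASCII digits.
EIN_PATTERN = re.compile(r"[0-9]{2}-[0-9]{7}")
SSN_LAST_4_PATTERN = re.compile(r"[0-9]{4}")


def check_identifiers(pay_stub: dict) -> list[str]:
    """Return the paths of identifier fields that do not match the expected format."""
    problems = []
    if not EIN_PATTERN.fullmatch(pay_stub["employer"]["ein"]):
        problems.append("employer.ein")
    if not SSN_LAST_4_PATTERN.fullmatch(pay_stub["employee"]["ssn_last_4"]):
        problems.append("employee.ssn_last_4")
    return problems


extracted = {
    "employee": {"name": "Jane Doe", "ssn_last_4": "6789"},
    "employer": {"name": "Acme Corp", "ein": "123456789"},  # hyphen missing
}

print(check_identifiers(extracted))  # ['employer.ein']
```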

Healthcare: Medical Record Schema

{
  "patient": {
    "name": "string",
    "dob": "date",
    "mrn": "string",
    "insurance_id": "string"
  },
  "visit": {
    "date": "date",
    "provider": "string",
    "facility": "string",
    "visit_type": {
      "type": "enum",
      "values": ["Inpatient", "Outpatient", "Emergency", "Telehealth"]
    }
  },
  "vitals": {
    "blood_pressure_systolic": "integer",
    "blood_pressure_diastolic": "integer",
    "heart_rate": "integer",
    "temperature": "float",
    "respiratory_rate": "integer"
  },
  "diagnoses": [
    {
      "icd_10_code": "string",
      "description": "string",
      "type": {
        "type": "enum",
        "values": ["Primary", "Secondary", "Comorbidity"]
      }
    }
  ],
  "medications": [
    {
      "name": "string",
      "dosage": "string",
      "frequency": "string",
      "route": "string"
    }
  ]
}

Common Pitfalls to Avoid

🚨 Don't Mix Semantic Levels

Bad: { "name": "...", "address_line_1": "..." }
Good: { "name": "...", "address": { "line_1": "..." } }
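
Flat fields that share a prefix are often a sign that a nested object is missing. A rough heuristic for spotting them, assuming snake_case names, is sketched below; `shared_prefixes` is a hypothetical helper, and it will also flag legitimate groups such as `invoice_date` and `invoice_number`, so treat its output as suggestions only.

```python
from collections import defaultdict


def shared_prefixes(schema: dict, min_count: int = 2) -> dict[str, list[str]]:
    """Group top-level snake_case fields by their first word."""
    groups = defaultdict(list)
    for name in schema:
        if "_" in name:
            groups[name.split("_", 1)[0]].append(name)
    # Keep only prefixes shared by at least min_count fields.
    return {prefix: names for prefix, names in groups.items() if len(names) >= min_count}


flat_schema = {
    "name": "string",
    "address_line_1": "string",
    "address_city": "string",
    "address_zip": "string",
}

print(shared_prefixes(flat_schema))
# {'address': ['address_line_1', 'address_city', 'address_zip']}
```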

🚨 Don't Use Ambiguous Names

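Bad: "value", "amount", "total" as top-level fields
Good: "invoice_total", "line_item_amount", "tax_value" (a short name like "total" is fine once a parent object such as "totals" supplies the context)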

🚨 Don't Over-Nest

If you're 5+ levels deep, you're probably doing it wrong. Aim for 2-3 levels max.
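
Nesting depth is easy to measure mechanically. The sketch below assumes the schema conventions used in this article (shorthand type strings, field specs with a string `type` key, and single-element lists for repeated items); `schema_depth` is a hypothetical helper that counts a flat schema as depth 1.

```python
def is_field_spec(node) -> bool:
    """A field spec is a dict whose "type" key holds a type name string."""
    return isinstance(node, dict) and isinstance(node.get("type"), str)


def schema_depth(node) -> int:
    """Return the nesting depth of a schema; leaf fields contribute 0."""
    if isinstance(node, list):
        # A list does not add a level by itself; its item template does.
        return max((schema_depth(item) for item in node), default=0)
    if isinstance(node, dict) and not is_field_spec(node):
        return 1 + max((schema_depth(child) for child in node.values()), default=0)
    return 0


invoice_schema = {
    "invoice_number": "string",
    "document_type": {"type": "enum", "values": ["Invoice", "Credit Note"]},
    "vendor": {"name": "string", "contact": {"email": "string"}},
    "line_items": [{"description": "string", "total": "float"}],
}

print(schema_depth(invoice_schema))  # 3
```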

Testing Your Schema

Before deploying to production:

  1. Test with 20-50 real documents
  2. Check for null/missing values — adjust descriptions if needed
  3. Look for fields that are consistently wrong — they need better descriptions
  4. Validate against edge cases (handwritten, poor scans, unusual formats)
  5. Use Retriv's confidence scores to identify problematic fields
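
Steps 2 and 3 can be sketched as a per-field report over a labeled test set. This example assumes flat extraction results and hand-labeled ground truth as plain dicts; `field_report` is a hypothetical helper, and it does not use Retriv's confidence scores, whose format is not covered here.

```python
from collections import Counter


def field_report(predictions: list[dict], ground_truth: list[dict]) -> dict[str, dict]:
    """Compute per-field null rate and error rate across a labeled test set."""
    nulls, wrong = Counter(), Counter()
    fields = set()
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            fields.add(field)
            actual = pred.get(field)
            if actual is None:
                nulls[field] += 1   # missing or null extraction
            elif actual != expected:
                wrong[field] += 1   # extracted, but incorrect
    n = len(ground_truth)
    return {
        field: {"null_rate": nulls[field] / n, "error_rate": wrong[field] / n}
        for field in sorted(fields)
    }


truth = [
    {"invoice_number": "A-1", "invoice_date": "2024-10-01"},
    {"invoice_number": "A-2", "invoice_date": "2024-10-02"},
    {"invoice_number": "A-3", "invoice_date": "2024-10-03"},
    {"invoice_number": "A-4", "invoice_date": "2024-10-04"},
]
preds = [
    {"invoice_number": "A-1", "invoice_date": "2024-10-01"},
    {"invoice_number": "A-2", "invoice_date": None},
    {"invoice_number": "A-3", "invoice_date": "2024-10-30"},
    {"invoice_number": "A-4", "invoice_date": "2024-10-04"},
]

for field, stats in field_report(preds, truth).items():
    print(field, stats)
# invoice_date {'null_rate': 0.25, 'error_rate': 0.25}
# invoice_number {'null_rate': 0.0, 'error_rate': 0.0}
```

Fields with a high null rate or error rate are the ones whose descriptions deserve another pass.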

Need Help with Your Schema?

Our solutions engineers can review your schema and provide recommendations.

Alex Rivera
Solutions Engineer @ Retriv.ai
Previously: Data Engineering @ Stripe