Data Models

Document Extraction Tools uses strongly typed Pydantic models throughout the pipeline.

Core Types

PathIdentifier

A uniform handle for file locations with optional context:

from document_extraction_tools.types import PathIdentifier

path = PathIdentifier(
    path="/data/invoices/invoice_001.pdf",
    metadata={"source": "email", "received_date": "2024-01-15"}
)

DocumentBytes

Raw bytes with MIME type and source information:

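from pathlib import Path

from document_extraction_tools.types import DocumentBytes, PathIdentifier

# Read the raw bytes of the file that the identifier points to
path_identifier = PathIdentifier(path="/data/invoices/invoice_001.pdf")
raw_bytes = Path(path_identifier.path).read_bytes()

doc_bytes = DocumentBytes(
    file_bytes=raw_bytes,
    path_identifier=path_identifier,
)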
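
The example above does not show where a MIME type would come from. As a library-independent sketch, the standard-library `mimetypes` module can guess one from the file extension. `guess_mime_type` is a hypothetical helper, not part of `document_extraction_tools`, and whether `DocumentBytes` accepts such a value directly is an assumption to check against the API reference.

```python
import mimetypes


def guess_mime_type(path: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from a file name's extension (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(path)
    # guess_type returns (None, None) when the extension is unknown.
    return mime_type or default


pdf_mime = guess_mime_type("/data/invoices/invoice_001.pdf")
unknown_mime = guess_mime_type("/data/invoices/invoice_001.unknownext")
```

Guessing from the extension is cheap but trusts the file name; inspecting the leading bytes is more robust when names are unreliable.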

TextData

Encapsulates textual content for a page:

from document_extraction_tools.types import TextData

text_data = TextData(content="Invoice #12345...")

ImageData

Encapsulates image content in various formats (bytes, PIL Image, or NumPy array):

from pathlib import Path

import numpy as np
from PIL import Image

from document_extraction_tools.types import ImageData

# From raw bytes
raw_image_bytes = Path("page.png").read_bytes()
image_data = ImageData(content=raw_image_bytes)

# Or from a PIL Image
image_data = ImageData(content=Image.open("page.png"))

# Or from a NumPy array
image_data = ImageData(content=np.array(...))

Page

Represents a single page within a document:

from document_extraction_tools.types import Page, TextData, ImageData

# Text page
text_page = Page(
    page_number=1,
    data=TextData(content="Invoice #12345..."),
)

# Image page (image_bytes holds the raw bytes of the page image)
image_page = Page(
    page_number=1,
    data=ImageData(content=image_bytes),
)

Document

A parsed document, carrying its pages, declared content type, and metadata:

from document_extraction_tools.types import Document, Page, PathIdentifier, TextData

document = Document(
    id="doc-001",
    path_identifier=PathIdentifier(path="/data/invoices/invoice_001.pdf"),
    pages=[
        Page(
            page_number=1,
            data=TextData(content="Invoice #12345..."),
        )
    ],
    content_type="text",
    metadata={"page_count": 1},
)

Content Type Validation

The Document model validates that all page data types match the declared content_type. If content_type is "text", all pages must contain TextData. If content_type is "image", all pages must contain ImageData.
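
To make the rule concrete, here is a minimal sketch of how such a check could be written with a Pydantic v2 `model_validator`. The classes below are simplified stand-ins defined only for this illustration; the real models in `document_extraction_tools.types` have more fields and may implement the check differently.

```python
from typing import Literal, Union

from pydantic import BaseModel, ValidationError, model_validator


# Stand-in models for illustration only; the real types live in
# document_extraction_tools.types and may be defined differently.
class TextData(BaseModel):
    content: str


class ImageData(BaseModel):
    content: bytes


class Page(BaseModel):
    page_number: int
    data: Union[TextData, ImageData]


class Document(BaseModel):
    id: str
    pages: list[Page]
    content_type: Literal["text", "image"]

    @model_validator(mode="after")
    def check_pages_match_content_type(self) -> "Document":
        # Every page must hold the data class implied by content_type.
        expected = TextData if self.content_type == "text" else ImageData
        for page in self.pages:
            if not isinstance(page.data, expected):
                raise ValueError(
                    f"page {page.page_number} holds {type(page.data).__name__}, "
                    f"but content_type is {self.content_type!r}"
                )
        return self


text_page = Page(page_number=1, data=TextData(content="Invoice #12345..."))

# Matching declaration: accepted.
valid_doc = Document(id="doc-001", pages=[text_page], content_type="text")

# A text page inside a document declared as "image": rejected.
try:
    Document(id="doc-002", pages=[text_page], content_type="image")
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True
```

Running the check after field validation (`mode="after"`) means the pages are already typed objects, so a plain `isinstance` test is enough.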

ExtractionSchema

Your custom Pydantic model defining the target output structure:

from pydantic import BaseModel, Field

class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")
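
Because the schema is an ordinary Pydantic model, its field descriptions are available as machine-readable JSON schema. The sketch below assumes Pydantic v2 method names (`model_json_schema`, `model_validate`); how the pipeline itself consumes the schema is not covered here.

```python
from pydantic import BaseModel, Field


class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")


# Field descriptions are carried into the generated JSON schema.
schema = InvoiceSchema.model_json_schema()
vendor_description = schema["properties"]["vendor"]["description"]
required_fields = set(schema["required"])

# A plain dict (for example, parsed JSON) can be validated into the typed model.
invoice = InvoiceSchema.model_validate(
    {"invoice_id": "12345", "vendor": "Acme Corp", "total": 1500.00}
)
```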

Evaluation Types

EvaluationExample

Pairs a ground-truth extraction with the path of its source file:

from document_extraction_tools.types import EvaluationExample, ExtractionResult, PathIdentifier

example = EvaluationExample(
    id="/data/test/invoice_001.pdf",
    path_identifier=PathIdentifier(path="/data/test/invoice_001.pdf"),
    true=ExtractionResult(
        data=InvoiceSchema(
            invoice_id="12345",
            vendor="Acme Corp",
            total=1500.00
        ),
    ),
)

EvaluationResult

A named score produced by an evaluator:

from document_extraction_tools.types import EvaluationResult

result = EvaluationResult(
    name="field_accuracy",
    result=0.95,
    description="Fraction of fields correctly extracted"
)
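
As an illustration of where a number like `0.95` might come from, the sketch below computes a simple exact-match field accuracy between a predicted and a ground-truth model. `field_accuracy` is a hypothetical helper written for this example, not the library's evaluator, and it compares top-level fields only.

```python
from pydantic import BaseModel


class InvoiceSchema(BaseModel):
    invoice_id: str
    vendor: str
    total: float


def field_accuracy(predicted: BaseModel, truth: BaseModel) -> float:
    """Fraction of top-level fields in `truth` that `predicted` matches exactly."""
    truth_fields = truth.model_dump()
    predicted_fields = predicted.model_dump()
    if not truth_fields:
        return 1.0  # no fields to compare, so nothing can be wrong
    matches = sum(
        1
        for name, value in truth_fields.items()
        if name in predicted_fields and predicted_fields[name] == value
    )
    return matches / len(truth_fields)


truth = InvoiceSchema(invoice_id="12345", vendor="Acme Corp", total=1500.00)
predicted = InvoiceSchema(invoice_id="12345", vendor="ACME", total=1500.00)

# invoice_id and total match, vendor does not: 2 of 3 fields.
score = field_accuracy(predicted, truth)
```

The resulting float could then be passed as the `result` of an `EvaluationResult`, as in the example above. Exact matching is strict; real evaluators often normalize strings or allow numeric tolerance.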

Type Safety

All models are Pydantic BaseModels, providing:

  • Validation - Automatic type checking and coercion
  • Serialization - Easy JSON/dict conversion
  • Documentation - Field descriptions for clarity
  • IDE Support - Full autocomplete and type hints
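
The first two points can be seen directly on the `InvoiceSchema` example from earlier. This is a small sketch assuming Pydantic v2 in its default (lax) validation mode; it uses only Pydantic itself, not the library's own types.

```python
from pydantic import BaseModel, Field, ValidationError


class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")


# Coercion: a numeric string is converted to float in lax mode.
invoice = InvoiceSchema(invoice_id="12345", vendor="Acme Corp", total="1500.00")

# Validation: a value that cannot be read as a float is rejected.
try:
    InvoiceSchema(invoice_id="12345", vendor="Acme Corp", total="a lot")
    bad_total_rejected = False
except ValidationError:
    bad_total_rejected = True

# Serialization: round-trip through JSON back to an equal model.
as_json = invoice.model_dump_json()
restored = InvoiceSchema.model_validate_json(as_json)
```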