Data Models

Document Extraction Tools uses strongly typed Pydantic models throughout the pipeline.

Core Types

PathIdentifier

A uniform handle for file locations with optional context:

from document_extraction_tools.types import PathIdentifier

path = PathIdentifier(
    path="/data/invoices/invoice_001.pdf",
    metadata={"source": "email", "received_date": "2024-01-15"}
)

DocumentBytes

Raw bytes with MIME type and source information:

from document_extraction_tools.types import DocumentBytes

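# `raw_bytes` (the file's contents) and `path_identifier` (a PathIdentifier)
# are assumed to be defined earlier.
doc_bytes = DocumentBytes(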
    file_bytes=raw_bytes,
    path_identifier=path_identifier,
)

TextData

Encapsulates textual content for a page:

from document_extraction_tools.types import TextData

text_data = TextData(content="Invoice #12345...")

ImageData

Encapsulates image content in various formats (bytes, PIL Image, or NumPy array):

from document_extraction_tools.types import ImageData

# From raw bytes
image_data = ImageData(content=raw_image_bytes)

# Or from PIL Image
from PIL import Image
image_data = ImageData(content=Image.open("page.png"))

# Or from NumPy array
import numpy as np
image_data = ImageData(content=np.array(...))

Page

Represents a single page within a document:

from document_extraction_tools.types import Page, TextData, ImageData

# Text page
text_page = Page(
    page_number=1,
    data=TextData(content="Invoice #12345..."),
)

# Image page
image_page = Page(
    page_number=1,
    data=ImageData(content=raw_image_bytes),
)

Document

A parsed document: its pages, their declared content type, and metadata:

from document_extraction_tools.types import Document, Page, TextData

# `path_identifier` is a PathIdentifier assumed to be defined earlier.
document = Document(
    id="doc-001",
    path_identifier=path_identifier,
    pages=[
        Page(
            page_number=1,
            data=TextData(content="Invoice #12345..."),
        )
    ],
    content_type="text",
    metadata={"page_count": 1},
)

Content Type Validation

The Document model validates that all page data types match the declared content_type. If content_type is "text", all pages must contain TextData. If content_type is "image", all pages must contain ImageData.
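
The rule above can be sketched with a Pydantic model validator. This is only an illustration, assuming Pydantic v2: the `Sketch*` classes are hypothetical stand-ins, not the library's own models, and the library's actual implementation may differ.

```python
from typing import List, Literal, Union

from pydantic import BaseModel, ValidationError, model_validator


# Stand-in models for illustration only; not the library's classes.
class SketchTextData(BaseModel):
    content: str


class SketchImageData(BaseModel):
    content: bytes


class SketchPage(BaseModel):
    page_number: int
    data: Union[SketchTextData, SketchImageData]


class SketchDocument(BaseModel):
    pages: List[SketchPage]
    content_type: Literal["text", "image"]

    @model_validator(mode="after")
    def check_pages_match_content_type(self) -> "SketchDocument":
        # Pick the data class that the declared content type requires.
        expected = SketchTextData if self.content_type == "text" else SketchImageData
        for page in self.pages:
            if not isinstance(page.data, expected):
                raise ValueError(
                    f"page {page.page_number} does not hold {expected.__name__}"
                )
        return self


text_page = SketchPage(page_number=1, data=SketchTextData(content="Invoice #12345..."))

# Matching content type: accepted.
ok_document = SketchDocument(pages=[text_page], content_type="text")

# Mismatched content type: rejected with a ValidationError.
try:
    SketchDocument(pages=[text_page], content_type="image")
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True
```

Running the check after field validation means a mismatch surfaces as an ordinary `ValidationError` at construction time, before the document reaches any later pipeline stage.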

ExtractionSchema

Your custom Pydantic model defining the target output structure:

from pydantic import BaseModel, Field

class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")
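
Because the schema is an ordinary Pydantic model, its field descriptions travel with it. As a small sketch, assuming Pydantic v2 and its `model_json_schema()` method, the JSON Schema generated from `InvoiceSchema` carries each field's type and description along with the list of required fields. How the library itself consumes the schema is not shown here.

```python
from pydantic import BaseModel, Field


class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")


# Generate a JSON Schema (a plain dict) from the model definition.
schema = InvoiceSchema.model_json_schema()

# Each field's JSON type and description are carried into the schema.
for name, spec in schema["properties"].items():
    print(f"{name}: {spec['type']} - {spec['description']}")

# Fields declared with `...` are listed as required.
print(schema["required"])
```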

Evaluation Types

EvaluationExample

Pairs a ground-truth extraction with the path of the file it was taken from:

from document_extraction_tools.types import EvaluationExample, ExtractionResult, PathIdentifier

# InvoiceSchema is the custom model defined in the ExtractionSchema section above.
example = EvaluationExample(
    id="/data/test/invoice_001.pdf",
    path_identifier=PathIdentifier(path="/data/test/invoice_001.pdf"),
    true=ExtractionResult(
        data=InvoiceSchema(
            invoice_id="12345",
            vendor="Acme Corp",
            total=1500.00
        ),
    ),
)

EvaluationResult

A single named score produced by an evaluator:

from document_extraction_tools.types import EvaluationResult

result = EvaluationResult(
    name="field_accuracy",
    result=0.95,
    description="Fraction of fields correctly extracted"
)
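
To make the `result` value concrete, here is a minimal sketch of how a field-accuracy score like the one above might be computed. The `field_accuracy` helper is hypothetical and not part of the library. It compares plain dicts, such as those obtained by calling Pydantic's `model_dump()` on the ground-truth and predicted schema instances.

```python
from typing import Any, Dict


def field_accuracy(true_fields: Dict[str, Any], predicted_fields: Dict[str, Any]) -> float:
    """Return the fraction of ground-truth fields whose predicted value matches exactly."""
    if not true_fields:
        return 0.0
    matches = sum(
        1
        for name, expected in true_fields.items()
        if name in predicted_fields and predicted_fields[name] == expected
    )
    return matches / len(true_fields)


true_fields = {"invoice_id": "12345", "vendor": "Acme Corp", "total": 1500.00}
predicted_fields = {"invoice_id": "12345", "vendor": "Acme Corporation", "total": 1500.00}

# Two of the three fields match; the vendor name differs.
score = field_accuracy(true_fields, predicted_fields)
```

The score could then be wrapped as `EvaluationResult(name="field_accuracy", result=score, ...)`. Exact equality is a deliberately strict choice. A real evaluator might normalize strings or allow a numeric tolerance.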

Type Safety

All models are subclasses of Pydantic's BaseModel, providing:

Validation - Automatic type checking and coercion
Serialization - Easy JSON/dict conversion
Documentation - Field descriptions for clarity
IDE Support - Full autocomplete and type hints
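
The first two points can be seen with the `InvoiceSchema` model from earlier on this page. This is a short sketch, assuming Pydantic v2 method names (`model_dump`, `model_dump_json`, `model_validate_json`):

```python
from pydantic import BaseModel, Field, ValidationError


class InvoiceSchema(BaseModel):
    invoice_id: str = Field(..., description="Unique invoice identifier")
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Total amount")


# Coercion: a numeric string is converted to float in Pydantic's default (lax) mode.
invoice = InvoiceSchema(invoice_id="12345", vendor="Acme Corp", total="1500.00")

# Validation: a value that cannot be converted raises ValidationError.
try:
    InvoiceSchema(invoice_id="12345", vendor="Acme Corp", total="not a number")
    invalid_rejected = False
except ValidationError:
    invalid_rejected = True

# Serialization: convert to a dict or a JSON string, then load the JSON back.
as_dict = invoice.model_dump()
as_json = invoice.model_dump_json()
restored = InvoiceSchema.model_validate_json(as_json)
```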