
Overview

Document Extraction Tools provides a modular framework for building document extraction and evaluation pipelines.

Architecture

The library is built around these core principles:

  1. Interface-Driven Design - Abstract base classes define contracts for each component
  2. Pluggable Components - You implement the interfaces for your specific use case (see the sketch after this list)
  3. Concurrent Execution - Orchestrators handle parallelism and async I/O
  4. Configuration as Code - Pydantic models + YAML for reproducible pipelines
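
In practice, each stage is defined by a small abstract base class, and you plug in your own subclass without touching the rest of the pipeline. The following is a minimal sketch of that pattern; the Reader interface and method name shown here are illustrative assumptions rather than the library's actual API:

from abc import ABC, abstractmethod
from pathlib import Path

# Hypothetical interface sketch; the library defines its own abstract base classes.
class Reader(ABC):
    @abstractmethod
    def read(self, path: Path) -> bytes:
        """Return the raw bytes for one input file."""

# A pluggable implementation for the local filesystem.
class LocalFileReader(Reader):
    def read(self, path: Path) -> bytes:
        return path.read_bytes()

Swapping LocalFileReader for, say, an object-store reader then changes nothing else in the pipeline.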

Pipeline Types

Extraction Pipeline

Transforms raw documents into structured data:

flowchart LR
    FL[FileLister] --> R[Reader] --> C[Converter] --> E[Extractor] --> EX[Exporter]
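
Stripped of concurrency, the data flow through these stages is roughly the following (a sequential sketch with assumed method names; the real orchestrator, described below, runs the stages in parallel):

# Hypothetical sequential version of the extraction flow; all method names are assumptions.
def run_extraction(file_lister, reader, converter, extractor, exporter):
    for path in file_lister.list_files():
        raw = reader.read(path)               # raw bytes
        document = converter.convert(raw)     # parsed Document
        record = extractor.extract(document)  # structured data
        exporter.export(record)               # persisted result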

Evaluation Pipeline

Measures extraction quality against ground truth:

flowchart LR
    TDL[TestDataLoader] --> R[Reader] --> C[Converter] --> E[Extractor] --> EV[Evaluator] --> EX[Exporter]
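
The flow mirrors the extraction pipeline, with ground-truth examples supplied by the TestDataLoader and each prediction scored by the Evaluator. A metric can be as simple as field-level exact match, as in this hedged sketch (the evaluate signature is an assumption, not the library's actual contract):

# Hypothetical Evaluator computing field-level exact-match accuracy.
class ExactMatchEvaluator:
    def evaluate(self, prediction: dict, ground_truth: dict) -> dict:
        fields = list(ground_truth)
        correct = sum(1 for f in fields if prediction.get(f) == ground_truth[f])
        return {"field_accuracy": correct / len(fields) if fields else 0.0}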

Component Overview

Component      | Purpose                  | Execution
---------------|--------------------------|---------------------------
FileLister     | Discover input files     | Sync (before orchestrator)
Reader         | Read raw bytes           | Thread pool (CPU-bound)
Converter      | Parse to Document        | Thread pool (CPU-bound)
Extractor      | Extract structured data  | Async (I/O-bound)
Exporter       | Persist results          | Async (I/O-bound)
TestDataLoader | Load evaluation examples | Sync (before orchestrator)
Evaluator      | Compute metrics          | Thread pool (CPU-bound)
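
The Execution column also determines the shape of each interface: the CPU-bound components expose ordinary synchronous methods, while the I/O-bound components are async. For example, an Extractor that calls an external model API would look roughly like this (a sketch assuming an async extract method and an async client; not the library's actual signature):

# Hypothetical async Extractor: the network call runs in the event loop, not the thread pool.
class ApiExtractor:
    def __init__(self, client):
        self.client = client  # assumed async client for a model or OCR API

    async def extract(self, document) -> dict:
        response = await self.client.complete(document.text)  # awaitable I/O call
        return {"title": response.get("title"), "date": response.get("date")}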

Concurrency Model

The orchestrators are designed to handle both CPU-bound and I/O-bound work efficiently:

  • Thread Pool - Reader and Converter run in a thread pool for CPU parallelism
  • Async I/O - Extractor and Exporter run concurrently in the event loop
  • Semaphore Control - max_concurrency limits concurrent async operations (see the sketch below)

# Tuning options in orchestrator config
max_workers: 4        # Thread pool size
max_concurrency: 10   # Async semaphore limit
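
These two knobs map onto a common asyncio pattern: blocking stages are dispatched to a thread pool of max_workers threads, while an asyncio.Semaphore caps in-flight async calls at max_concurrency. A minimal sketch of that pattern, not the library's actual orchestrator code:

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def process(path, reader, converter, extractor, pool, semaphore):
    loop = asyncio.get_running_loop()
    # CPU-bound stages run in the thread pool (sized by max_workers).
    raw = await loop.run_in_executor(pool, reader.read, path)
    document = await loop.run_in_executor(pool, converter.convert, raw)
    # I/O-bound stage runs in the event loop, gated by the semaphore (max_concurrency).
    async with semaphore:
        return await extractor.extract(document)

async def run_all(paths, reader, converter, extractor, max_workers=4, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [process(p, reader, converter, extractor, pool, semaphore) for p in paths]
        return await asyncio.gather(*tasks)

As a rule of thumb, max_workers should track the CPU cores available for parsing, while max_concurrency is bounded by what the downstream extraction service can tolerate.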

Next Steps