# Overview
Document Extraction Tools provides a modular framework for building document extraction and evaluation pipelines.
## Architecture
The library is built around these core principles:
- Interface-Driven Design - Abstract base classes define contracts for each component
- Pluggable Components - You implement the interfaces for your specific use case
- Concurrent Execution - Orchestrators handle parallelism and async I/O
- Configuration as Code - Pydantic models + YAML for reproducible pipelines
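The interface-driven principle can be sketched with Python's `abc` module. The class names below (`Extractor`, `KeywordExtractor`) and the method signature are hypothetical illustrations, not the library's actual API:

```python
import asyncio
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical contract: turn converted text into structured data."""

    @abstractmethod
    async def extract(self, text: str) -> dict:
        ...


class KeywordExtractor(Extractor):
    """Toy implementation: flag documents that mention a keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    async def extract(self, text: str) -> dict:
        return {"keyword": self.keyword, "found": self.keyword in text}


result = asyncio.run(KeywordExtractor("invoice").extract("invoice #42"))
```

Because each component is defined by an abstract contract, the orchestrator can run any implementation without knowing its internals.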
## Pipeline Types
### Extraction Pipeline
Transforms raw documents into structured data:
```mermaid
flowchart LR
    FL[FileLister] --> R[Reader] --> C[Converter] --> E[Extractor] --> EX[Exporter]
```
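Stripped of concurrency, the extraction flow is a simple composition of stages. This sketch uses plain callables with hypothetical names to show the data flow only; the real orchestrator adds thread pools and async I/O:

```python
def run_extraction(lister, reader, converter, extractor, exporter):
    """Minimal sequential sketch of the extraction pipeline's data flow."""
    results = []
    for path in lister():              # FileLister: discover inputs
        raw = reader(path)             # Reader: raw bytes
        document = converter(raw)      # Converter: parsed document
        results.append(extractor(document))  # Extractor: structured data
    exporter(results)                  # Exporter: persist
    return results


out = run_extraction(
    lister=lambda: ["a.txt"],
    reader=lambda path: b"hello",
    converter=lambda raw: raw.decode(),
    extractor=lambda doc: {"text": doc},
    exporter=lambda records: None,
)
```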
### Evaluation Pipeline
Measures extraction quality against ground truth:
```mermaid
flowchart LR
    TDL[TestDataLoader] --> R[Reader] --> C[Converter] --> E[Extractor] --> EV[Evaluator] --> EX[Exporter]
```
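The evaluation flow differs from extraction in that each input is paired with ground truth and scored. Again a hedged sketch with hypothetical names, not the library's API:

```python
def run_evaluation(load_examples, extract, evaluate):
    """Sequential sketch: score each extraction against its ground truth."""
    scores = []
    for raw, expected in load_examples():   # TestDataLoader: (input, truth) pairs
        predicted = extract(raw)            # same extraction stages as above
        scores.append(evaluate(predicted, expected))  # Evaluator: per-example metric
    return sum(scores) / len(scores)        # aggregate (here: mean)


accuracy = run_evaluation(
    load_examples=lambda: [("a", "a"), ("b", "c")],
    extract=lambda raw: raw,
    evaluate=lambda pred, truth: 1.0 if pred == truth else 0.0,
)
```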
## Component Overview
| Component | Purpose | Execution |
|---|---|---|
| FileLister | Discover input files | Sync (before orchestrator) |
| Reader | Read raw bytes | Thread pool (CPU-bound) |
| Converter | Parse to Document | Thread pool (CPU-bound) |
| Extractor | Extract structured data | Async (I/O-bound) |
| Exporter | Persist results | Async (I/O-bound) |
| TestDataLoader | Load evaluation examples | Sync (before orchestrator) |
| Evaluator | Compute metrics | Thread pool (CPU-bound) |
## Concurrency Model
The orchestrators optimize for both CPU-bound and I/O-bound operations:
- Thread Pool - Reader and Converter run in a thread pool for CPU parallelism
- Async I/O - Extractor and Exporter run concurrently in the event loop
- Semaphore Control - `max_concurrency` limits concurrent async operations
```yaml
# Tuning options in orchestrator config
max_workers: 4        # Thread pool size
max_concurrency: 10   # Async semaphore limit
```
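The interaction of the thread pool and the semaphore can be sketched with the standard library. This is an illustrative model of the concurrency scheme, not the orchestrator's actual code; `MAX_WORKERS` and `MAX_CONCURRENCY` mirror the config values above:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4       # thread pool size for CPU-bound stages
MAX_CONCURRENCY = 10  # semaphore limit for async stages


async def process(items, convert, extract):
    """Run CPU-bound conversion in a thread pool, then rate-limited async extraction."""
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(item):
        parsed = await loop.run_in_executor(pool, convert, item)  # CPU-bound
        async with semaphore:            # cap concurrent async operations
            return await extract(parsed)  # I/O-bound

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return await asyncio.gather(*(one(i) for i in items))


async def fake_extract(value):
    return value * 2


results = asyncio.run(process([1, 2, 3], lambda x: x + 1, fake_extract))
```

Raising `max_concurrency` increases in-flight extraction calls (e.g. to an external API), while `max_workers` bounds CPU parallelism for parsing.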
## Next Steps
- Data Models - Learn about the core types
- Extraction Pipeline - Deep dive into extraction
- Evaluation Pipeline - Set up evaluation
- Configuration - Configure your pipelines