Document Extraction Tools
A modular, high-performance toolkit for building document extraction pipelines.
What is Document Extraction Tools?
Document Extraction Tools provides clear interfaces for every stage of a document extraction pipeline, plus orchestrators that wire the stages together with async I/O and CPU-bound parallelism.
This library is intentionally implementation-light: you plug in your own components (readers, converters, extractors, exporters, evaluators) for each specific document type or data source.
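To make the plug-in idea concrete, here is a minimal sketch of how stage interfaces and an orchestrator could fit together. It uses only the standard library and is not the library's actual API: `Document`, `Reader`, `Converter`, `InMemoryReader`, `Utf8Converter`, and `run_pipeline` are all hypothetical names invented for illustration. The reader is awaited as async I/O, while the converter is handed to an executor so that CPU-bound work does not block the event loop.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:
    """Hypothetical stand-in for a typed document model."""
    path: str
    text: str


class Reader(Protocol):
    """Hypothetical reader interface: fetches raw bytes (I/O-bound)."""
    async def read(self, path: str) -> bytes: ...


class Converter(Protocol):
    """Hypothetical converter interface: parses raw bytes (CPU-bound)."""
    def convert(self, path: str, raw: bytes) -> Document: ...


class InMemoryReader:
    """Toy reader that serves bytes from a dict instead of a real data source."""
    def __init__(self, files: dict[str, bytes]) -> None:
        self.files = files

    async def read(self, path: str) -> bytes:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self.files[path]


class Utf8Converter:
    """Toy converter that decodes bytes as UTF-8 text."""
    def convert(self, path: str, raw: bytes) -> Document:
        return Document(path=path, text=raw.decode("utf-8"))


async def run_pipeline(
    paths: list[str], reader: Reader, converter: Converter, pool: Executor
) -> list[Document]:
    """Read every path concurrently, then convert each result in the executor."""
    loop = asyncio.get_running_loop()

    async def process(path: str) -> Document:
        raw = await reader.read(path)  # async I/O stage
        # Offload the synchronous stage so the event loop stays responsive.
        return await loop.run_in_executor(pool, converter.convert, path, raw)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(process(p) for p in paths))


files = {"a.txt": b"alpha", "b.txt": b"beta"}
# A thread pool keeps this demo portable; a ProcessPoolExecutor could be
# passed instead for truly CPU-bound converters (assumption, not shown here).
with ThreadPoolExecutor(max_workers=2) as pool:
    docs = asyncio.run(
        run_pipeline(list(files), InMemoryReader(files), Utf8Converter(), pool)
    )
print([d.text for d in docs])  # → ['alpha', 'beta']
```

Because the orchestrating function depends only on the `Reader` and `Converter` protocols, swapping in a different data source or file format means writing a new class with the same methods, without touching the orchestration code.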
Key Features
- Consistent Interfaces - A unified set of interfaces for the entire document-extraction lifecycle
- Typed Data Models - Strong typing for documents, pages, and extraction results
- Concurrent Orchestration - Run extraction and evaluation pipelines concurrently and safely
- Flexible Configuration - Pydantic + YAML configuration system for repeatable pipelines
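As a rough illustration of the typed, Pydantic + YAML configuration style listed above, the following sketch validates a YAML document against Pydantic models. The model names and fields (`PipelineConfig`, `ExtractorConfig`, `max_workers`, `provider`, `max_pages`) are hypothetical and are not taken from the library's real configuration schema. It assumes `pydantic` and `PyYAML` are installed.

```python
import yaml  # provided by the PyYAML package
from pydantic import BaseModel, Field, ValidationError


class ExtractorConfig(BaseModel):
    """Hypothetical settings for a single pipeline stage."""
    provider: str
    max_pages: int = Field(default=10, ge=1)


class PipelineConfig(BaseModel):
    """Hypothetical top-level pipeline settings."""
    max_workers: int = Field(default=4, ge=1)
    extractor: ExtractorConfig


YAML_TEXT = """
max_workers: 8
extractor:
  provider: example
  max_pages: 5
"""

# Parse YAML into plain dicts, then let Pydantic validate and type the values.
raw = yaml.safe_load(YAML_TEXT)
config = PipelineConfig(**raw)
print(config.max_workers, config.extractor.max_pages)  # → 8 5

# Invalid values are rejected at load time rather than mid-pipeline.
try:
    PipelineConfig(max_workers=0, extractor={"provider": "example"})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # → True
```

Keeping the settings in YAML and the schema in code means a pipeline run can be repeated from the same file, and a malformed file fails fast with a validation error.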
Quick Example
from document_extraction_tools.runners import ExtractionOrchestrator

# Configure and run your pipeline.
# `config`, `LeaseSchema`, `file_paths`, and the component classes come from your own code.
orchestrator = ExtractionOrchestrator.from_config(
    config=config,
    schema=LeaseSchema,  # Your Pydantic schema
    reader_cls=LocalReader,
    converter_cls=PDFToImageConverter,
    extractor_cls=GeminiImageExtractor,
    exporter_cls=JSONExporter,
)

# `await` must run inside an async function (or a REPL that supports top-level await).
await orchestrator.run(file_paths)
Getting Started
- Installation - Install document-extraction-tools with pip or uv
- Quick Start - Build your first extraction pipeline
- Core Concepts - Understand the architecture and components
- Examples - See full working examples