Skip to content

Document Extraction Tools

A modular, high-performance toolkit for building document extraction pipelines.


What is Document Extraction Tools?

Document Extraction Tools provides clear interfaces for every stage of a document extraction pipeline, plus orchestrators that wire the stages together with async I/O and CPU-bound parallelism.

This library is intentionally implementation-light: you plug in your own components (readers, converters, extractors, exporters, evaluators) for each specific document type or data source.

Key Features

  • Consistent Interfaces - A unified set of interfaces for the entire document-extraction lifecycle
  • Typed Data Models - Strong typing for documents, pages, and extraction results
  • Concurrent Orchestration - Run extraction and evaluation pipelines concurrently and safely
  • Flexible Configuration - Pydantic + YAML configuration system for repeatable pipelines

Quick Example

from document_extraction_tools.runners import ExtractionOrchestrator

# Configure and run your pipeline
orchestrator = ExtractionOrchestrator.from_config(
    config=config,
    schema=LeaseSchema,  # Your Pydantic schema
    reader_cls=LocalReader,
    converter_cls=PDFToImageConverter,
    extractor_cls=GeminiImageExtractor,
    exporter_cls=JSONExporter,
)

await orchestrator.run(file_paths)

Getting Started

Project Layout

.
├── src/document_extraction_tools/
│   ├── base/           # Abstract base classes you implement
│   ├── config/         # Pydantic configs + YAML loader helpers
│   ├── runners/        # Orchestrators that run pipelines
│   └── types/          # Shared models/types
├── tests/
├── pyproject.toml
└── README.md