Base Classes¶
Abstract base classes that define the interfaces for pipeline components.
Extraction Pipeline¶
BaseFileLister¶
Bases: ABC
Abstract interface for file discovery.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseFileListerConfig | Component-specific configuration. |
| pipeline_config | ExtractionPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseFileListerConfig \| ExtractionPipelineConfig | Configuration specific to the file lister implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/file_lister/base_file_lister.py
list_files (abstractmethod)¶
Scans the target source and returns a list of file identifiers.
Implementations own the discovery logic and should return a clean list of work items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Type | Description |
|---|---|
| list[PathIdentifier] | A list of standardized objects containing the path and any necessary execution context. |
Source code in src/document_extraction_tools/base/file_lister/base_file_lister.py
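A minimal sketch of a concrete file lister may help here. It assumes stand-in definitions: the `PathIdentifier` dataclass, the `BaseFileLister` stub taking a plain `dict` config, and the `LocalFileLister` class with its `root` and `pattern` keys are all hypothetical simplifications written for this example, not the library's actual classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PathIdentifier:
    """Stand-in for the library's PathIdentifier (assumed shape)."""
    path: str


class BaseFileLister(ABC):
    """Simplified stand-in that mirrors the documented interface."""

    def __init__(self, config: dict) -> None:
        self.config = config

    @abstractmethod
    def list_files(self, context: Optional[object] = None) -> list:
        ...


class LocalFileLister(BaseFileLister):
    """Hypothetical lister returning the files under a local directory."""

    def list_files(self, context: Optional[object] = None) -> list:
        root = Path(self.config["root"])
        pattern = self.config.get("pattern", "*")
        # Sort so the work list is deterministic across runs.
        return [
            PathIdentifier(str(p))
            for p in sorted(root.glob(pattern))
            if p.is_file()
        ]
```

Because `list_files` is abstract, instantiating `BaseFileLister` directly raises `TypeError`; only subclasses that implement it can be constructed.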
BaseReader¶
Bases: ABC
Abstract interface for document ingestion.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseReaderConfig | Component-specific configuration. |
| pipeline_config | ExtractionPipelineConfig \| EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseReaderConfig \| ExtractionPipelineConfig \| EvaluationPipelineConfig | Configuration specific to the reader implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/reader/base_reader.py
pipeline_config (instance-attribute)¶
read (abstractmethod)¶
Reads a document from a specific source and returns its raw bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_identifier | PathIdentifier | The identifier for the file. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| DocumentBytes | DocumentBytes | A standardized container with raw bytes and source metadata. |
Source code in src/document_extraction_tools/base/reader/base_reader.py
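As an illustration of the `read` contract, the sketch below reads a local file into a bytes container. The `PathIdentifier` and `DocumentBytes` dataclasses and their field names (`path`, `file_bytes`, `path_identifier`) are assumptions made for this example, as is the `LocalReader` class; the library's real models may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PathIdentifier:
    """Stand-in for the library's PathIdentifier (assumed shape)."""
    path: str


@dataclass(frozen=True)
class DocumentBytes:
    """Stand-in container: raw bytes plus source metadata (assumed fields)."""
    file_bytes: bytes
    path_identifier: PathIdentifier


class LocalReader:
    """Hypothetical reader that loads bytes from the local filesystem."""

    def read(
        self,
        path_identifier: PathIdentifier,
        context: Optional[object] = None,
    ) -> DocumentBytes:
        data = Path(path_identifier.path).read_bytes()
        # Carry the identifier along so later stages can trace the source.
        return DocumentBytes(file_bytes=data, path_identifier=path_identifier)
```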
BaseConverter¶
Bases: ABC
Abstract interface for document transformation.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseConverterConfig | Component-specific configuration. |
| pipeline_config | ExtractionPipelineConfig \| EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseConverterConfig \| ExtractionPipelineConfig \| EvaluationPipelineConfig | Configuration specific to the converter implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/converter/base_converter.py
pipeline_config (instance-attribute)¶
convert (abstractmethod)¶
Transforms raw document bytes into a structured Document object.
This method should handle the parsing logic and map the metadata from the input bytes to the output document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_bytes | DocumentBytes | The standardized raw input containing file bytes and source metadata. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| Document | Document | The fully structured document model ready for extraction. |
Source code in src/document_extraction_tools/base/converter/base_converter.py
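The parse-and-map-metadata behaviour described for `convert` can be sketched as follows. The `DocumentBytes` and `Document` dataclasses, their fields (`file_bytes`, `source`, `content`), and the `Utf8TextConverter` class are hypothetical stand-ins for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentBytes:
    """Stand-in raw input (assumed fields)."""
    file_bytes: bytes
    source: str


@dataclass(frozen=True)
class Document:
    """Stand-in structured document (assumed fields)."""
    content: str
    source: str


class Utf8TextConverter:
    """Hypothetical converter for plain-text inputs."""

    def convert(
        self,
        document_bytes: DocumentBytes,
        context: Optional[object] = None,
    ) -> Document:
        # Parsing step: decode bytes, replacing any invalid sequences.
        text = document_bytes.file_bytes.decode("utf-8", errors="replace")
        # Metadata step: copy the source information onto the output.
        return Document(content=text, source=document_bytes.source)
```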
BaseExtractor¶
Bases: ABC
Abstract interface for data extraction.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseExtractorConfig | Component-specific configuration. |
| pipeline_config | ExtractionPipelineConfig \| EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseExtractorConfig \| ExtractionPipelineConfig \| EvaluationPipelineConfig | Configuration specific to the extractor implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/extractor/base_extractor.py
pipeline_config (instance-attribute)¶
extract (abstractmethod, async)¶
extract(document: Document, schema: type[ExtractionSchema], context: PipelineContext | None = None) -> ExtractionResult[ExtractionSchema]
Extracts structured data from a Document to match the provided Schema.
This is an asynchronous operation to support I/O-bound tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | The fully parsed document. | required |
| schema | type[ExtractionSchema] | The Pydantic model class defining the target structure. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Type | Description |
|---|---|
| ExtractionResult[ExtractionSchema] | The extracted data with metadata. |
Source code in src/document_extraction_tools/base/extractor/base_extractor.py
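A rough sketch of an async extractor follows. Everything in it is an assumption for illustration: `Document`, `ExtractionResult`, and `Invoice` are plain dataclass stand-ins (the library expects a Pydantic model as the schema), and `RegexTotalExtractor` replaces a real model call with a regular expression plus `asyncio.sleep(0)` as a placeholder for network I/O.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

SchemaT = TypeVar("SchemaT")


@dataclass
class Document:
    """Stand-in parsed document (assumed field)."""
    content: str


@dataclass
class ExtractionResult(Generic[SchemaT]):
    """Stand-in result wrapper (assumed field)."""
    data: SchemaT


@dataclass
class Invoice:
    """Stand-in schema; a dataclass is used here instead of Pydantic."""
    total: float


class RegexTotalExtractor:
    """Hypothetical extractor that pulls a total out of the text."""

    async def extract(
        self,
        document: Document,
        schema: type,
        context: Optional[object] = None,
    ) -> ExtractionResult:
        await asyncio.sleep(0)  # placeholder for an I/O-bound request
        match = re.search(r"Total:\s*([0-9]+(?:\.[0-9]+)?)", document.content)
        if match is None:
            raise ValueError("no total found in document")
        # Build the result from the schema class passed by the caller.
        return ExtractionResult(data=schema(total=float(match.group(1))))
```

Passing the schema as a class rather than an instance lets one extractor serve several target structures.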
BaseExtractionExporter¶
Bases: ABC
Abstract interface for data persistence.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseExtractionExporterConfig | Component-specific configuration. |
| pipeline_config | ExtractionPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseExtractionExporterConfig \| ExtractionPipelineConfig | Configuration specific to the exporter implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/exporter/base_extraction_exporter.py
export (abstractmethod, async)¶
export(document: Document, data: ExtractionResult[ExtractionSchema], context: PipelineContext | None = None) -> None
Persists extracted data to the configured destination.
This is an asynchronous operation to support non-blocking I/O writes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | The source document for this extraction. | required |
| data | ExtractionResult[ExtractionSchema] | The extracted data with metadata. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| None | None | The method should raise an exception if the export fails. |
Source code in src/document_extraction_tools/base/exporter/base_extraction_exporter.py
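One way to satisfy the "raise on failure" contract is simply to let I/O errors propagate. The sketch below assumes stand-in dataclasses (`Document` with a `doc_id` field, `Invoice`, `ExtractionResult` with a `data` field) and a hypothetical `JsonFileExporter`; none of these names come from the library.

```python
import asyncio
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Document:
    """Stand-in document; `doc_id` is an assumed field."""
    doc_id: str


@dataclass
class Invoice:
    """Stand-in schema (the library expects a Pydantic model)."""
    total: float


@dataclass
class ExtractionResult:
    """Stand-in result wrapper; `data` is an assumed field."""
    data: Invoice


class JsonFileExporter:
    """Hypothetical exporter writing one JSON file per document."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = Path(output_dir)

    async def export(
        self,
        document: Document,
        data: ExtractionResult,
        context: Optional[object] = None,
    ) -> None:
        target = self.output_dir / f"{document.doc_id}.json"
        payload = json.dumps(asdict(data.data))
        # Run the blocking write in a worker thread; any OSError propagates
        # to the caller, which signals a failed export.
        await asyncio.to_thread(target.write_text, payload)
```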
Evaluation Pipeline¶
BaseTestDataLoader¶
Bases: ABC, Generic[ExtractionSchema]
Abstract interface for loading evaluation test data.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseTestDataLoaderConfig | Component-specific configuration. |
| pipeline_config | EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseTestDataLoaderConfig \| EvaluationPipelineConfig | Configuration specific to the test data loader implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/test_data_loader/base_test_data_loader.py
load_test_data (abstractmethod)¶
load_test_data(path_identifier: PathIdentifier, context: PipelineContext | None = None) -> list[EvaluationExample[ExtractionSchema]]
Load test examples for evaluation.
This method should retrieve and return a list of EvaluationExample instances based on the provided path identifier.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_identifier | PathIdentifier | The source location for loading evaluation examples. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Type | Description |
|---|---|
| list[EvaluationExample[ExtractionSchema]] | A list of evaluation examples. |
Source code in src/document_extraction_tools/base/test_data_loader/base_test_data_loader.py
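For illustration, a loader might read ground-truth records from a JSON file. The file layout, the `EvaluationExample` fields (`id`, `path_identifier`, `true`), and the `JsonTestDataLoader` class below are all assumptions for this sketch rather than the library's definitions.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PathIdentifier:
    """Stand-in for the library's PathIdentifier (assumed shape)."""
    path: str


@dataclass
class EvaluationExample:
    """Stand-in example: a document location plus its ground truth."""
    id: str
    path_identifier: PathIdentifier
    true: dict


class JsonTestDataLoader:
    """Hypothetical loader for a JSON list of {id, path, true} records."""

    def load_test_data(
        self,
        path_identifier: PathIdentifier,
        context: Optional[object] = None,
    ) -> list:
        records = json.loads(Path(path_identifier.path).read_text())
        return [
            EvaluationExample(
                id=record["id"],
                path_identifier=PathIdentifier(record["path"]),
                true=record["true"],
            )
            for record in records
        ]
```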
BaseEvaluator¶
Bases: ABC, Generic[ExtractionSchema]
Abstract interface for evaluation metrics.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseEvaluatorConfig | Component-specific configuration. |
| pipeline_config | EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseEvaluatorConfig \| EvaluationPipelineConfig | Configuration specific to the evaluator implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/evaluator/base_evaluator.py
_resolve_config¶
Select the evaluator-specific config from the pipeline config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pipeline_config | EvaluationPipelineConfig | Pipeline configuration with evaluator configs. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| BaseEvaluatorConfig | BaseEvaluatorConfig | The config matching this evaluator. |
Source code in src/document_extraction_tools/base/evaluator/base_evaluator.py
evaluate (abstractmethod)¶
evaluate(true: ExtractionResult[ExtractionSchema], pred: ExtractionResult[ExtractionSchema], context: PipelineContext | None = None) -> EvaluationResult
Compute a metric for a single document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| true | ExtractionResult[ExtractionSchema] | Ground-truth data with metadata. | required |
| pred | ExtractionResult[ExtractionSchema] | Predicted data with metadata. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| EvaluationResult | EvaluationResult | The metric result for this document. |
Source code in src/document_extraction_tools/base/evaluator/base_evaluator.py
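A per-document metric could look like the sketch below, which scores exact field matches. The `EvaluationResult` fields (`name`, `result`, `description`), the dataclass-based `Invoice` schema, and the `FieldAccuracyEvaluator` class are assumptions for this example, not the library's definitions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Invoice:
    """Stand-in schema (the library expects a Pydantic model)."""
    vendor: str
    total: float


@dataclass
class ExtractionResult:
    """Stand-in result wrapper; `data` is an assumed field."""
    data: Invoice


@dataclass
class EvaluationResult:
    """Stand-in metric result (assumed fields)."""
    name: str
    result: float
    description: str


class FieldAccuracyEvaluator:
    """Hypothetical evaluator: fraction of fields predicted exactly."""

    def evaluate(
        self,
        true: ExtractionResult,
        pred: ExtractionResult,
        context: Optional[object] = None,
    ) -> EvaluationResult:
        true_fields = asdict(true.data)
        pred_fields = asdict(pred.data)
        matches = sum(
            1 for key, value in true_fields.items() if pred_fields.get(key) == value
        )
        # An empty schema counts as a perfect match to avoid dividing by zero.
        score = matches / len(true_fields) if true_fields else 1.0
        return EvaluationResult(
            name="field_accuracy",
            result=score,
            description="Fraction of ground-truth fields matched exactly.",
        )
```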
BaseEvaluationExporter¶
Bases: ABC
Abstract interface for exporting evaluation results.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | BaseEvaluationExporterConfig | Component-specific configuration. |
| pipeline_config | EvaluationPipelineConfig \| None | Full pipeline configuration, set only when the component is constructed with one; None otherwise. |
Initialize with a configuration object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | BaseEvaluationExporterConfig \| EvaluationPipelineConfig | Configuration specific to the evaluation exporter implementation, or the full pipeline configuration. | required |
Source code in src/document_extraction_tools/base/exporter/base_evaluation_exporter.py
export (abstractmethod, async)¶
export(results: list[tuple[Document, list[EvaluationResult]]], context: PipelineContext | None = None) -> None
Persist evaluation results to a target destination.
This is an asynchronous operation to support non-blocking I/O writes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[tuple[Document, list[EvaluationResult]]] | A list of tuples containing documents and their associated evaluation results. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| None | None | The method should raise an exception if the export fails. |
Source code in src/document_extraction_tools/base/exporter/base_evaluation_exporter.py
Orchestrators¶
ExtractionOrchestrator¶
Bases: Generic[ExtractionSchema]
Coordinates the document extraction pipeline.
This class manages the lifecycle of document processing, ensuring that CPU-bound tasks (Reading/Converting) are offloaded to a thread pool while I/O-bound tasks (Extracting/Exporting) run concurrently in the async event loop.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | ExtractionOrchestratorConfig | Orchestrator configuration. |
| file_lister | BaseFileLister | File lister component instance. |
| reader | BaseReader | Reader component instance. |
| converter | BaseConverter | Converter component instance. |
| extractor | BaseExtractor | Extractor component instance. |
| extraction_exporter | BaseExtractionExporter | Extraction exporter component instance. |
| schema | type[ExtractionSchema] | Target extraction schema. |
Initialize the orchestrator with pipeline components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | ExtractionOrchestratorConfig | Configuration for the orchestrator. | required |
| file_lister | BaseFileLister | Component to list input files. | required |
| reader | BaseReader | Component to read raw file bytes. | required |
| converter | BaseConverter | Component to transform bytes into Document objects. | required |
| extractor | BaseExtractor | Component to extract structured data via LLM. | required |
| extraction_exporter | BaseExtractionExporter | Component to persist the extraction results. | required |
| schema | type[ExtractionSchema] | The target Pydantic model definition for extraction. | required |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
from_config (classmethod)¶
from_config(config: ExtractionPipelineConfig, schema: type[ExtractionSchema], file_lister_cls: type[BaseFileLister], reader_cls: type[BaseReader], converter_cls: type[BaseConverter], extractor_cls: type[BaseExtractor], extraction_exporter_cls: type[BaseExtractionExporter]) -> ExtractionOrchestrator[ExtractionSchema]
Factory method to create an Orchestrator from a PipelineConfig.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | ExtractionPipelineConfig | The full pipeline configuration. | required |
| schema | type[ExtractionSchema] | The target Pydantic model definition for extraction. | required |
| file_lister_cls | type[BaseFileLister] | The concrete FileLister class to instantiate. | required |
| reader_cls | type[BaseReader] | The concrete Reader class to instantiate. | required |
| converter_cls | type[BaseConverter] | The concrete Converter class to instantiate. | required |
| extractor_cls | type[BaseExtractor] | The concrete Extractor class to instantiate. | required |
| extraction_exporter_cls | type[BaseExtractionExporter] | The concrete ExtractionExporter class to instantiate. | required |
Returns:
| Type | Description |
|---|---|
| ExtractionOrchestrator[ExtractionSchema] | The configured orchestrator instance. |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
_ingest (staticmethod)¶
_ingest(path_identifier: PathIdentifier, reader: BaseReader, converter: BaseConverter, context: PipelineContext) -> Document
Performs the CPU-bound ingestion phase.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_identifier | PathIdentifier | The path identifier to the source file. | required |
| reader | BaseReader | The reader instance to use. | required |
| converter | BaseConverter | The converter instance to use. | required |
| context | PipelineContext | Shared pipeline context. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Document | Document | The fully parsed document object. |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
_run_in_executor_with_context (async, staticmethod)¶
_run_in_executor_with_context(loop: AbstractEventLoop, pool: ThreadPoolExecutor, func: Callable[..., T], *args: object) -> T
Run a function in an executor while preserving contextvars.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loop | AbstractEventLoop | The event loop to use. | required |
| pool | ThreadPoolExecutor | The thread pool to run the function in. | required |
| func | Callable[..., T] | The function to execute. | required |
| *args | object | Arguments to pass to the function. | () |
Returns:
| Type | Description |
|---|---|
| T | The result of the function execution. |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
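Preserving contextvars matters because `loop.run_in_executor` does not copy the caller's context into the worker thread. The sketch below shows one common way to do this with the standard library, using `contextvars.copy_context().run`; it is not taken from the library's source, and the `request_id` variable and helper names are made up for the example.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical context variable used only for this demonstration.
request_id = contextvars.ContextVar("request_id", default="unset")


async def run_in_executor_with_context(loop, pool, func, *args):
    """Run func(*args) in the pool inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return await loop.run_in_executor(pool, ctx.run, func, *args)


def read_request_id() -> str:
    return request_id.get()


async def main():
    request_id.set("doc-123")
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Without copying, the worker thread sees the default value.
        plain = await loop.run_in_executor(pool, read_request_id)
        # With the copied context, the worker sees the caller's value.
        preserved = await run_in_executor_with_context(loop, pool, read_request_id)
    return plain, preserved
```

Running `asyncio.run(main())` returns `("unset", "doc-123")`, which shows the difference between the two calls.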
process_document (async)¶
process_document(path_identifier: PathIdentifier, pool: ThreadPoolExecutor, semaphore: Semaphore, context: PipelineContext) -> None
Runs the full processing lifecycle for a single document.
- Ingest (Read+Convert) -> Offloaded to ThreadPool (CPU).
- Extract -> Async Wait (I/O).
- Export -> Async Wait (I/O).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_identifier | PathIdentifier | The input file to process. | required |
| pool | ThreadPoolExecutor | The shared pool for CPU tasks. | required |
| semaphore | Semaphore | The shared limiter for I/O tasks. | required |
| context | PipelineContext | Shared pipeline context. | required |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
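The three-stage lifecycle can be sketched with the standard library alone. This is an illustration under assumptions, not the orchestrator's implementation: `ingest`, `extract`, and the `exported` list are stand-ins, and holding the semaphore across both extract and export is a guess at how the limiter is scoped.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def ingest(path: str) -> str:
    """Stand-in for read + convert, run in a worker thread."""
    return f"parsed:{path}"


async def extract(document: str) -> dict:
    """Stand-in for an I/O-bound extraction call."""
    await asyncio.sleep(0)
    return {"source": document}


async def process_document(path, pool, semaphore, exported) -> None:
    loop = asyncio.get_running_loop()
    # Stage 1: ingestion is offloaded so it does not block the event loop.
    document = await loop.run_in_executor(pool, ingest, path)
    # Stages 2 and 3: the semaphore caps how many run at once.
    async with semaphore:
        data = await extract(document)
        exported.append(data)  # stands in for the export step


async def run(paths, max_workers: int = 2, max_concurrency: int = 2) -> list:
    exported: list = []
    semaphore = asyncio.Semaphore(max_concurrency)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        await asyncio.gather(
            *(process_document(p, pool, semaphore, exported) for p in paths)
        )
    return exported
```

Sharing one pool and one semaphore across all documents keeps thread count and concurrent I/O bounded regardless of how many files are queued.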
run (async)¶
Main entry point. Orchestrates the execution of the provided file list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_paths_to_process | list[PathIdentifier] | The list of file paths to process. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Source code in src/document_extraction_tools/runners/extraction/extraction_orchestrator.py
EvaluationOrchestrator¶
Bases: Generic[ExtractionSchema]
Coordinates evaluation across multiple evaluators.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | EvaluationOrchestratorConfig | Orchestrator configuration. |
| test_data_loader | BaseTestDataLoader[ExtractionSchema] | Test data loader instance. |
| reader | BaseReader | Reader component instance. |
| converter | BaseConverter | Converter component instance. |
| extractor | BaseExtractor | Extractor component instance. |
| evaluators | list[BaseEvaluator[ExtractionSchema]] | Evaluator instances. |
| evaluation_exporter | BaseEvaluationExporter | Evaluation exporter instance. |
| schema | type[ExtractionSchema] | Target extraction schema. |
Initialize the evaluation orchestrator with pipeline components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | EvaluationOrchestratorConfig | Configuration for evaluation orchestration. | required |
| test_data_loader | BaseTestDataLoader[ExtractionSchema] | Component to load evaluation examples. | required |
| reader | BaseReader | Component to read raw file bytes. | required |
| converter | BaseConverter | Component to transform bytes into Document objects. | required |
| extractor | BaseExtractor | Component to generate predictions. | required |
| evaluators | Iterable[BaseEvaluator[ExtractionSchema]] | Metrics to apply to each example. | required |
| evaluation_exporter | BaseEvaluationExporter | Component to persist evaluation results. | required |
| schema | type[ExtractionSchema] | The target Pydantic model definition for extraction. | required |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
from_config (classmethod)¶
from_config(config: EvaluationPipelineConfig, schema: type[ExtractionSchema], reader_cls: type[BaseReader], converter_cls: type[BaseConverter], extractor_cls: type[BaseExtractor], test_data_loader_cls: type[BaseTestDataLoader[ExtractionSchema]], evaluator_classes: list[type[BaseEvaluator[ExtractionSchema]]], evaluation_exporter_cls: type[BaseEvaluationExporter]) -> EvaluationOrchestrator[ExtractionSchema]
Factory method to create an EvaluationOrchestrator from config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | EvaluationPipelineConfig | The full evaluation pipeline configuration. | required |
| schema | type[ExtractionSchema] | The target Pydantic model definition for extraction. | required |
| reader_cls | type[BaseReader] | The concrete Reader class to instantiate. | required |
| converter_cls | type[BaseConverter] | The concrete Converter class to instantiate. | required |
| extractor_cls | type[BaseExtractor] | The concrete Extractor class to instantiate. | required |
| test_data_loader_cls | type[BaseTestDataLoader[ExtractionSchema]] | The concrete TestDataLoader class to instantiate. | required |
| evaluator_classes | list[type[BaseEvaluator[ExtractionSchema]]] | The evaluator classes available for instantiation. | required |
| evaluation_exporter_cls | type[BaseEvaluationExporter] | The concrete EvaluationExporter class to instantiate. | required |
Returns:
| Type | Description |
|---|---|
| EvaluationOrchestrator[ExtractionSchema] | The configured orchestrator. |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
_ingest (staticmethod)¶
_ingest(path_identifier: PathIdentifier, reader: BaseReader, converter: BaseConverter, context: PipelineContext) -> Document
Performs the CPU-bound ingestion phase.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_identifier | PathIdentifier | The path identifier to the source file. | required |
| reader | BaseReader | The reader instance to use. | required |
| converter | BaseConverter | The converter instance to use. | required |
| context | PipelineContext | Shared pipeline context. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Document | Document | The fully parsed document object. |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
_run_in_executor_with_context (async, staticmethod)¶
_run_in_executor_with_context(loop: AbstractEventLoop, pool: ThreadPoolExecutor, func: Callable[..., T], *args: object) -> T
Run a function in an executor while preserving contextvars.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loop | AbstractEventLoop | The event loop to use. | required |
| pool | ThreadPoolExecutor | The thread pool to run the function in. | required |
| func | Callable[..., T] | The function to execute. | required |
| *args | object | Arguments to pass to the function. | () |
Returns:
| Type | Description |
|---|---|
| T | The result of the function execution. |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
process_example (async)¶
process_example(example: EvaluationExample[ExtractionSchema], pool: ThreadPoolExecutor, semaphore: Semaphore, context: PipelineContext) -> tuple[Document, list[EvaluationResult]]
Runs extraction, evaluation, and export for a single example.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| example | EvaluationExample[ExtractionSchema] | The evaluation example to process. | required |
| pool | ThreadPoolExecutor | The thread pool for CPU-bound tasks. | required |
| semaphore | Semaphore | Semaphore to limit concurrency. | required |
| context | PipelineContext | Shared pipeline context. | required |
Returns:
| Type | Description |
|---|---|
| tuple[Document, list[EvaluationResult]] | The document and its evaluation results. |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
run (async)¶
run(examples: list[EvaluationExample[ExtractionSchema]], context: PipelineContext | None = None) -> None
Run all evaluators and export results for the provided examples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| examples | list[EvaluationExample[ExtractionSchema]] | The evaluation examples to evaluate. | required |
| context | PipelineContext \| None | Optional shared pipeline context. | None |
Source code in src/document_extraction_tools/runners/evaluation/evaluation_orchestrator.py
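The overall evaluation flow (predict for each example, apply every evaluator, collect per-document results) might be sketched as below. All names are stand-ins for illustration: `EvaluationExample`, `EvaluationResult`, `predict`, and the two metric functions are hypothetical, plain dicts replace schema instances, and the final export step is left out.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EvaluationExample:
    """Stand-in example: a document id plus its ground truth."""
    doc_id: str
    true: dict


@dataclass
class EvaluationResult:
    """Stand-in metric result (assumed fields)."""
    name: str
    result: float


async def predict(doc_id: str) -> dict:
    """Stand-in for ingest + extract; always predicts the same total."""
    await asyncio.sleep(0)
    return {"total": 10.0}


def exact_match(true: dict, pred: dict) -> EvaluationResult:
    return EvaluationResult("exact_match", float(true == pred))


def key_recall(true: dict, pred: dict) -> EvaluationResult:
    found = sum(1 for key in true if key in pred)
    return EvaluationResult("key_recall", found / len(true) if true else 1.0)


async def process_example(example, evaluators, semaphore):
    async with semaphore:  # cap concurrent prediction calls
        pred = await predict(example.doc_id)
    # Every evaluator scores the same (truth, prediction) pair.
    return example.doc_id, [ev(example.true, pred) for ev in evaluators]


async def run(examples, evaluators, max_concurrency: int = 4) -> list:
    semaphore = asyncio.Semaphore(max_concurrency)
    # gather preserves input order, so results line up with examples.
    return await asyncio.gather(
        *(process_example(e, evaluators, semaphore) for e in examples)
    )
```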
Import Shortcuts¶
All base classes can be imported from the top-level base module, document_extraction_tools.base.