Configuration¶
Document Extraction Tools uses Pydantic models with YAML files for configuration.
Configuration System¶
Each component has a matching base config class that you subclass to add your own fields.
Base Config Classes¶
Extraction Pipeline:
- BaseFileListerConfig
- BaseReaderConfig
- BaseConverterConfig
- BaseExtractorConfig
- BaseExtractionExporterConfig
- ExtractionOrchestratorConfig
Evaluation Pipeline:
- BaseTestDataLoaderConfig
- BaseEvaluatorConfig
- BaseEvaluationExporterConfig
- EvaluationOrchestratorConfig
Pipeline Config Classes¶
All component configs are aggregated into master pipeline config classes:
- ExtractionPipelineConfig: contains all extraction pipeline component configs
- EvaluationPipelineConfig: contains all evaluation pipeline component configs
These pipeline configs are returned by the config loaders and can be passed directly to the orchestrator's from_config() factory method.
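The aggregation pattern can be sketched with plain dataclasses. Every class below is a hypothetical stand-in rather than the library's implementation: the real configs are Pydantic models, and the real from_config() may accept further arguments that this page does not show.

```python
from dataclasses import dataclass


@dataclass
class ExtractorConfig:
    """Stand-in for one component config (hypothetical)."""
    model_name: str


@dataclass
class OrchestratorConfig:
    """Stand-in for the orchestrator's own config (hypothetical)."""
    max_workers: int = 4


@dataclass
class PipelineConfig:
    """Stand-in for a master config that aggregates component configs."""
    extractor: ExtractorConfig
    extraction_orchestrator: OrchestratorConfig


class Orchestrator:
    """Stand-in orchestrator built from the aggregated config."""

    def __init__(self, model_name: str, max_workers: int) -> None:
        self.model_name = model_name
        self.max_workers = max_workers

    @classmethod
    def from_config(cls, config: PipelineConfig) -> "Orchestrator":
        # The factory reads what it needs from each component config.
        return cls(
            model_name=config.extractor.model_name,
            max_workers=config.extraction_orchestrator.max_workers,
        )


pipeline_config = PipelineConfig(
    extractor=ExtractorConfig(model_name="example-model"),
    extraction_orchestrator=OrchestratorConfig(max_workers=2),
)
orchestrator = Orchestrator.from_config(pipeline_config)
print(orchestrator.model_name, orchestrator.max_workers)
```

Keeping all component configs in one object means the caller hands over a single value, and the factory decides which fields each component needs.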
Creating Custom Configs¶
Subclass the base config to add your specific fields:
from document_extraction_tools.config import BaseExtractorConfig


class MyExtractorConfig(BaseExtractorConfig):
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 4096
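To illustrate what the subclass gives you, here is a minimal sketch of how Pydantic applies defaults and rejects incomplete input. It assumes BaseExtractorConfig behaves like a plain Pydantic BaseModel; the stand-in base class below is hypothetical, and the real base class may define fields of its own.

```python
from pydantic import BaseModel, ValidationError


class BaseExtractorConfig(BaseModel):
    """Stand-in for document_extraction_tools.config.BaseExtractorConfig."""


class MyExtractorConfig(BaseExtractorConfig):
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 4096


# Values as they might arrive from a parsed extractor.yaml.
config = MyExtractorConfig(**{"model_name": "example-model", "max_tokens": 2048})
print(config.temperature)  # not supplied, so the declared default is used

# A missing required field is rejected at construction time.
try:
    MyExtractorConfig(**{"temperature": 0.2})
except ValidationError as exc:
    missing_field_rejected = True
    print(type(exc).__name__)
else:
    missing_field_rejected = False
```

Fields without a default (model_name here) must appear in the YAML file. Fields with a default can be omitted.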
YAML Files¶
Create YAML files in your config directory (default: config/yaml/):
Default Filenames¶
| Config Class | Default Filename |
|---|---|
| FileListerConfig | file_lister.yaml |
| ReaderConfig | reader.yaml |
| ConverterConfig | converter.yaml |
| ExtractorConfig | extractor.yaml |
| ExtractionExporterConfig | extraction_exporter.yaml |
| ExtractionOrchestratorConfig | extraction_orchestrator.yaml |
| TestDataLoaderConfig | test_data_loader.yaml |
| EvaluatorConfig | evaluator.yaml |
| EvaluationExporterConfig | evaluation_exporter.yaml |
| EvaluationOrchestratorConfig | evaluation_orchestrator.yaml |
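A quick pre-flight check can report which of the default files are present in a config directory. The helper below is a hypothetical sketch and not part of the library. Whether the loaders require every file to exist is not stated here, so treat the result as informational.

```python
from pathlib import Path
import tempfile

# Default filenames for the extraction pipeline, taken from the table above.
EXTRACTION_FILENAMES = [
    "file_lister.yaml",
    "reader.yaml",
    "converter.yaml",
    "extractor.yaml",
    "extraction_exporter.yaml",
    "extraction_orchestrator.yaml",
]


def missing_config_files(config_dir: Path, filenames: list[str]) -> list[str]:
    """Return the filenames that do not exist in config_dir (hypothetical helper)."""
    return [name for name in filenames if not (config_dir / name).is_file()]


# Demo in a temporary directory so the example does not touch a real project.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    (config_dir / "extractor.yaml").write_text("model_name: example-model\n")
    missing = missing_config_files(config_dir, EXTRACTION_FILENAMES)

print(missing)
```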
Example YAML Files¶
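As an illustration, an extractor.yaml for the MyExtractorConfig class defined earlier might look like the text embedded below. This is a sketch under an assumption: component configs other than evaluators are taken to use a flat layout, with one key per field. The snippet parses the text with PyYAML to show the mapping that a config class would receive.

```python
import yaml  # PyYAML

# Hypothetical contents of config/yaml/extractor.yaml for MyExtractorConfig.
EXTRACTOR_YAML = """\
model_name: example-model
temperature: 0.0
max_tokens: 4096
"""

extractor_fields = yaml.safe_load(EXTRACTOR_YAML)
print(extractor_fields)
```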
Evaluator Config Format¶
Evaluator configs use a special format where the top-level key matches the config class name:
FieldAccuracyEvaluatorConfig:
  threshold: 0.8
LevenshteinEvaluatorConfig:
  normalize: true
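Keying each section by class name lets one file configure several evaluators. The sketch below shows one way such a file could be resolved into config objects. It is not the library's loader: the dataclasses are stand-ins for the real Pydantic models, and build_evaluator_configs is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FieldAccuracyEvaluatorConfig:
    """Stand-in evaluator config (the real ones are Pydantic models)."""
    threshold: float


@dataclass
class LevenshteinEvaluatorConfig:
    """Stand-in evaluator config (the real ones are Pydantic models)."""
    normalize: bool


def build_evaluator_configs(raw: dict, config_classes: list) -> list:
    """Instantiate each class from the section keyed by its class name."""
    configs = []
    for cls in config_classes:
        section = raw.get(cls.__name__)
        if section is None:
            raise KeyError(f"No section for {cls.__name__} in evaluator config")
        configs.append(cls(**section))
    return configs


# The evaluator YAML shown above, as a parsed mapping.
raw = {
    "FieldAccuracyEvaluatorConfig": {"threshold": 0.8},
    "LevenshteinEvaluatorConfig": {"normalize": True},
}
evaluators = build_evaluator_configs(
    raw, [FieldAccuracyEvaluatorConfig, LevenshteinEvaluatorConfig]
)
print(evaluators)
```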
Loading Configuration¶
Extraction Config¶
from pathlib import Path

from document_extraction_tools.config import load_extraction_config

# The My*Config classes are your own subclasses of the base config classes.
config = load_extraction_config(
    lister_config_cls=MyFileListerConfig,
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    extraction_exporter_config_cls=MyExtractionExporterConfig,
    config_dir=Path("config/yaml"),
)

# Access individual configs
print(config.extractor.model_name)
print(config.extraction_orchestrator.max_workers)
Evaluation Config¶
from pathlib import Path

from document_extraction_tools.config import load_evaluation_config

# The config classes passed below are your own subclasses of the base config classes.
config = load_evaluation_config(
    test_data_loader_config_cls=MyTestDataLoaderConfig,
    evaluator_config_classes=[
        FieldAccuracyEvaluatorConfig,
        LevenshteinEvaluatorConfig,
    ],
    reader_config_cls=MyReaderConfig,
    converter_config_cls=MyConverterConfig,
    extractor_config_cls=MyExtractorConfig,
    evaluation_exporter_config_cls=MyEvaluationExporterConfig,
    config_dir=Path("config/yaml"),
)
Pipeline Context¶
In addition to static configuration, you can pass runtime state through the pipeline using PipelineContext. This is useful for run IDs, environment settings, and cross-cutting concerns like logging and tracing.
See the dedicated Pipeline Context guide for detailed usage and examples.
Environment Variables¶
YAML has no built-in support for environment variables. To reference them, either implement a custom YAML loader that handles a tag such as !env, or use Pydantic's environment variable support (settings classes) in your config classes.
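A minimal sketch of the custom-loader approach, using PyYAML. The EnvLoader class and construct_env function are illustrative names, and how to plug such a loader into the library's config loaders is not covered here.

```python
import os

import yaml  # PyYAML


class EnvLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the !env tag is not registered globally."""


def construct_env(loader: yaml.SafeLoader, node: yaml.ScalarNode) -> str:
    """Replace `!env NAME` with the value of the environment variable NAME."""
    name = loader.construct_scalar(node)
    if name not in os.environ:
        raise KeyError(f"Environment variable {name!r} is not set")
    return os.environ[name]


EnvLoader.add_constructor("!env", construct_env)

# Demo value; in practice the variable is set outside the program.
os.environ["DEMO_API_KEY"] = "not-a-real-key"

settings = yaml.load("api_key: !env DEMO_API_KEY\n", Loader=EnvLoader)
print(settings)
```

Registering the constructor on a subclass keeps plain yaml.safe_load calls elsewhere in the program unaffected, and raising on an unset variable surfaces a missing secret at load time rather than later in the pipeline.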