Configuration API

Configuration classes and loading utilities.

Config Loaders

load_extraction_config

load_extraction_config(lister_config_cls: type[BaseFileListerConfig], reader_config_cls: type[BaseReaderConfig], converter_config_cls: type[BaseConverterConfig], extractor_config_cls: type[BaseExtractorConfig], extraction_exporter_config_cls: type[BaseExtractionExporterConfig], extraction_orchestrator_config_cls: type[ExtractionOrchestratorConfig] = ExtractionOrchestratorConfig, config_dir: Path = Path('config/yaml')) -> ExtractionPipelineConfig

Loads extraction configuration based on component filenames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| lister_config_cls | type[BaseFileListerConfig] | The FileListerConfig subclass to use. | required |
| reader_config_cls | type[BaseReaderConfig] | The ReaderConfig subclass to use. | required |
| converter_config_cls | type[BaseConverterConfig] | The ConverterConfig subclass to use. | required |
| extractor_config_cls | type[BaseExtractorConfig] | The ExtractorConfig subclass to use. | required |
| extraction_exporter_config_cls | type[BaseExtractionExporterConfig] | The ExtractionExporterConfig subclass to use. | required |
| extraction_orchestrator_config_cls | type[ExtractionOrchestratorConfig] | The ExtractionOrchestratorConfig class to use. | ExtractionOrchestratorConfig |
| config_dir | Path | Directory containing the configs. | Path('config/yaml') |

Returns:

| Type | Description |
| --- | --- |
| ExtractionPipelineConfig | The fully validated configuration. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the config directory or mapping file is missing. |

Source code in src/document_extraction_tools/config/config_loader.py
def load_extraction_config(
    lister_config_cls: type[BaseFileListerConfig],
    reader_config_cls: type[BaseReaderConfig],
    converter_config_cls: type[BaseConverterConfig],
    extractor_config_cls: type[BaseExtractorConfig],
    extraction_exporter_config_cls: type[BaseExtractionExporterConfig],
    extraction_orchestrator_config_cls: type[
        ExtractionOrchestratorConfig
    ] = ExtractionOrchestratorConfig,
    config_dir: Path = Path("config/yaml"),
) -> ExtractionPipelineConfig:
    """Loads extraction configuration based on component filenames.

    Args:
        lister_config_cls (type[BaseFileListerConfig]): The FileListerConfig subclass to use.
        reader_config_cls (type[BaseReaderConfig]): The ReaderConfig subclass to use.
        converter_config_cls (type[BaseConverterConfig]): The ConverterConfig subclass to use.
        extractor_config_cls (type[BaseExtractorConfig]): The ExtractorConfig subclass to use.
        extraction_exporter_config_cls (type[BaseExtractionExporterConfig]): The
            ExtractionExporterConfig subclass to use.
        extraction_orchestrator_config_cls (type[ExtractionOrchestratorConfig]): The
            ExtractionOrchestratorConfig class to use.
        config_dir (Path): Directory containing the configs.

    Returns:
        ExtractionPipelineConfig: The fully validated configuration.

    Raises:
        FileNotFoundError: If the config directory or mapping file is missing.
    """
    if not config_dir.exists():
        raise FileNotFoundError(f"Config directory not found: {config_dir.absolute()}")

    return ExtractionPipelineConfig(
        extraction_orchestrator=extraction_orchestrator_config_cls(
            **_load_yaml(config_dir / extraction_orchestrator_config_cls.filename)
        ),
        file_lister=lister_config_cls(
            **_load_yaml(config_dir / lister_config_cls.filename)
        ),
        reader=reader_config_cls(**_load_yaml(config_dir / reader_config_cls.filename)),
        converter=converter_config_cls(
            **_load_yaml(config_dir / converter_config_cls.filename)
        ),
        extractor=extractor_config_cls(
            **_load_yaml(config_dir / extractor_config_cls.filename)
        ),
        extraction_exporter=extraction_exporter_config_cls(
            **_load_yaml(config_dir / extraction_exporter_config_cls.filename)
        ),
    )
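
As the source above shows, each component file is resolved as `config_dir / <config class>.filename`, and the only explicit check is that the directory itself exists. A pre-flight check along the following lines could report missing component files before the loader is called. This is a minimal sketch: the helper `missing_config_files` and the two stand-in classes are hypothetical and not part of the library.

```python
from pathlib import Path
import tempfile


# Hypothetical stand-ins: only the `filename` class attribute matters here.
class FileListerConfig:
    filename = "file_lister.yaml"


class ReaderConfig:
    filename = "reader.yaml"


def missing_config_files(config_dir: Path, config_classes: list[type]) -> list[Path]:
    """Return the per-component YAML paths that do not exist under config_dir."""
    expected = [config_dir / cls.filename for cls in config_classes]
    return [path for path in expected if not path.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    # Create only one of the two expected files.
    (config_dir / "reader.yaml").write_text("{}\n")
    missing = missing_config_files(config_dir, [FileListerConfig, ReaderConfig])
    missing_names = [path.name for path in missing]
    print(missing_names)  # prints ['file_lister.yaml']
```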

load_evaluation_config

load_evaluation_config(test_data_loader_config_cls: type[BaseTestDataLoaderConfig], evaluator_config_classes: list[type[BaseEvaluatorConfig]], reader_config_cls: type[BaseReaderConfig], converter_config_cls: type[BaseConverterConfig], extractor_config_cls: type[BaseExtractorConfig], evaluation_exporter_config_cls: type[BaseEvaluationExporterConfig], evaluation_orchestrator_config_cls: type[EvaluationOrchestratorConfig] = EvaluationOrchestratorConfig, config_dir: Path = Path('config/yaml')) -> EvaluationPipelineConfig

Loads evaluation configuration based on component filenames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| test_data_loader_config_cls | type[BaseTestDataLoaderConfig] | The TestDataLoaderConfig subclass to use. | required |
| evaluator_config_classes | list[type[BaseEvaluatorConfig]] | EvaluatorConfig subclasses to load using the top-level keys in evaluator.yaml. | required |
| reader_config_cls | type[BaseReaderConfig] | The ReaderConfig subclass to use. | required |
| converter_config_cls | type[BaseConverterConfig] | The ConverterConfig subclass to use. | required |
| extractor_config_cls | type[BaseExtractorConfig] | The ExtractorConfig subclass to use. | required |
| evaluation_exporter_config_cls | type[BaseEvaluationExporterConfig] | The EvaluationExporterConfig subclass to use. | required |
| evaluation_orchestrator_config_cls | type[EvaluationOrchestratorConfig] | The EvaluationOrchestratorConfig class to use. | EvaluationOrchestratorConfig |
| config_dir | Path | Directory containing the configs. | Path('config/yaml') |

Returns:

| Type | Description |
| --- | --- |
| EvaluationPipelineConfig | The fully validated configuration. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the config directory or mapping file is missing. |

Source code in src/document_extraction_tools/config/config_loader.py
def load_evaluation_config(
    test_data_loader_config_cls: type[BaseTestDataLoaderConfig],
    evaluator_config_classes: list[type[BaseEvaluatorConfig]],
    reader_config_cls: type[BaseReaderConfig],
    converter_config_cls: type[BaseConverterConfig],
    extractor_config_cls: type[BaseExtractorConfig],
    evaluation_exporter_config_cls: type[BaseEvaluationExporterConfig],
    evaluation_orchestrator_config_cls: type[
        EvaluationOrchestratorConfig
    ] = EvaluationOrchestratorConfig,
    config_dir: Path = Path("config/yaml"),
) -> EvaluationPipelineConfig:
    """Loads evaluation configuration based on component filenames.

    Args:
        test_data_loader_config_cls (type[BaseTestDataLoaderConfig]): The TestDataLoaderConfig subclass to use.
        evaluator_config_classes (list[type[BaseEvaluatorConfig]]): EvaluatorConfig
            subclasses to load using the top-level keys in evaluator.yaml.
        reader_config_cls (type[BaseReaderConfig]): The ReaderConfig subclass to use.
        converter_config_cls (type[BaseConverterConfig]): The ConverterConfig subclass to use.
        extractor_config_cls (type[BaseExtractorConfig]): The ExtractorConfig subclass to use.
        evaluation_exporter_config_cls (type[BaseEvaluationExporterConfig]): The EvaluationExporterConfig
            subclass to use.
        evaluation_orchestrator_config_cls (type[EvaluationOrchestratorConfig]): The
            EvaluationOrchestratorConfig class to use.
        config_dir (Path): Directory containing the configs.

    Returns:
        EvaluationPipelineConfig: The fully validated configuration.

    Raises:
        FileNotFoundError: If the config directory or mapping file is missing.
    """
    if not config_dir.exists():
        raise FileNotFoundError(f"Config directory not found: {config_dir.absolute()}")

    return EvaluationPipelineConfig(
        evaluation_orchestrator=evaluation_orchestrator_config_cls(
            **_load_yaml(config_dir / evaluation_orchestrator_config_cls.filename)
        ),
        test_data_loader=test_data_loader_config_cls(
            **_load_yaml(config_dir / test_data_loader_config_cls.filename)
        ),
        evaluators=_load_evaluator_configs(config_dir, evaluator_config_classes),
        reader=reader_config_cls(**_load_yaml(config_dir / reader_config_cls.filename)),
        converter=converter_config_cls(
            **_load_yaml(config_dir / converter_config_cls.filename)
        ),
        extractor=extractor_config_cls(
            **_load_yaml(config_dir / extractor_config_cls.filename)
        ),
        evaluation_exporter=evaluation_exporter_config_cls(
            **_load_yaml(config_dir / evaluation_exporter_config_cls.filename)
        ),
    )
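
Unlike the other components, evaluator configs are all read from a single evaluator.yaml and selected by top-level key. The helper `_load_evaluator_configs` is not shown on this page, so the following is only a sketch of one plausible scheme. It assumes each top-level key is the config class name, which the source does not confirm. The function `build_evaluator_configs` and both evaluator classes are hypothetical, and dataclasses stand in for pydantic models to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical evaluator configs; dataclasses stand in for pydantic models.
@dataclass
class ExactMatchEvaluatorConfig:
    case_sensitive: bool = True


@dataclass
class FuzzyMatchEvaluatorConfig:
    threshold: float = 0.8


def build_evaluator_configs(
    parsed_yaml: dict[str, dict[str, Any]],
    config_classes: list[type],
) -> list[Any]:
    """Instantiate each class from the section keyed by its class name (assumed)."""
    configs = []
    for cls in config_classes:
        if cls.__name__ not in parsed_yaml:
            raise KeyError(f"No section for {cls.__name__} in evaluator.yaml")
        configs.append(cls(**parsed_yaml[cls.__name__]))
    return configs


# What evaluator.yaml might parse to under that assumption.
parsed = {
    "ExactMatchEvaluatorConfig": {"case_sensitive": False},
    "FuzzyMatchEvaluatorConfig": {"threshold": 0.9},
}
evaluators = build_evaluator_configs(
    parsed, [ExactMatchEvaluatorConfig, FuzzyMatchEvaluatorConfig]
)
```

Whatever the actual key convention is, the result is a list with one config instance per class passed in `evaluator_config_classes`, matching the `evaluators: list[BaseEvaluatorConfig]` field of EvaluationPipelineConfig.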

Pipeline Configs

These master config classes aggregate all component configurations for a pipeline.

ExtractionPipelineConfig

Bases: BaseModel

Master container for extraction pipeline component configurations.

This class aggregates the configurations for all pipeline components.

extraction_orchestrator class-attribute instance-attribute
extraction_orchestrator: ExtractionOrchestratorConfig = Field(..., description='Configuration for orchestrating extraction execution.')
file_lister class-attribute instance-attribute
file_lister: BaseFileListerConfig = Field(..., description='Configuration for file discovery.')
reader class-attribute instance-attribute
reader: BaseReaderConfig = Field(..., description='Configuration for reading raw document bytes.')
converter class-attribute instance-attribute
converter: BaseConverterConfig = Field(..., description='Configuration for converting raw bytes into documents.')
extractor class-attribute instance-attribute
extractor: BaseExtractorConfig = Field(..., description='Configuration for extracting structured data.')
extraction_exporter class-attribute instance-attribute
extraction_exporter: BaseExtractionExporterConfig = Field(..., description='Configuration for exporting extracted data.')

EvaluationPipelineConfig

Bases: BaseModel

Master container for evaluation pipeline component configurations.

This class aggregates the configurations for all evaluation pipeline components.

evaluation_orchestrator class-attribute instance-attribute
evaluation_orchestrator: EvaluationOrchestratorConfig = Field(..., description='Configuration for orchestrating evaluation execution.')
test_data_loader class-attribute instance-attribute
test_data_loader: BaseTestDataLoaderConfig = Field(..., description='Configuration for loading evaluation examples.')
evaluators class-attribute instance-attribute
evaluators: list[BaseEvaluatorConfig] = Field(..., description='Evaluator configurations to apply.')
reader class-attribute instance-attribute
reader: BaseReaderConfig = Field(..., description='Configuration for reading raw document bytes.')
converter class-attribute instance-attribute
converter: BaseConverterConfig = Field(..., description='Configuration for converting raw bytes into documents.')
extractor class-attribute instance-attribute
extractor: BaseExtractorConfig = Field(..., description='Configuration for extracting structured data.')
evaluation_exporter class-attribute instance-attribute
evaluation_exporter: BaseEvaluationExporterConfig = Field(..., description='Configuration for exporting evaluation results.')

Extraction Pipeline Configs

BaseFileListerConfig

Bases: BaseModel

Base config for File Listers.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'file_lister.yaml'

BaseReaderConfig

Bases: BaseModel

Base config for Readers.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'reader.yaml'

BaseConverterConfig

Bases: BaseModel

Base config for Converters.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'converter.yaml'

BaseExtractorConfig

Bases: BaseModel

Base config for Extractors.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'extractor.yaml'

BaseExtractionExporterConfig

Bases: BaseModel

Base config for Extraction Exporters.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'extraction_exporter.yaml'

ExtractionOrchestratorConfig

Bases: BaseModel

Configuration for the Extraction Orchestrator.

filename class-attribute
filename: str = 'extraction_orchestrator.yaml'
max_workers class-attribute instance-attribute
max_workers: int = Field(default=4, description='Number of processes to use for CPU-bound tasks.')
max_concurrency class-attribute instance-attribute
max_concurrency: int = Field(default=10, description='Maximum number of concurrent I/O requests allowed.')
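
The two orchestrator settings bound different resources: `max_workers` sizes a pool of processes for CPU-bound work, while `max_concurrency` caps how many I/O requests are in flight at once. How the orchestrator applies them internally is not shown on this page. The sketch below only illustrates the common pattern for the second setting, an `asyncio.Semaphore` sized by the limit. The names `run_bounded` and `fake_io_call` are hypothetical.

```python
import asyncio

MAX_CONCURRENCY = 3  # plays the role of max_concurrency


async def run_bounded(n_tasks: int, limit: int) -> int:
    """Run n_tasks fake I/O calls and return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_io_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # at most `limit` tasks pass this point at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated network or disk wait
            in_flight -= 1

    await asyncio.gather(*(fake_io_call() for _ in range(n_tasks)))
    return peak


peak_in_flight = asyncio.run(run_bounded(n_tasks=10, limit=MAX_CONCURRENCY))
print(peak_in_flight)  # never exceeds MAX_CONCURRENCY
```

A process pool such as `concurrent.futures.ProcessPoolExecutor(max_workers=...)` would be the analogous standard-library tool for the CPU-bound side.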

Evaluation Pipeline Configs

BaseTestDataLoaderConfig

Bases: BaseModel

Base config for Test Data Loaders.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'test_data_loader.yaml'

BaseEvaluatorConfig

Bases: BaseModel

Base config for Evaluators.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'evaluator.yaml'

BaseEvaluationExporterConfig

Bases: BaseModel

Base config for Evaluation Exporters.

Implementations should subclass this to add specific parameters.

filename class-attribute
filename: str = 'evaluation_exporter.yaml'

EvaluationOrchestratorConfig

Bases: BaseModel

Configuration for the Evaluation Orchestrator.

filename class-attribute
filename: str = 'evaluation_orchestrator.yaml'
max_workers class-attribute instance-attribute
max_workers: int = Field(default=4, description='Number of processes to use for CPU-bound tasks.')
max_concurrency class-attribute instance-attribute
max_concurrency: int = Field(default=10, description='Maximum number of concurrent I/O requests allowed.')

Creating Custom Configs

Subclass the base config to add your fields:

from document_extraction_tools.config import BaseExtractorConfig

class MyExtractorConfig(BaseExtractorConfig):
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 4096
    api_key: str | None = None

Then create the corresponding YAML file:

# config/yaml/extractor.yaml
model_name: "gpt-4"
temperature: 0.1
max_tokens: 8192
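
To make the relationship between the YAML keys and the class fields concrete, the sketch below unpacks the mapping that the extractor.yaml above would parse to into a config object. To stay dependency-free it uses a dataclass stand-in, `MyExtractorConfigSketch`, which is hypothetical. The real class subclasses BaseExtractorConfig, and a pydantic model would raise `ValidationError` rather than `TypeError` when a required field is missing.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in for MyExtractorConfig; the real class subclasses BaseExtractorConfig.
@dataclass
class MyExtractorConfigSketch:
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 4096
    api_key: Optional[str] = None


# The mapping a YAML parser would produce for the extractor.yaml above.
parsed_yaml = {"model_name": "gpt-4", "temperature": 0.1, "max_tokens": 8192}

# Keys present in the YAML override defaults; absent keys keep them.
config = MyExtractorConfigSketch(**parsed_yaml)
print(config.max_tokens, config.api_key)  # prints 8192 None

# Omitting the required field fails at construction time.
try:
    MyExtractorConfigSketch(temperature=0.1)
    missing_field_rejected = False
except TypeError:
    missing_field_rejected = True
```

Because `api_key` is absent from the YAML, it keeps its default of `None`, while `model_name` has no default and must always be supplied.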