Contributing

Contributions to Document Extraction Tools are welcome!

Getting Started

Fork the repository

Clone your fork:

```
git clone https://github.com/YOUR_USERNAME/document-extraction-tools.git
cd document-extraction-tools
```

Install dependencies:
```
uv sync
```

Development Workflow

Branch Naming

Use descriptive branch names with prefixes:

feat/short-description - New features
fix/short-description - Bug fixes
docs/short-description - Documentation updates
refactor/short-description - Code refactoring
test/short-description - Test additions/updates
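
As a rough illustration of the convention above, a branch name can be checked with a small regular expression. This is only a sketch: the prefixes come from the list above, but the function name and the rule for the description part (lowercase letters, digits, and hyphens) are assumptions, not something the project enforces.

```python
import re

# Prefixes taken from the list above.
ALLOWED_PREFIXES = ("feat", "fix", "docs", "refactor", "test")

# Assumption: the description is a lowercase, hyphen-separated slug.
BRANCH_PATTERN = re.compile(
    rf"(?:{'|'.join(ALLOWED_PREFIXES)})/[a-z0-9]+(?:-[a-z0-9]+)*"
)


def is_valid_branch_name(name: str) -> bool:
    """Return True if the name is '<prefix>/<slug>' with an allowed prefix."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("feat/add-pdf-reader", "feature/add-pdf-reader", "fix/"):
        print(candidate, is_valid_branch_name(candidate))
```

Under these assumptions, `feat/add-pdf-reader` passes, while `feature/add-pdf-reader` (unknown prefix) and `fix/` (empty description) do not.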

Running Tests

```
uv run pytest
```
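
For contributors new to pytest, the sketch below shows the general shape of a test module. By default pytest collects functions named `test_*` from files named `test_*.py`. The file path, the helper function, and the test names here are hypothetical and do not come from this repository.

```python
# tests/test_text_utils.py  (hypothetical path, for illustration only)


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    # Hypothetical helper; in a real test it would be imported from the package.
    return " ".join(text.split())


def test_normalize_whitespace_collapses_runs() -> None:
    assert normalize_whitespace("total:\t 42 \n EUR") == "total: 42 EUR"


def test_normalize_whitespace_handles_empty_input() -> None:
    assert normalize_whitespace("") == ""
```

A subset of tests can be selected with pytest's `-k` option, for example `uv run pytest -k normalize`.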

Linting and Formatting

Run pre-commit hooks before committing:

```
uv run pre-commit run --all-files
```

This runs:

Ruff - Linting and formatting
Type checking - Via pyright/mypy

Code Style

The project uses:

Ruff for linting and formatting
Google-style docstrings
Type hints throughout

Example:

```python
from abc import ABC, abstractmethod

# Project types (BaseExtractorConfig, ExtractionPipelineConfig,
# EvaluationPipelineConfig, Document, ExtractionSchema, ExtractionResult,
# PipelineContext) are imported from the package; omitted here for brevity.


class BaseExtractor(ABC):
    """Abstract interface for data extraction."""

    def __init__(
        self,
        config: BaseExtractorConfig | ExtractionPipelineConfig | EvaluationPipelineConfig,
    ) -> None:
        """Initialize with a configuration object.

        Args:
            config: Component-specific config or full pipeline configuration.
        """
        # A full pipeline config carries the extractor config as an attribute.
        if isinstance(config, (ExtractionPipelineConfig, EvaluationPipelineConfig)):
            self.pipeline_config = config
            self.config = config.extractor
        else:
            self.pipeline_config = None
            self.config = config

    @abstractmethod
    async def extract(
        self,
        document: Document,
        schema: type[ExtractionSchema],
        context: PipelineContext | None = None,
    ) -> ExtractionResult[ExtractionSchema]:
        """Extract structured data from a Document to match the provided schema.

        Args:
            document: The fully parsed document.
            schema: The Pydantic model class defining the target structure.
            context: Optional shared pipeline context.

        Returns:
            An ExtractionResult containing the extracted data.
        """
```
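
To show how a concrete class might follow the same style, here is a self-contained sketch of a subclass. Everything in it is a simplified stand-in: `Document`, `ExtractionResult`, and `TitleSchema` are hypothetical dataclasses (the real schema would be a Pydantic model), the base class omits config and context handling, and `FirstLineExtractor` is a toy, not a component of the project.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

SchemaT = TypeVar("SchemaT")


@dataclass
class Document:
    """Hypothetical stand-in for the project's parsed document type."""

    text: str


@dataclass
class ExtractionResult(Generic[SchemaT]):
    """Hypothetical stand-in for the project's result wrapper."""

    data: SchemaT


@dataclass
class TitleSchema:
    """Toy schema; a dataclass stands in for a Pydantic model here."""

    title: str


class BaseExtractor(ABC):
    """Simplified copy of the interface, without config or context handling."""

    @abstractmethod
    async def extract(
        self, document: Document, schema: type[TitleSchema]
    ) -> ExtractionResult[TitleSchema]:
        """Extract structured data from a document."""


class FirstLineExtractor(BaseExtractor):
    """Toy extractor that uses the first non-empty line as the title."""

    async def extract(
        self, document: Document, schema: type[TitleSchema]
    ) -> ExtractionResult[TitleSchema]:
        """Extract the title from the first non-empty line.

        Args:
            document: The document to read.
            schema: The schema class to instantiate.

        Returns:
            An ExtractionResult wrapping the populated schema.
        """
        lines = (line.strip() for line in document.text.splitlines())
        first_line = next((line for line in lines if line), "")
        return ExtractionResult(data=schema(title=first_line))


if __name__ == "__main__":
    doc = Document(text="\nInvoice 42\nTotal: 10 EUR")
    result = asyncio.run(FirstLineExtractor().extract(doc, TitleSchema))
    print(result.data.title)  # Invoice 42
```

The subclass keeps the conventions listed above: type hints on every parameter and return value, and Google-style `Args:` and `Returns:` sections in the docstring.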

Pull Request Process

Create a new branch from main
Make your changes

Run tests and linting:

```
uv run pre-commit run --all-files
uv run pytest
```

Commit with clear, descriptive messages
Push to your fork
Open a PR against main
Fill out the PR template with a description of the changes, any related issues, and the testing performed

Reporting Issues

Open an issue on GitHub with:

Clear description of the problem
Steps to reproduce
Expected vs actual behavior
Environment details (Python version, OS, etc.)
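
The environment details can be gathered with Python's standard library. The snippet below is a minimal sketch; the function name and the exact fields are assumptions about what is useful in a report, not a format the project requires.

```python
import platform
import sys


def environment_report() -> str:
    """Build a short, paste-ready summary of the local environment."""
    lines = [
        f"Python: {platform.python_version()} ({platform.python_implementation()})",
        f"OS: {platform.system()} {platform.release()}",
        f"Machine: {platform.machine()}",
        f"Executable: {sys.executable}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Running it inside the project environment, so that it reports the interpreter the tests actually use, gives output that can be pasted directly into the issue.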

Maintainers

Feel free to reach out if you have questions about contributing!