
Contributing

Contributions to Document Extraction Tools are welcome!

Getting Started

  1. Fork the repository
  2. Clone your fork:
    git clone https://github.com/YOUR_USERNAME/document-extraction-tools.git
    cd document-extraction-tools
    
  3. Install dependencies:
    uv sync
    

Development Workflow

Branch Naming

Use descriptive branch names with prefixes:

  • feat/short-description - New features
  • fix/short-description - Bug fixes
  • docs/short-description - Documentation updates
  • refactor/short-description - Code refactoring
  • test/short-description - Test additions/updates
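
As a rough illustration of the convention, the prefixes above can be checked with a few lines of Python. This is a sketch only: the `is_valid_branch_name` helper and the rule that the description is lowercase words joined by hyphens are assumptions made for the example, not something the project is known to enforce.

```python
import re

# Prefixes taken from the list above.
ALLOWED_PREFIXES = ("feat", "fix", "docs", "refactor", "test")

# Assumption: the description part is lowercase words joined by hyphens.
BRANCH_PATTERN = re.compile(
    rf"(?:{'|'.join(ALLOWED_PREFIXES)})/[a-z0-9]+(?:-[a-z0-9]+)*"
)


def is_valid_branch_name(name: str) -> bool:
    """Check whether a branch name follows the prefix convention.

    Args:
        name: The branch name to check, e.g. "feat/add-pdf-reader".

    Returns:
        True if the name is an allowed prefix followed by a
        hyphen-separated description, False otherwise.
    """
    return BRANCH_PATTERN.fullmatch(name) is not None
```

Under these assumptions, `is_valid_branch_name("feat/add-pdf-reader")` is accepted, while `feature/add-pdf-reader` (unknown prefix) and `fix/` (missing description) are rejected.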

Running Tests

uv run pytest

Linting and Formatting

Run pre-commit hooks before committing:

uv run pre-commit run --all-files

This runs:

  • Ruff - Linting and formatting
  • Type checking - Via pyright/mypy

Code Style

The project uses:

  • Ruff for linting and formatting
  • Google-style docstrings
  • Type hints throughout

Example:

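from abc import ABC, abstractmethod

# The config, Document, schema, and result types are project classes;
# their imports are omitted here.


class BaseExtractor(ABC):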
    """Abstract interface for data extraction."""

    def __init__(
        self,
        config: BaseExtractorConfig | ExtractionPipelineConfig | EvaluationPipelineConfig,
    ) -> None:
        """Initialize with a configuration object.

        Args:
            config: Component-specific config or full pipeline configuration.
        """
        if isinstance(config, (ExtractionPipelineConfig, EvaluationPipelineConfig)):
            self.pipeline_config = config
            self.config = config.extractor
        else:
            self.pipeline_config = None
            self.config = config

    @abstractmethod
    async def extract(
        self,
        document: Document,
        schema: type[ExtractionSchema],
        context: PipelineContext | None = None,
    ) -> ExtractionResult[ExtractionSchema]:
        """Extracts structured data from a Document to match the provided Schema.

        Args:
            document: The fully parsed document.
            schema: The Pydantic model class defining the target structure.
            context: Optional shared pipeline context.

        Returns:
            An ExtractionResult containing the extracted data.
        """
        pass
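
The same conventions apply to plain functions. The sketch below is a hypothetical helper (`chunk_text` is not part of the project) that shows type hints together with a Google-style docstring, including a `Raises` section, which the class example above does not need.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into consecutive chunks of at most max_chars characters.

    Hypothetical helper used only to illustrate the docstring style.

    Args:
        text: The text to split.
        max_chars: Maximum length of each chunk. Must be positive.

    Returns:
        The chunks in their original order. An empty input yields an
        empty list.

    Raises:
        ValueError: If max_chars is not positive.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Step through the text in strides of max_chars; the last chunk may be shorter.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```

The summary line is a single sentence, each argument is documented under `Args`, and the error condition is listed under `Raises` rather than buried in the description.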

Pull Request Process

  1. Create a new branch from main
  2. Make your changes
  3. Run tests and linting:
    uv run pre-commit run --all-files
    uv run pytest
    
  4. Commit with clear, descriptive messages
  5. Push to your fork
  6. Open a PR against main
  7. Fill out the PR template with:
    • Description of changes
    • Related issues
    • Testing performed

Reporting Issues

Open an issue on GitHub with:

  • Clear description of the problem
  • Steps to reproduce
  • Expected vs actual behavior
  • Environment details (Python version, OS, etc.)

Maintainers

Feel free to reach out if you have questions about contributing!