Sourcery is a Python LLM extraction framework for turning unstructured text, PDFs, HTML, URLs, and images (via VLM-based OCR) into typed, source-grounded Pydantic data.
Define your extraction schema with Pydantic, run chunked model extraction, align every result back to source spans, optionally reconcile mentions into canonical claims, and export results to JSONL or review them in the built-in HTML reviewer.
Sourcery is for people building:
- document AI pipelines,
- compliance and legal extraction systems,
- financial filing intelligence,
- contract and policy analyzers,
- review workflows with human approval.
Core idea:
- Define extraction contracts in Pydantic v2.
- Run deterministic chunked extraction with LLM structured output.
- Align results to source offsets.
- Reconcile at document-level into canonical claims.
- Review/export via JSONL + HTML reviewer.
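The chunked-extraction step above can be illustrated with a deterministic character-window splitter that preserves absolute source offsets (a simplified sketch; the function name and overlap parameter are illustrative assumptions, not Sourcery's API):

```python
def chunk_text(text: str, max_chars: int = 4500, overlap: int = 200) -> list[dict]:
    """Split text into overlapping windows, keeping absolute source offsets."""
    chunks: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if end == len(text):
            break
        # Overlap windows so an entity straddling a boundary appears whole
        # in at least one chunk.
        start = end - overlap
    return chunks
```

Because the split depends only on the input text and two parameters, the same document always produces the same chunks, which is what makes downstream alignment and merging reproducible.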
Sourcery is optimized for type safety + runtime reliability + deterministic post-processing.
- Pydantic contracts are first-class (`EntitySpec.attributes_model`).
- BlackGeorge-native runtime orchestration (no custom provider router stack).
- Deterministic alignment statuses (`exact`, `fuzzy`, `partial`, `unresolved`).
- Deterministic merge behavior across passes.
- Typed error taxonomy for provider/runtime/pipeline/ingestion failures.
- Run replay via BlackGeorge run store.
- Built-in reviewer UI (search/filter/approve/export).
- Document-level reconciliation support with BlackGeorge Workforce + Blackboard + resolver worker.
Sourcery is an application layer on top of BlackGeorge runtime primitives (Desk, Flow, Worker, Workforce, RunStore, EventBus).
- Sourcery handles extraction domain logic.
- BlackGeorge handles model execution, workflow orchestration, events, pause/resume, and run storage.
This means BlackGeorge is a hard runtime dependency in this project.
- Schema-first extraction with Pydantic models.
- Ingestion adapters: text, file, PDF, HTML, URL, VLM-based image OCR.
- Deterministic chunking and alignment.
- Multi-pass extraction with stop-when-no-new-results.
- Cross-chunk refinement and document-level reconciliation.
- Session-based refinement mode.
- Reviewer HTML UI + export to JSONL/CSV.
- Real async extraction (native async/await, no thread pools).
- Streaming extraction — yields results per chunk as they land.
- Run tracing and replay.
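The multi-pass stop-when-no-new-results behavior can be sketched generically; `extract_fn` stands in for a structured-output model call, and all names here are illustrative rather than Sourcery's API:

```python
from typing import Callable

def multi_pass_extract(
    chunks: list[str],
    extract_fn: Callable[[str], set[str]],
    max_passes: int = 3,
) -> set[str]:
    """Re-run extraction over all chunks until a pass adds nothing new."""
    seen: set[str] = set()
    for _ in range(max_passes):
        new: set[str] = set()
        for chunk in chunks:
            new |= extract_fn(chunk) - seen
        if not new:  # stop-when-no-new-results
            break
        seen |= new
    return seen
```

Bounding the loop with `max_passes` keeps cost predictable even when the model keeps surfacing marginal new results.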
```
uv sync --extra dev --extra ingest
```

PyPI distribution name: sourceryforge
Python import path: sourcery
Install from PyPI:
```
pip install sourceryforge
```

Or with uv:

```
uv add sourceryforge
```

If you want benchmark tooling:

```
uv sync --extra benchmark
```

Set your provider key (example):

```
export DEEPSEEK_API_KEY="..."
```

Set `RuntimeConfig.model` to a provider/model route supported by your BlackGeorge runtime setup.
Run the benchmark from this repo root:
```
uv run sourcery-benchmark --text-types english,japanese,french,spanish --max-chars 4500 --max-passes 2 --sourcery-model deepseek/deepseek-chat
```

Run it from any directory:

```
uv run --project /path/to/sourcery sourcery-benchmark --text-types english
```

Or run the compatibility wrapper:

```
uv run benchmark_compare.py --text-types english
```

Output JSON is written to `benchmark_results/` and includes:
- run settings,
- tokenization throughput table,
- per-language extraction metrics,
- aggregate framework summaries.
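Reading those artifacts back is plain JSON work. The sketch below formats per-language metrics; the key names (`languages`, `grounded`, `unresolved`) are assumptions for illustration, so check the actual files in `benchmark_results/` for the real schema:

```python
import json
from pathlib import Path

def summarize_benchmark(data: dict) -> list[str]:
    """Format per-language extraction metrics from a benchmark JSON artifact."""
    return [
        f"{lang}: grounded={m.get('grounded')} unresolved={m.get('unresolved')}"
        for lang, m in data.get("languages", {}).items()
    ]

# Load one artifact and print its summary (path is illustrative):
# data = json.loads(Path("benchmark_results/run.json").read_text())
# print("\n".join(summarize_benchmark(data)))
```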
The benchmark runner compares Sourcery with LangExtract using a similar Gutenberg sampling flow. It is not a byte-for-byte clone of LangExtract's benchmark script.
- Ported: Gutenberg text sampling flow, per-language extraction runs, retry behavior, timing, grounded/unresolved metrics, JSON output artifacts.
```python
from pydantic import BaseModel

import sourcery
from sourcery.contracts import (
    EntitySchemaSet,
    EntitySpec,
    ExtractRequest,
    ExtractionExample,
    ExtractionTask,
    ExampleExtraction,
    RuntimeConfig,
)

class PersonAttrs(BaseModel):
    role: str | None = None

request = ExtractRequest(
    documents="Alice is the CEO of Acme.",
    task=ExtractionTask(
        instructions="Extract people.",
        schema=EntitySchemaSet(
            entities=[EntitySpec(name="person", attributes_model=PersonAttrs)]
        ),
        examples=[
            ExtractionExample(
                text="Bob is the CTO.",
                extractions=[
                    ExampleExtraction(entity="person", text="Bob", attributes={"role": "CTO"})
                ],
            )
        ],
    ),
    runtime=RuntimeConfig(model="deepseek/deepseek-chat"),
)

result = sourcery.extract(request)
print(result.metrics.model_dump(mode="json"))
```

More examples: CODE_EXAMPLES.md
Full usage and API guide: USAGE.md
Notebook workflows: examples/notebooks/sourcery_quickstart.ipynb, examples/notebooks/sourcery_pdf_workflow.ipynb
- `sourcery/contracts`: public types and contracts.
- `sourcery/pipeline`: chunking, prompt compiler, aligner, merger.
- `sourcery/runtime`: engine + BlackGeorge runtime integration.
- `sourcery/ingest`: document loaders and adapters.
- `sourcery/io`: JSONL, visualization, reviewer UI.
- `sourcery/observability`: run trace collection.
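The JSONL handling in `sourcery/io` follows the usual JSON Lines convention of one compact JSON object per line. A generic sketch of that convention (not Sourcery's exporter):

```python
import json

def dumps_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records) + "\n"

# Writing an export file is then:
# with open("out.jsonl", "w", encoding="utf-8") as f:
#     f.write(dumps_jsonl(records))
```

One object per line keeps exports streamable and diff-friendly, which suits review workflows.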
```
uv run --extra dev pytest -q
uv run --extra dev ruff check sourcery tests
uv run --extra dev mypy sourcery
```

Build and serve project docs with MkDocs:
```
uv run --extra docs mkdocs serve
uv run --extra docs mkdocs build --strict
```

- Regulatory compliance extraction.
- SEC filing and earnings-call intelligence.
- Contract clause extraction and renewal tracking.
- Policy change monitoring.
- Research paper benchmark extraction.
- Incident/postmortem structure mining.
Licensed under the MIT License. See LICENSE.