Chronoscribe

OCR text restoration pipeline for digitized historical texts.

Uses Coleman Dimensional Encoding (CDE) for deterministic, O(1)-per-token error-correction decisions.

Accuracy is unmeasured. This repository contains no ground-truth corpus and no CER/WER (character/word error-rate) measurement code, so no accuracy claim is made. See PERFORMANCE.md for the measured timings and known hotspots.

Background

CDE Implementation

Dimensions

Layout: per-token (x, y, width, height, page) coordinates.
Text: raw token string and normalized form.
Confidence: OCR confidence score per token or segment.
Structure: block/line/word hierarchy identifiers and reading order.
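
As an illustration only, the four dimensions could be carried on one immutable record per token. This is a sketch, not the project's data model: the class name `Token`, every field name, the sample values, and the assumption that confidence lies in [0.0, 1.0] are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One OCR token carrying the four dimensions (hypothetical field names)."""

    # Layout: bounding box in page coordinates, plus the page index.
    x: int
    y: int
    width: int
    height: int
    page: int
    # Text: raw OCR string and its normalized form.
    raw: str
    normalized: str
    # Confidence: OCR confidence, assumed here to lie in [0.0, 1.0].
    confidence: float
    # Structure: hierarchy identifiers and position in reading order.
    block_id: str
    line_id: str
    word_id: str
    reading_order: int


# Made-up sample values, for illustration only.
token = Token(
    x=120, y=340, width=58, height=22, page=1,
    raw="tbe", normalized="tbe",
    confidence=0.41,
    block_id="block_1_3", line_id="line_1_12", word_id="word_1_87",
    reading_order=87,
)
```

A frozen dataclass is used here so that each stage has to produce a new record instead of mutating one in place, which is one way to keep transforms deterministic.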

Query Workload

Chronoscribe is optimized for:

Deterministic cleaning of hOCR into plain-text or structured formats.
"Given coordinates, find the token/line/block at that position."
"Filter or correct tokens below a confidence threshold."
"Reconstruct reading order from spatial layout and hierarchy."

Minimal Sufficient Statistic

For deterministic OCR cleaning, the minimal sufficient statistic is:

Layout,
Text content,
Confidence, and
Structural hierarchy.

All downstream tasks (reflow, de-noising, format export) are deterministic transforms of these four dimensions, so this CDE is a minimal sufficient statistic (MSS) for the document cleaning workload.

CDE Correction Decision Table

The correction stage (Stage 5) quantizes the per-token data above into a separate 5-axis signature and looks the action up in a pre-computed table:

Axis	Cardinality	Values
Confidence bin	3	high / medium / low
Visual confusion	2	true / false
In dictionary	2	true / false
Layout context	4	end_of_line / start_para / mid_line / isolated
Domain status	2	domain_core / generic

Total table size: 3 × 2 × 2 × 4 × 2 = 96 entries (verified at runtime via CDEDecisionTable.table_size). Each per-token decision is therefore an O(1) dictionary lookup. This guarantee follows from the finite axis cardinalities and is confirmed by measurement: the table reports 96 entries.
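
The table-size arithmetic can be reproduced with a short sketch that enumerates every signature ahead of time. The axis values are taken from the table above; everything else is an assumption made for illustration: the names `AXES`, `confidence_bin`, `choose_action`, `TABLE`, and `decide`, the 0.9 / 0.6 confidence cut-offs, and the keep / correct / review policy. None of this is the actual CDEDecisionTable.

```python
from itertools import product

# Axis values as listed in the table above.
AXES = {
    "confidence_bin": ("high", "medium", "low"),
    "visual_confusion": (True, False),
    "in_dictionary": (True, False),
    "layout_context": ("end_of_line", "start_para", "mid_line", "isolated"),
    "domain_status": ("domain_core", "generic"),
}


def confidence_bin(score: float) -> str:
    """Quantize a confidence score; the 0.9 / 0.6 cut-offs are assumptions."""
    if score >= 0.9:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"


def choose_action(conf, confusable, in_dict, layout, domain):
    """Placeholder policy (assumption), not the project's actual rules."""
    if domain == "domain_core" or (conf == "high" and in_dict):
        return "keep"
    if conf == "low" and confusable and not in_dict:
        return "correct"
    return "review"


# Enumerate all signatures once, so each later decision is a plain lookup.
TABLE = {sig: choose_action(*sig) for sig in product(*AXES.values())}


def decide(signature):
    """Per-token decision: a single dict lookup (O(1) on average)."""
    return TABLE[signature]


signature = (confidence_bin(0.41), True, False, "mid_line", "generic")
action = decide(signature)
table_size = len(TABLE)  # 3 * 2 * 2 * 4 * 2
```

Because every axis is finite, the table is fully populated before any token is seen, and the lookup cost does not depend on document length.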

Install

pip install -e .

Requirements

Python 3.11+
lxml (hOCR parsing), wordfreq (context-stage frequency scoring)

Note: the domain stage corrects phrases with a case-insensitive, boundary-aware string scan (no regular expressions, no Aho-Corasick automaton). See ARCHITECTURE.md.
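
One way such a scan could look, using only `str.lower` and `str.find`, is sketched below. This is not the domain stage's code: `is_boundary` and `replace_phrase` are hypothetical names, a boundary is assumed to mean "start or end of text, or a non-alphanumeric character", and the sketch refuses input whose length changes under lowercasing because offsets would no longer align.

```python
def is_boundary(text: str, index: int) -> bool:
    """True if index is outside the text or points at a non-alphanumeric char."""
    return index < 0 or index >= len(text) or not text[index].isalnum()


def replace_phrase(text: str, wrong: str, right: str) -> str:
    """Replace whole-word, case-insensitive occurrences of `wrong` with `right`."""
    if not wrong:
        return text
    haystack, needle = text.lower(), wrong.lower()
    if len(haystack) != len(text):
        # Some non-ASCII characters change length when lowercased.
        raise ValueError("lowercasing changed the text length; offsets would not align")
    out = []
    pos = 0
    while True:
        hit = haystack.find(needle, pos)
        if hit == -1:
            break
        end = hit + len(needle)
        if is_boundary(text, hit - 1) and is_boundary(text, end):
            # Whole-word match: copy the text before it, then the correction.
            out.append(text[pos:hit])
            out.append(right)
            pos = end
        else:
            # Match is embedded in a longer word: keep one char and move on.
            out.append(text[pos:hit + 1])
            pos = hit + 1
    out.append(text[pos:])
    return "".join(out)


fixed = replace_phrase("The Holy Gbost and the gbostly", "holy gbost", "Holy Ghost")
```

In this example the phrase is corrected while the trailing "gbostly" is left alone, since no phrase boundary follows "gbost" there.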

Usage

python -m hocr_clean.cli input.hocr.html -o output.md

Architecture

7-stage pipeline (see hocr_clean/cli.py:process_file): Parse → Layout → Hyphenation → Typography → CDE Confidence → Domain → Context

See ARCHITECTURE.md for system design and PERFORMANCE.md for measured timings.

Development

pip install -e ".[dev]"
pytest tests/
ruff check hocr_clean/
mypy hocr_clean/

Contributing

See CONTRIBUTING.md for the code review process.
