OCR text restoration pipeline for digitized historical texts.
Uses Coleman Dimensional Encoding (CDE) for deterministic, O(1)-per-token error-correction decisions.
Accuracy is unmeasured. This repository contains no ground-truth corpus and no CER/WER (character/word error-rate) measurement code, so no accuracy claim is made. See PERFORMANCE.md for the measured timings and known hotspots.
- Layout: per-token (x, y, width, height, page) coordinates.
- Text: raw token string and normalized form.
- Confidence: OCR confidence score per token or segment.
- Structure: block/line/word hierarchy identifiers and reading order.
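
As an illustration only, the four dimensions above could be carried in a single immutable record. The class name `Token` and every field name below are assumptions made for this sketch, not identifiers taken from the `hocr_clean` package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical per-token record; field names are illustrative."""

    # Layout: bounding box in page pixels plus the page index.
    x: int
    y: int
    width: int
    height: int
    page: int
    # Text: the raw OCR string and its normalized form.
    raw: str
    normalized: str
    # Confidence: OCR confidence score for this token.
    confidence: float
    # Structure: hierarchy identifiers and position in reading order.
    block_id: str
    line_id: str
    word_id: str
    reading_order: int


tok = Token(
    x=120, y=340, width=58, height=22, page=1,
    raw="Tbe", normalized="tbe",
    confidence=0.41,
    block_id="block_1_1", line_id="line_1_3", word_id="word_1_17",
    reading_order=17,
)
```

Freezing the dataclass keeps records hashable and prevents a later stage from mutating a token in place, which suits a pipeline that is meant to be deterministic.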
ChronoScribe is optimized for:
- Deterministic cleaning of hOCR into plain-text or structured formats.
- "Given coordinates, find the token/line/block at that position."
- "Filter or correct tokens below a confidence threshold."
- "Reconstruct reading order from spatial layout and hierarchy."
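
The three queries above can be sketched with plain Python over a list of tokens. This is a minimal sketch under stated assumptions: `Tok`, `token_at`, `low_confidence`, and `naive_reading_order` are hypothetical names, the position lookup is a linear scan, and the reading order assumes a single-column page read top to bottom and left to right. It does not describe how ChronoScribe implements these operations.

```python
from typing import NamedTuple, Optional


class Tok(NamedTuple):
    """Hypothetical minimal token: text, bounding box, page, confidence."""
    text: str
    x: int
    y: int
    width: int
    height: int
    page: int
    confidence: float


def token_at(tokens: list[Tok], page: int, px: int, py: int) -> Optional[Tok]:
    """Return the first token whose bounding box contains (px, py) on `page`."""
    for t in tokens:
        inside_x = t.x <= px < t.x + t.width
        inside_y = t.y <= py < t.y + t.height
        if t.page == page and inside_x and inside_y:
            return t
    return None


def low_confidence(tokens: list[Tok], threshold: float) -> list[Tok]:
    """Return tokens whose confidence is strictly below `threshold`."""
    return [t for t in tokens if t.confidence < threshold]


def naive_reading_order(tokens: list[Tok]) -> list[Tok]:
    """Sort by page, then top-to-bottom, then left-to-right (single column)."""
    return sorted(tokens, key=lambda t: (t.page, t.y, t.x))


tokens = [
    Tok("world", 60, 10, 40, 12, 1, 0.95),
    Tok("Hcllo", 10, 10, 40, 12, 1, 0.42),
]
hit = token_at(tokens, page=1, px=15, py=15)
suspects = low_confidence(tokens, threshold=0.5)
ordered = naive_reading_order(tokens)
```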
For deterministic OCR cleaning, the minimal sufficient statistic is:
- Layout,
- Text content,
- Confidence, and
- Structural hierarchy.
All downstream tasks (reflow, de-noising, format export) are deterministic transforms of these four dimensions, so this CDE is a minimal sufficient statistic (MSS) for the document-cleaning workload.
The correction stage (Stage 5) quantizes the per-token data above into a separate 5-axis signature and looks the action up in a pre-computed table:
| Axis | Cardinality | Values |
|---|---|---|
| Confidence bin | 3 | high / medium / low |
| Visual confusion | 2 | true / false |
| In dictionary | 2 | true / false |
| Layout context | 4 | end_of_line / start_para / mid_line / isolated |
| Domain status | 2 | domain_core / generic |
Total table size: 3 × 2 × 2 × 4 × 2 = 96 entries (verified at runtime via
CDEDecisionTable.table_size). Each per-token decision is therefore an O(1) dictionary
lookup. (Guarantee: the size is derivable from the finite axis cardinalities and is also
measured; the table reports 96 entries.)
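
To make the arithmetic concrete, the sketch below enumerates the five axes from the table and pre-computes one action per signature. The axis values come from the table above; the `choose_action` policy, the action names (`keep`, `correct`, `flag`), and the tuple ordering are placeholders invented for illustration and are not the rules used by `CDEDecisionTable`.

```python
from itertools import product

# Axis values copied from the table above; order defines the signature tuple.
AXES = {
    "confidence_bin": ("high", "medium", "low"),
    "visual_confusion": (True, False),
    "in_dictionary": (True, False),
    "layout_context": ("end_of_line", "start_para", "mid_line", "isolated"),
    "domain_status": ("domain_core", "generic"),
}


def choose_action(conf, confusion, in_dict, layout, domain) -> str:
    """Placeholder policy (hypothetical): decide an action for one signature."""
    if domain == "domain_core" or (conf == "high" and in_dict):
        return "keep"
    if conf == "low" and confusion and not in_dict:
        return "correct"
    return "flag"


# Pre-compute every combination once; lookups afterwards are dict accesses.
TABLE = {sig: choose_action(*sig) for sig in product(*AXES.values())}

table_size = len(TABLE)
action = TABLE[("low", True, False, "mid_line", "generic")]
```

Because the table is built once from a finite product, the per-token cost at correction time is a single hash lookup regardless of document length.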
pip install -e .

- Python 3.11+
- lxml (hOCR parsing), wordfreq (context-stage frequency scoring)
Note: the domain stage corrects phrases with a case-insensitive, boundary-aware string scan (no regular expressions, no Aho-Corasick automaton). See ARCHITECTURE.md.
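
One way such a scan could look is sketched below, using only `str.lower` and `str.find`. The function name `replace_phrase` is hypothetical, a "boundary" is assumed to mean a non-alphanumeric neighbour or the end of the string, and the sketch assumes lowercasing does not change string length (true for ASCII, not for every Unicode character). The actual domain stage may differ; ARCHITECTURE.md is the authority.

```python
def replace_phrase(text: str, phrase: str, replacement: str) -> str:
    """Replace whole-phrase, case-insensitive matches without regex.

    Assumes text.lower() has the same length as text, so indices found
    in the lowered copy are valid in the original string.
    """
    if not phrase:
        return text
    haystack = text.lower()
    needle = phrase.lower()
    out: list[str] = []
    pos = 0
    while True:
        hit = haystack.find(needle, pos)
        if hit == -1:
            break
        end = hit + len(needle)
        # A match counts only if both neighbours are non-alphanumeric.
        left_ok = hit == 0 or not text[hit - 1].isalnum()
        right_ok = end == len(text) or not text[end].isalnum()
        if left_ok and right_ok:
            out.append(text[pos:hit])
            out.append(replacement)
            pos = end
        else:
            # Not on a boundary: keep one character and resume scanning.
            out.append(text[pos:hit + 1])
            pos = hit + 1
    out.append(text[pos:])
    return "".join(out)


fixed = replace_phrase("Tbe court and tbeory", "tbe", "the")
```

Note that the replacement is inserted verbatim, so the original capitalization of a matched phrase is not preserved in this sketch.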
python -m hocr_clean.cli input.hocr.html -o output.md

7-stage pipeline (see hocr_clean/cli.py:process_file):
Parse → Layout → Hyphenation → Typography → CDE Confidence → Domain → Context
See ARCHITECTURE.md for system design and PERFORMANCE.md for measured timings.
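
The stage order above can be pictured as a left-to-right fold over a list of callables. In this sketch only the seven stage names come from this README; `make_stage`, `run_pipeline`, and the idea that every stage maps a token list to a token list are simplifying assumptions (the real Parse stage, for instance, reads an hOCR file), and the stages here do nothing except record that they ran.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[list[str]], list[str]]

# Stage names in the order given above.
STAGE_NAMES = [
    "Parse", "Layout", "Hyphenation", "Typography",
    "CDE Confidence", "Domain", "Context",
]


def make_stage(name: str, log: list[str]) -> Stage:
    """Build a placeholder stage that records its name and passes tokens on."""
    def stage(tokens: list[str]) -> list[str]:
        log.append(name)
        return tokens
    return stage


def run_pipeline(tokens: list[str], stages: list[Stage]) -> list[str]:
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, tokens)


log: list[str] = []
stages = [make_stage(name, log) for name in STAGE_NAMES]
result = run_pipeline(["tbe", "court"], stages)
```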
pip install -e ".[dev]"
pytest tests/
ruff check hocr_clean/
mypy hocr_clean/

See CONTRIBUTING.md for the code review process.
Apache-2.0 © 2026 Jacob Coleman — See LICENSE for details.