chronomancy-io/chronoscribe

Chronoscribe

OCR text restoration pipeline for digitized historical texts.

standard-readme compliant License WASP v1.0.0 CDE v1.0.0 MSS v1.0.0

Uses Coleman Dimensional Encoding (CDE) for deterministic, O(1)-per-token error-correction decisions.

Accuracy is unmeasured. This repository contains no ground-truth corpus and no CER/WER (character/word error-rate) measurement code, so no accuracy claim is made. See PERFORMANCE.md for the measured timings and known hotspots.

Background

CDE Implementation

Dimensions

  • Layout: per-token (x, y, width, height, page) coordinates.
  • Text: raw token string and normalized form.
  • Confidence: OCR confidence score per token or segment.
  • Structure: block/line/word hierarchy identifiers and reading order.
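The four dimensions above can be pictured as a single per-token record. This is a minimal sketch, not the repository's actual data model: the class name `Token`, every field name, the pixel units, and the 0 to 100 confidence scale (as commonly reported by hOCR `x_wconf`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical per-token record covering the four dimensions."""

    # Layout: bounding box and page (pixel units are an assumption).
    x: int
    y: int
    width: int
    height: int
    page: int
    # Text: raw OCR output and its normalized form.
    raw: str
    normalized: str
    # Confidence: assumed to be on a 0-100 scale.
    confidence: float
    # Structure: hierarchy identifiers and position in reading order.
    block_id: str
    line_id: str
    word_id: str
    reading_order: int


tok = Token(
    x=120, y=48, width=64, height=18, page=1,
    raw="tbe", normalized="tbe",
    confidence=41.0,
    block_id="block_1_1", line_id="line_1_3", word_id="word_1_17",
    reading_order=17,
)
```

Freezing the dataclass keeps records immutable, which fits a pipeline whose stages are meant to be deterministic transforms.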

Query Workload

Chronoscribe is optimized for:

  • Deterministic cleaning of hOCR into plain-text or structured formats.
  • "Given coordinates, find the token/line/block at that position."
  • "Filter or correct tokens below a confidence threshold."
  • "Reconstruct reading order from spatial layout and hierarchy."

Minimal Sufficient Statistic

For deterministic OCR cleaning, the minimal sufficient statistic is:

  • Layout,
  • Text content,
  • Confidence, and
  • Structural hierarchy.

All downstream tasks (reflow, de-noising, format export) are deterministic transforms of these four dimensions, so this CDE is an MSS for the document cleaning workload.

CDE Correction Decision Table

The correction stage (Stage 5) quantizes the per-token data above into a separate 5-axis signature and looks the action up in a pre-computed table:

| Axis | Cardinality | Values |
| --- | --- | --- |
| Confidence bin | 3 | high / medium / low |
| Visual confusion | 2 | true / false |
| In dictionary | 2 | true / false |
| Layout context | 4 | end_of_line / start_para / mid_line / isolated |
| Domain status | 2 | domain_core / generic |

Total table size: 3 × 2 × 2 × 4 × 2 = 96 entries (verified at runtime via CDEDecisionTable.table_size). Each per-token decision is therefore an O(1) dictionary lookup. This is a guarantee: it follows from the finite axis cardinalities and is confirmed by measurement, since the table reports 96 entries.
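The table-size arithmetic can be reproduced by enumerating the Cartesian product of the five axes. The sketch below is not the `CDEDecisionTable` implementation: the confidence thresholds, the action names (`keep`, `correct`, `flag`), and the policy in `choose_action` are all hypothetical. Only the axis values and the count of 96 come from the table above.

```python
from itertools import product

# Axis values as listed in the decision table.
CONFIDENCE = ("high", "medium", "low")
VISUAL_CONFUSION = (True, False)
IN_DICTIONARY = (True, False)
LAYOUT = ("end_of_line", "start_para", "mid_line", "isolated")
DOMAIN = ("domain_core", "generic")


def confidence_bin(score: float) -> str:
    """Quantize a 0-100 score into a bin (thresholds are assumptions)."""
    if score >= 90:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


def choose_action(conf, confusable, in_dict, layout, domain) -> str:
    """Hypothetical policy used only to fill the table."""
    if in_dict or domain == "domain_core":
        return "keep"
    if conf == "low" and confusable:
        return "correct"
    if conf == "high":
        return "keep"
    return "flag"


# Pre-compute one action per signature; lookups are then plain dict reads.
TABLE = {
    sig: choose_action(*sig)
    for sig in product(CONFIDENCE, VISUAL_CONFUSION, IN_DICTIONARY, LAYOUT, DOMAIN)
}

assert len(TABLE) == 3 * 2 * 2 * 4 * 2 == 96

signature = (confidence_bin(41.0), True, False, "mid_line", "generic")
action = TABLE[signature]
```

Because every axis is finite and the table is filled ahead of time, the per-token cost is one tuple hash and one dictionary read, independent of document size.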

Install

pip install -e .

Requirements

  • Python 3.11+
  • lxml (hOCR parsing), wordfreq (context-stage frequency scoring)

Note: the domain stage corrects phrases with a case-insensitive, boundary-aware string scan (no regular expressions, no Aho-Corasick automaton). See ARCHITECTURE.md.
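A case-insensitive, boundary-aware scan without regular expressions could look like the sketch below. It illustrates the general technique only and is not the code in the domain stage: the function name `replace_phrase` is hypothetical, a "boundary" is taken to mean a non-alphanumeric neighbor or the string edge, and the text is assumed to keep its length when lowercased (true for ASCII).

```python
def replace_phrase(text: str, wrong: str, right: str) -> str:
    """Replace whole-word, case-insensitive occurrences of `wrong`."""
    if not wrong:
        return text
    # Assumes lowercasing preserves string length (true for ASCII input).
    haystack = text.lower()
    needle = wrong.lower()
    out = []
    pos = 0
    while True:
        hit = haystack.find(needle, pos)
        if hit == -1:
            break
        end = hit + len(needle)
        # A match counts only if both neighbors are non-alphanumeric
        # or the match touches the start or end of the string.
        left_ok = hit == 0 or not text[hit - 1].isalnum()
        right_ok = end == len(text) or not text[end].isalnum()
        if left_ok and right_ok:
            out.append(text[pos:hit])
            out.append(right)
            pos = end
        else:
            # Not on a word boundary: keep the text and resume one
            # character past the start of the rejected match.
            out.append(text[pos:hit + 1])
            pos = hit + 1
    out.append(text[pos:])
    return "".join(out)


fixed = replace_phrase("Tbe court and tbe jury; tbeory", "tbe", "the")
```

Note that this sketch inserts the replacement verbatim, so the capitalized `Tbe` becomes lowercase `the`; a real implementation might carry over the original casing.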

Usage

python -m hocr_clean.cli input.hocr.html -o output.md

Architecture

7-stage pipeline (see hocr_clean/cli.py:process_file): Parse → Layout → Hyphenation → Typography → CDE Confidence → Domain → Context

See ARCHITECTURE.md for system design and PERFORMANCE.md for measured timings.
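The stage order above can be modeled as a left-to-right composition of functions. This sketch shows only the ordering: the stage bodies are placeholders that record their own name, and `make_stage`, `PIPELINE`, and `run` are hypothetical names, not the contents of `hocr_clean/cli.py`.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[dict], dict]

# Stage order as listed in the Architecture section.
STAGE_NAMES = [
    "parse", "layout", "hyphenation", "typography",
    "cde_confidence", "domain", "context",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(doc: dict) -> dict:
        # Return a new dict so each stage stays a pure transform.
        return {**doc, "trace": doc["trace"] + [name]}
    return stage


PIPELINE = [make_stage(name) for name in STAGE_NAMES]


def run(doc: dict) -> dict:
    """Apply every stage in order, feeding each output to the next."""
    return reduce(lambda d, stage: stage(d), PIPELINE, doc)


result = run({"source": "input.hocr.html", "trace": []})
```

Keeping each stage a pure function of its input makes the whole pipeline deterministic: the same hOCR input always yields the same output.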

Development

pip install -e ".[dev]"
pytest tests/
ruff check hocr_clean/
mypy hocr_clean/

Contributing

See CONTRIBUTING.md for the code review process.

License

Apache-2.0 © 2026 Jacob Coleman — See LICENSE for details.

About

A seven-stage Python pipeline that cleans Internet Archive hOCR scans into Markdown. CDE makes each token correction a deterministic O(1) table lookup. Runtime is linear in token count, at about 110 pages per second.
