
The library expansion: two surfaces, the atlas and book, and the whole of Perseus #2

Merged
adoistic merged 12 commits into main from feat/reader-engine-split-atlas on Jun 11, 2026

@adoistic

Owner

Nine commits, one day's work, three phases.

Two surfaces, one site. The reader surface (Read, Atlas, Book, About) is the default; the technical artifact lives behind /engine with the existing URLs untouched. The homepage leads with the library and closes with why it exists.

The justification layer. The Naql transmission atlas renders at /atlas (27 works, 77 transmissions, 99 people, isnad-graded, reaching the year 2026 with the project's own crossings as entries) and the book Carried Across reads at /book, thirteen chapters including two written from the corpus through the project's own MCP. Atlas and library link each other both ways. All historical claims passed a ten-agent adversarial verification pass; the test fleet's findings (including the chapters misquoting their own earlier chapters) were repaired before landing.

The whole of Perseus. Every work in the PerseusDL canonical repos with an English translation: 879 works, 8,456 chapters, with the Greek or Latin original paired beside the translation on 793 of them (First1KGreek fills where the canonical repos lack editions). Cicero's letter collections are 445 and 454 individually citable letters. Plus David Hart's public-domain shelf: Smith's Wealth of Nations (Cannan 1904), Mill's Principles, Paine's Common Sense, the 1904 Molinari (his own 2025-26 translations deliberately excluded; no license on the site).

Search at scale. SQLite FTS5 over 705k paragraphs (the plumbing the launch eng review locked): MCP queries run in 1-9ms against ~600ms for the legacy scan, Greek queries hit the Greek originals, and the legacy scan is preserved as the fallback. corpus/search.db is gitignored and rebuilt by scripts/build-search-index.ts. Pagefind: 9,592 fragments.

Merging deploys falsafa.ai. Two things to know first: @falsafa/mcp can no longer bundle this corpus in its tarball (distribution needs the CDN/lazy plan before the next publish), and the Vercel build must run scripts/build-search-index.ts or skip it gracefully (the MCP falls back to scan without it).

🤖 Generated with Claude Code

Adnan Abbasi and others added 9 commits June 11, 2026 10:50
Locks the shape for the three-front expansion: two surfaces with no URL
changes, the atlas and Carried Across as the justification layer, and the
first Perseus tranche through the existing works.json -> convert.ts
methodology via an additive perseus-works.json input.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The site now has two surfaces with no URL changes: a reader-centric
default (Read, Atlas, Book, About) and an engine room (/engine hub
linking /try, /eval, /thesis, /numbers) with a quiet crossing in the
header. Homepage gains the justification block; /about and /thesis
gain cross-links; README gains the why-paragraph and launch rows.

The Naql transmission atlas lives at /atlas: dataset under atlas/data
(CC BY 4.0), stemma renderer re-skinned to falsafa tokens via an
.atlas-scope bridge (light/dark/sepia follow the site), md siblings
and /atlas/graph.json ported, validator at scripts/validate-atlas.ts.

Carried Across lives at /book with a new closing chapter, 'Afterword:
why Falsafa': the book's argument stated as the reason this project
exists, chapter by chapter mapped to design decisions.

Review fixes folded in: chapter bylines honor the frontmatter
translator instead of hardcoding Thothica (links only Thothica's own
credits), prev/next reader nav falls back when a neighbor lacks the
current variant, /eras/ index page added (was 404 from /numbers),
footer credit made conditional, .gitignore inline-comment bug fixed so
apps/mcp/corpus/ is actually ignored, house-style pass on all new copy.

Fonts: Amiri, Noto Serif Devanagari/Hebrew, Noto Sans Syriac, IBM Plex
Mono via fontsource for the atlas original scripts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…and search

scripts/perseus/ingest.ts fetches each work's __cts__.xml from the
PerseusDL canonical repos, picks the English translation URN, credits
the real translator (Fowler, Godley, Shorey, Rackham, Lamb, Dale;
catalog-silent cases overridden in tranche-1.json: Butler rev. Perseus
for Homer, Leonard for Lucretius, Williams for the Aeneid), and
flattens TEI to chapters: book-level divisions win when present,
shorter dialogues become single chapters.

Content quality, after adversarial review: TEI <reg> gazetteer
expansions stripped (Herodotus no longer hails from Bodrum with
coordinates), prose reflowed so soft source wraps stop becoming
line-level paragraph IDs (Thucydides Book 1: 4,162 fragments -> 616
real paragraphs), words rejoined across source hyphen breaks,
punctuation-only paragraphs dropped, verse detection now requires
verse lines to dominate decisively (Republic 2-3 back to prose).
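
The reflow rules above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/perseus/: the function name `reflowParagraphs`, the blank-line paragraph convention, and the punctuation set are all assumptions made for the example. Note that the naive hyphen rule would also fuse a genuine compound that happens to break at a line end.

```typescript
// Hypothetical sketch: not the project's actual TEI flattener.
// Paragraphs that contain only punctuation and whitespace carry no text.
const PUNCT_ONLY = /^[\s.,;:!?*'"()\[\]\-\u2013\u2014\u2026\u00b7]*$/;

function reflowParagraphs(source: string): string[] {
  const paragraphs: string[] = [];
  // Assumption: a blank line is the only real paragraph boundary.
  for (const block of source.split(/\n\s*\n/)) {
    let text = "";
    for (const rawLine of block.split("\n")) {
      const line = rawLine.trim();
      if (line === "") continue;
      if (text.charAt(text.length - 1) === "-") {
        // Rejoin a word split across a source hyphen break.
        text = text.slice(0, -1) + line;
      } else {
        // A soft wrap becomes a single space, not a new paragraph.
        text = text === "" ? line : text + " " + line;
      }
    }
    // Drop empty and punctuation-only paragraphs.
    if (!PUNCT_ONLY.test(text)) paragraphs.push(text);
  }
  return paragraphs;
}
```

Treating every single newline as a soft wrap is what collapses line-level fragments into real paragraphs; verse would need the separate detection step the commit describes.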

scripts/perseus/apply.ts applies the tranche additively: corpus/
carries post-convert pipeline output (chapter splits, wiki cards,
sidecars) that regeneration would destroy, so convert.ts gained
--works/--audit/--out flags and a per-chapter translator override
instead of an implicit merge.

Licensing recorded: TEI encodings CC BY-SA 4.0, translations public
domain; every chapter carries its Scaife reader source_url.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…links

The library now holds two of the atlas's own books: Hippocrates'
Aphorisms (tr. Francis Adams) and Euclid's Elements (tr. Thomas Little
Heath), via a second Perseus tranche; the ingester now reads every
scripts/perseus/tranche-*.json.

Reconciliation runs both ways through a new optional read_slug field on
atlas works: atlas work pages link 'Read this work in the library', and
reader pages (work index and chapter pages, covering single-chapter
redirect works) link 'Trace this book's journey in the atlas'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The atlas now maps the library's own books: Homer (Pilatus for
Boccaccio, Chapman, Pope, Butler), Republic (Ficino, Jowett, Shorey),
Nicomachean Ethics (the disputed Arabic credit, Grosseteste, Rackham),
Thucydides (Valla for Nicholas V, Hobbes, Dale), Herodotus (Valla,
Godley), Lucretius (Creech, Leonard, with Poggio's 1417 find in the
work detail), Aeneid (Dryden, Williams), Manusmriti (Jones 1794
Calcutta, Buhler 1886), and four modern chains that carry the atlas
into the present: Comte's Traite de legislation, Dunoyer's Nouveau
traite, Fichte's Zuruckforderung and the Diwan-e-Ghalib, each crossing
into English at New Delhi in 2026 with David Hart and Adnan Abbasi
recorded as carriers. Dataset: 27 works, 77 transmissions, 98 people,
13 languages (French, German, Urdu added), 21 places (Florence, Paris,
Calcutta, New Delhi, Boston added), 69 sources. Year cap raised to
2100; field enum gains poetry, history, law, economics. All new works
carry read_slug, so every chain links to its readable copy and back.

The book gains two chapters written from the corpus through our own
MCP: 'Carried by mention' (Comte, Dunoyer, Fichte: books famous in
citation and absent in translation, with the wiki layer's cosine
arithmetic finding Fichte's nearest neighbor is Comte) and 'The
fascination clause' (the modern crossing: Ghalib's radif kept in
English, the AI-assisted workshop under Hunayn's Risala standard, the
carrier who can read his own atlas entry). Afterword renumbered to 12.

/about#sources now credits David Hart's Digital Library of Liberty and
Power and the Perseus Digital Library by name.

Claims authored conservatively pending the adversarial verification
pass (running); corrections land as a follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d; Hart kept credit-only

Ten research agents audited the new chains and chapters. Corrections
applied: Chapman's Odyssey tightened to 1614-15; Pilatus's Odyssey
finished by 1362; Butler's Perseus revisers named (Timothy Power and
Gregory Nagy, from the TEI headers) in corpus credits and chain notes;
the Arabic Ethics credited as composite (Ishaq books I-IV, Ustath of
al-Kindi's circle V-X; new person record); Grosseteste hedged to first
complete Latin to survive and circulate; Dale corrected to 1848-49 and
upgraded to attested; Creech qualified as first PUBLISHED complete
English (Hutchinson's manuscript stayed unprinted until 1996); the
Lucretius detail no longer implies O and Q enabled the 1417 find;
Jones reframed as proposing the project to Cornwallis himself, with
his 1785 'mercy of our pandits' letter; Republic phrased as 'no
complete Arabic translation survives'.

The big catch: complete English Ghalib translations exist (Niazi 2002,
Rahman 2003), so chapter 11 and the atlas entry now claim what is
true: the form at scale, refrains intact, originals on every line.
Chapter 10 gains the attested Heliopolis false imprint ('in the last
year of the old darkness', actually Danzig) and Mill quoting Dunoyer;
chapter 11 gains Azad fleeing the 1857 sack with Zauq's ghazals.

Hart licensing verdict: no license statement exists anywhere on
davidmhart.com, so his own 2025-26 translations stay OUT (credit-only,
permission email drafted for Adnan). scripts/hart/ingest.ts ingests
public-domain editions only; first: Molinari's Society of Tomorrow
(1904 Lee Warner translation), 32 chapters, 50k words. Corpus: 52
works. apply.ts generalized to any works/audit input pair.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ir own book

The style-and-claims tester caught the chapters inventing
back-references: phantom patrons attributed to chapter four (Hunayn
never had caliphs there; ch3 says his work was privately
commissioned), a 'rarest pattern' chapter five never claimed, a
machines-check-librarians line chapter nine never wrote, London
subscription lists the book never mentioned, and a Comte volume
undercount. All grounded now in what the chapters actually say
(the caliph's rush order for the Topics, the Bakhtishu commissions,
Cambridge's debtors' prison letter, Comte's six volumes), with
negative-parallelism and tricolon density thinned, the Comte quote's
silent truncation marked, and Zauq's radif quoted as the corpus
actually has it ('pain, it struck home').

Molinari's Society of Tomorrow now credits its real translator,
P. H. Lee Warner (1904), instead of thothica: the Hart ingester sets
per-chapter translator credits the way the Perseus one does.

All four testers otherwise green: zero broken links across 6,013
hrefs on the new surfaces, all 13 reconciliation pairs verified both
ways, pagefind serving Aphorisms and Molinari content, corpus
integrity 52/52 works with clean bodies, naql parity build green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… works

Per Adnan: nothing Hart translated himself, only what is originally
English or out-of-copyright. Ingested from the Digital Library of
Liberty and Power's Guillaumin reader editions: Paine's Common Sense
(1776 first-edition text, 4 chapters), Adam Smith's Wealth of Nations
(Cannan edition 1904, 36 chapters, 390k words) and Mill's Principles
of Political Economy (7th ed. 1871, 76 chapters, 392k words).

The Hart ingester gained an original_english mode (stored as
'original' variants, no translator credit, description without a
translation line), a heading_allow regex for the multi-pane reader
pages that mix chrome with content, and cleanup passes for Cannan-style
margin-note spans, bracketed endnote anchors, and edition page markers
([ I-6 ], [ II-410 ], [ 113 ]); the bracket inventory after ingestion
is one [English] and one numeric remnant across 780k words.
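
A cleanup pass for the edition page markers might look like the sketch below. The regex and the name `stripPageMarkers` are assumptions based only on the three marker shapes quoted above ([ I-6 ], [ II-410 ], [ 113 ]); the real ingester's patterns for margin notes and endnote anchors are not shown here.

```typescript
// Hypothetical sketch: matches "[ 113 ]" and "[ II-410 ]" style markers only.
// Assumption: markers always have whitespace inside the brackets, which keeps
// ordinary bracketed words such as "[English]" untouched.
const PAGE_MARKER = /\s*\[\s+(?:[IVXLC]+-)?\d+\s+\]/g;

function stripPageMarkers(text: string): string {
  return text
    .replace(PAGE_MARKER, "") // remove the marker and the space before it
    .replace(/ {2,}/g, " ")   // collapse any doubled spaces left behind
    .trim();
}
```

Requiring the inner whitespace is a conservative choice: it would rather leave a remnant in place than delete a bracketed editorial word.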

ChapterBody's pagefind condition now indexes native-English originals
alongside translations, matching the MCP's english search scope.
Build: 2,205 pages indexed, +116 = exactly the new chapters.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…de, FTS5 search

The corpus now carries every work in the PerseusDL canonical repos
with an English translation: 826 archive works ingested (only 5 minor
Homeric Hymns yielded no extractable chapters), joining the curated
catalog for 879 works, 8,456 chapters, ~751MB of text. 793 works pair
the source-language original with the translation; the originals come
from the canonical repos and, where those carry no edition (Didache,
the Clement epistles, the apostolic-era texts), from
OpenGreekAndLatin/First1KGreek. Greek searches hit Greek: the index
answers the Odyssey's first words in under a millisecond.

Pipeline: scripts/perseus/enumerate.ts walks blobless clones of both
canonical repos; ingest-archive.ts processes the worklist in batches
of 25 with applies after every batch, a 2.5GB disk watermark with a
resumable remainder file, catalog-driven metadata (textgroup author,
edition translator credits), and a URN-to-file fallback that recovered
Trachiniae and Athenaeus. Chapter divisions honor book, chapter, poem,
act, and now letter, which turned Cicero's two letter collections from
single 200k-word slabs into 445 and 454 individually citable letters.
Shared TEI flattening lives in scripts/perseus/lib.ts.
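
The batch-and-resume loop described above could be sketched like this. It is an illustration under assumptions: `runBatches`, `applyBatch` and `diskOk` are hypothetical names, and the real ingest-archive.ts presumably writes the remainder to a file rather than returning it.

```typescript
// Hypothetical sketch of batched ingestion with a disk watermark.
// Returns the unprocessed remainder so a later run can resume from it.
function runBatches<T>(
  worklist: T[],
  batchSize: number,
  applyBatch: (batch: T[]) => void, // ingest and apply one batch
  diskOk: () => boolean,            // false once free disk drops below the watermark
): T[] {
  let index = 0;
  while (index < worklist.length && diskOk()) {
    const batch = worklist.slice(index, index + batchSize);
    applyBatch(batch);
    index += batch.length;
  }
  return worklist.slice(index);
}
```

Checking the watermark before each batch, rather than inside it, means a batch is either fully applied or left whole in the remainder.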

Search plumbing (the SQLite FTS5 decision from the Perseus launch eng
review, now real): scripts/build-search-index.ts builds
corpus/search.db (705,114 paragraphs, 424MB, gitignored, rebuilt by
script) over the paragraph sidecars; the MCP's search_corpus uses it
through node:sqlite in production and bun:sqlite from source, phrase
match first, token AND second, legacy filesystem scan as the fallback
for index misses and case-sensitive queries. Measured: 597ms scan to
2.6ms FTS on the same query. Pagefind on the site side: 9,592 indexed
fragments, up from 2,107.
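
The query order above (phrase first, token AND second, scan as fallback) can be sketched without a database by injecting the two search functions. Everything named here (`searchCorpusSketch`, `ftsQuote`, the `Engine` labels) is hypothetical and is not the MCP's actual search_corpus code; only the FTS5 string syntax, with double-quoted strings and doubled embedded quotes, is standard.

```typescript
// Hypothetical sketch of the fallback chain; the real search_corpus may differ.
type Hit = { id: string };
type Engine = "fts5-phrase" | "fts5-and" | "scan";

// FTS5 strings are double-quoted; embedded quotes are escaped by doubling.
function ftsQuote(token: string): string {
  return '"' + token.replace(/"/g, '""') + '"';
}

function phraseQuery(query: string): string {
  return ftsQuote(query.trim());
}

function andQuery(query: string): string {
  return query.trim().split(/\s+/).map(ftsQuote).join(" AND ");
}

function searchCorpusSketch(
  query: string,
  fts: ((match: string) => Hit[]) | null, // null when no index is available
  scan: (query: string) => Hit[],         // legacy filesystem scan
  caseSensitive = false,
): { engine: Engine; hits: Hit[] } {
  // Case-sensitive queries skip the index and go straight to the scan.
  if (fts !== null && !caseSensitive) {
    const phraseHits = fts(phraseQuery(query));
    if (phraseHits.length > 0) return { engine: "fts5-phrase", hits: phraseHits };
    const andHits = fts(andQuery(query));
    if (andHits.length > 0) return { engine: "fts5-and", hits: andHits };
  }
  // Index miss, missing index, or case-sensitive query.
  return { engine: "scan", hits: scan(query) };
}
```

In a real build the `fts` callback would wrap a prepared `MATCH` statement from node:sqlite or bun:sqlite; keeping it injectable is what lets the same logic run when sqlite is unavailable.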

Tests: corpus integrity sampled at scale (zero TEI artifacts in 461
sampled variants), MCP smoke 8/8 including Greek original search,
work-scoped queries, and paragraph round-trips.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| falsafa-site | Ready | Preview, Comment | Jun 11, 2026 1:02pm |

… first run

The corpus outgrew the tarball, so the package now ships code only
(104 KB, was ~104 MB). On first run the server downloads the
corpus-2026-06-11 release asset (185 MB compressed, 879 works),
verifies its sha256, extracts to the user cache (XDG/macOS Caches),
and builds the FTS5 search index locally via node:sqlite, falling back
to the filesystem scan where sqlite is unavailable. All progress on
stderr; stdout stays MCP-clean.
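
The checksum step can be sketched with Node's built-in crypto module. The helper names `sha256Hex` and `verifyChecksum` are hypothetical, and the real server likely hashes the download as a stream rather than an in-memory buffer; this only shows the comparison itself.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helpers: not the package's actual download code.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Throws when the downloaded bytes do not match the published digest,
// so extraction never starts on a corrupt or tampered asset.
function verifyChecksum(data: Uint8Array | string, expectedHex: string): void {
  const actual = sha256Hex(data);
  if (actual !== expectedHex.toLowerCase()) {
    throw new Error("corpus checksum mismatch: expected " + expectedHex + ", got " + actual);
  }
}
```

Any progress or error reporting around this call would go through `console.error`, which keeps stdout free for MCP traffic as the commit requires.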

Cold-tested end to end from the packed tarball with an empty cache:
download, checksum, extraction, index build, then over live stdio:
initialize (0.2.0), list_works genre=Classics -> 824, search_corpus
'wrath of Achilles' -> fts5 engine, 3 hits, 1 ms, get_passage
round-trip, and Greek 'μῆνιν ἄειδε' -> the Iliad's original variant.

prepack no longer copies the corpus; READMEs tell the truth about the
download size.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…Livy mended

One regex family caused three classes of damage: some Perseus catalog
files write <ti:edition> and some write plain <edition> in the same
namespace, and the parsers only matched the prefixed form. Where a
catalog used the bare form we lost the source-language edition (54
Latin works without originals, Virgil's Eclogues among them), the
textgroup author (52 archive works credited to 'Unknown' when the
catalog knew Petronius, Phaedrus, Livy), and translator credits
(Epictetus' Fragments are Thomas Wentworth Higginson's, the Eclogues
James Rhoades's). All catalog regexes now accept both forms.
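
The shape of the fix is an optional prefix in the tag pattern. The sketch below is an assumption about how such a regex might read, not the project's actual parser: the name `editionUrns`, the reliance on a `urn` attribute, and the sample URNs in the usage are illustrative only.

```typescript
// Hypothetical sketch: accept both <ti:edition ...> and bare <edition ...>.
// "(?:ti:)?" makes the namespace prefix optional; "\b" stops "editions" matching.
const EDITION_URN = /<(?:ti:)?edition\b[^>]*\burn="([^"]+)"/g;

function editionUrns(ctsXml: string): string[] {
  const urns: string[] = [];
  EDITION_URN.lastIndex = 0; // the regex is global, so reset before reuse
  let match: RegExpExecArray | null;
  while ((match = EDITION_URN.exec(ctsXml)) !== null) {
    urns.push(match[1]);
  }
  return urns;
}
```

A parser that matched only `<ti:edition` would silently return an empty list for the bare form, which is consistent with the three kinds of loss described above all tracing back to one pattern.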

53 affected works re-ingested with corrected authors, credits and
originals; 63 stale slugs pruned (the author is part of the slug, so a
fixed author means a new address) via the new scripts/perseus/prune.ts,
including ten per-book Livy fragments that duplicated Ab urbe condita
books 1-10. Livy now stands as Titus Livius: Ab urbe condita, ten
books, 275k words, Latin paired with the Roberts translation, with the
periochae of the lost books and books 21-45 properly attributed
alongside. Corpus: 860 works, 8,426 chapters; search.db rebuilt
(709,276 paragraphs). Verified: 'Tityre tu patulae' hits the Eclogues
Latin original through FTS5.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ll 'Ancient'

The full-archive ingester labelled every Perseus work 'Ancient' as a
placeholder. A philologist agent classified all 80 authors by floruit
into five periods, an adversarial checker verified the dates against
reference scholarship (one correction: Dinarchus is early Hellenistic,
not Classical, since he kept working after 321 BCE), and the result is
applied here.

Boundaries: Classical (to 323 BCE), Hellenistic (323-31 BCE, the
Hellenistic Greek world and the Roman Republic), Imperial (31 BCE-284
CE, Augustan Rome, the high Empire, the Second Sophistic, the New
Testament and Apostolic Fathers), Late Antiquity (284-600 CE),
Medieval (after 600). Distribution: Classical 308, Hellenistic 120,
Imperial 364, Late Antiquity 26, Medieval 15; the Indic smritis keep
their own 'Ancient' (a different civilization, correctly not folded
into the Greek scheme).
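
The period boundaries above can be written as a small lookup on a signed floruit year. This is only an illustration of the boundaries: the function name is hypothetical, the actual assignments come from era-map.json as classified and checked per author, and which period owns each boundary year is an assumption here (323 BCE, 31 BCE and 284 CE are given to the later period, 600 CE to Late Antiquity).

```typescript
type Era = "Classical" | "Hellenistic" | "Imperial" | "Late Antiquity" | "Medieval";

// Hypothetical sketch. Floruit is a signed year: negative for BCE, positive for CE.
function eraForFloruit(year: number): Era {
  if (year < -323) return "Classical";     // before 323 BCE
  if (year < -31) return "Hellenistic";    // 323 BCE up to 31 BCE
  if (year < 284) return "Imperial";       // 31 BCE up to 284 CE
  if (year <= 600) return "Late Antiquity"; // 284 CE through 600 CE
  return "Medieval";                        // after 600 CE
}
```

Under these assumptions a figure still active in 321 BCE lands in Hellenistic, which matches the Dinarchus correction noted above.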

Two 'Unknown' Latin works got their real authors back along the way:
the Ecclesiastical History of the English Nation is Bede (Medieval),
and the Loeb 'Select Letters' (stoa0040.stoa011) are Augustine's
Epistulae (Late Antiquity).

scripts/perseus/reclassify-eras.ts patches index.md frontmatter and
the manifest facet from era-map.json. The homepage strip and /eras
order the new periods chronologically, each with an editorial intro.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@adoistic adoistic merged commit 9e90620 into main Jun 11, 2026
3 checks passed

1 participant