Modern philosophy: the Project Gutenberg shelf, cleaned and chapterized #3
Merged
The library held the ancients but not the moderns. This adds the Project Gutenberg philosophy shelf, deduped against the corpus (every Plato dialogue, Aristotle treatise and Boethius we already hold was dropped) and curated by an agent with an adversarial check: Descartes, Spinoza, Locke, Hume, Berkeley, Kant, Schiller, Machiavelli, Mill, Marx, Nietzsche, William James, Russell, Dewey, Bergson, plus histories of Indian, Greek and modern philosophy. 83 English works, with the German originals of Zarathustra and the Critique of Pure Reason paired alongside (the French Descartes and a few bundled multi-work editions were auto-rejected by a word-count guard).

scripts/gutenberg/ingest.ts fetches the plain-text ebooks, strips the PG header/license/footer boilerplate (verified: zero 'Project Gutenberg' strings reach any chapter body), drops the table of contents, and splits on the book's own headings with a numeral parser that collapses TOC against body, ignores stray roman sub-items when real chapter keywords exist, rejects clustered index listings, and falls back to a single clean chapter when an edition's structure is unparseable (Zarathustra's apparatus-heavy edition). On Liberty splits to 5, The Prince to 26, Beyond Good and Evil to 9, Spinoza's Ethics to 5 parts. PG italics become markdown emphasis; page and footnote markers are stripped.

Corpus: 945 works, 9,092 chapters. New Renaissance era (Machiavelli, Montaigne) added to the timeline. Project Gutenberg credited on /about#sources with its trademark notice; every chapter links to its ebook page. search.db rebuilt (764,195 paragraphs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
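As a rough illustration of what "collapses TOC against body" could mean, the sketch below parses chapter numerals and, when the same numeral appears more than once, keeps only the last occurrence on the assumption that the table of contents precedes the body. The `Heading` type, the function names, and the `CHAPTER|PART|BOOK` keyword list are hypothetical and are not taken from ingest.ts.

```typescript
// Hypothetical sketch; names and patterns are assumptions, not the real ingest.ts.
interface Heading {
  line: number;    // index of the heading line in the text
  numeral: number; // chapter number parsed from the heading
}

const ROMAN: Record<string, number> = { I: 1, V: 5, X: 10, L: 50, C: 100, D: 500, M: 1000 };

// Parse an uppercase roman numeral such as "XIV"; returns null for anything else.
// Permissive: it does not reject malformed numerals like "IIII".
function parseRoman(token: string): number | null {
  if (!/^[IVXLCDM]+$/.test(token)) return null;
  let total = 0;
  for (let i = 0; i < token.length; i++) {
    const cur = ROMAN[token.charAt(i)];
    const next = i + 1 < token.length ? ROMAN[token.charAt(i + 1)] : 0;
    total += cur < next ? -cur : cur; // subtractive notation: IV, IX, XL ...
  }
  return total;
}

// Find lines that look like "CHAPTER IV" or "PART 2".
function findHeadings(lines: string[]): Heading[] {
  const out: Heading[] = [];
  lines.forEach((text, line) => {
    const m = /^\s*(?:CHAPTER|PART|BOOK)\s+([IVXLCDM]+|\d+)\b/.exec(text);
    if (!m) return;
    const numeral = /^\d+$/.test(m[1]) ? Number(m[1]) : parseRoman(m[1]);
    if (numeral !== null) out.push({ line, numeral });
  });
  return out;
}

// Keep only the last occurrence of each numeral (TOC entries come first).
function collapseToc(headings: Heading[]): Heading[] {
  const seen: Record<number, boolean> = {};
  const kept: Heading[] = [];
  for (let i = headings.length - 1; i >= 0; i--) {
    const h = headings[i];
    if (seen[h.numeral]) continue;
    seen[h.numeral] = true;
    kept.push(h);
  }
  return kept.reverse(); // restore document order
}
```

A keep-last rule of this kind trusts every heading-shaped line, so a stray later mention of an earlier numeral would displace the real heading; any real implementation needs extra guards against that.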
…pinoza Ethics
The text-quality audit caught what the first pass missed. Fixed in the
cleaner and re-ingested:
- older PG plain-text footers ('End of Project Gutenberg's X, by Y')
that have no *** END *** marker (leaked into 17 works)
- 'Produced by ... HTML version by ...' credit blocks and
'[Transcriber's Note ...]' apparatus (16 works)
- residual HTML tags (<i>1</i> in Locke), stray entities, and
multi-paragraph and plural '[Footnotes ...]' notes
- ASCII-art table/diagram underscore runs in the logic works (Carroll's
Symbolic Logic) — both the rules and the mostly-non-letter blocks they
sat in are dropped
- Spinoza's Ethics was opening on Part II because a 'Part I' mentioned
inside Part V's prose hijacked the table-of-contents dedup; headings
whose trailing text is lowercase prose rather than an uppercase title
are now rejected, so the five parts lead in order (I Concerning God
first, opening on the Definitions)
Verified: zero Project Gutenberg / Produced by / Transcriber / HTML /
[Footnote / underscore-run hits across all 85 works. Corpus 945 works,
9,067 chapters; search.db rebuilt (761,829 paragraphs).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
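The Spinoza fix above (rejecting headings whose trailing text is lowercase prose) might look something like the following minimal sketch. The function name, the keyword list, and the 0.6 uppercase-ratio threshold are all assumptions made for illustration; the actual check in the cleaner is not shown in this PR.

```typescript
// Hypothetical sketch of a "title-like heading" test; not the real cleaner code.
// Accepts "PART I. CONCERNING GOD" or a bare "PART V";
// rejects prose cross-references such as "Part I, as we showed above, ...".
function isTitleLikeHeading(line: string): boolean {
  const m = /^\s*(?:PART|Part|CHAPTER|Chapter|BOOK|Book)\s+(?:[IVXLCDM]+|\d+)\b[.:,]?\s*(.*)$/.exec(line);
  if (!m) return false;

  const trailing = m[1].trim();
  if (trailing === "") return true; // bare heading, nothing to judge

  const letters = trailing.replace(/[^A-Za-z]/g, "");
  if (letters.length === 0) return true; // only punctuation or digits follow

  const upperCount = letters.replace(/[^A-Z]/g, "").length;
  // Assumed threshold: mostly-uppercase trailing text counts as a title.
  return upperCount / letters.length >= 0.6;
}
```

Under these assumptions a title-case heading such as "Part I. Concerning God" would also be rejected, so the threshold only suits editions that set their part titles in capitals.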
…hy shelf
search_corpus now orders FTS5 hits by relevance (ORDER BY rank) instead of rowid order, so the best matches come first across the whole 945-work corpus. The first-run download points at the corpus-2026-06-11b release (945 works, new sha256). Cold-tested end to end from the packed 0.2.1 tarball: download, checksum, index build, then list_works and a ranked philosophy search over live stdio.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
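For readers unfamiliar with FTS5 ranking, here is a small sketch of the kind of query the change implies. Only `ORDER BY rank` comes from the commit message; the `buildSearchSql` helper, the table name, and the selected columns are hypothetical. In SQLite FTS5, `rank` is a hidden column that defaults to the bm25 score, and ascending order puts the best matches first.

```typescript
// Hypothetical helper; the real search_corpus query is not shown in this PR.
function buildSearchSql(table: string, limit: number): string {
  // The table name is interpolated, so restrict it to a plain identifier.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(table)) {
    throw new Error("unsafe table name: " + table);
  }
  if (Math.floor(limit) !== limit || limit <= 0) {
    throw new Error("limit must be a positive integer");
  }
  return [
    "SELECT rowid, rank",
    "FROM " + table,
    "WHERE " + table + " MATCH ?", // the search text is bound as a parameter
    "ORDER BY rank",               // relevance order instead of rowid order
    "LIMIT " + limit,
  ].join("\n");
}
```

The returned string would be prepared and executed with whichever SQLite driver the MCP server uses, passing the user's query as the single bound parameter.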
The library held the ancients; this brings the moderns.
83 philosophy works from Project Gutenberg's philosophy shelf, deduped against the corpus (every Plato dialogue, Aristotle treatise and Boethius we already hold was dropped) and curated by an agent with an adversarial fact-check: Descartes, Spinoza, Locke, Hume, Berkeley, Kant, Schiller, Machiavelli, Mill, Marx, Nietzsche, William James, Russell, Dewey, Bergson, plus histories of Indian, Greek and modern philosophy. The German originals of Zarathustra and the Critique of Pure Reason are paired alongside.
Cleaning. A new ingester strips the PG header/license/footer boilerplate (including the older "End of Project Gutenberg's X" footers), "Produced by" and "Transcriber's Note" apparatus, HTML tags and entities, multi-paragraph footnotes, page markers, and ASCII-art table runs; it drops the table of contents and splits on each book's own headings with a numeral parser that collapses TOC against body, rejects prose cross-references mistaken for headings (which had mis-split Spinoza's Ethics), and falls back to a single clean chapter when an edition is unparseable. A two-agent test pass verified zero boilerplate/markup leaks across all works and correct openings for The Prince, Kant, Hume, Mill, Nietzsche and Spinoza.
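To make the header/footer stripping concrete, a minimal sketch follows. It assumes the common Project Gutenberg plain-text conventions (a `*** START OF ... ***` line, a `*** END OF ... ***` line, and the older "End of Project Gutenberg's X" footer mentioned above); the regular expressions and the `stripGutenbergBoilerplate` name are illustrative guesses, not the ingester's actual patterns, and the sketch does not cover credit blocks, HTML tags, or footnotes.

```typescript
// Hypothetical sketch; patterns are assumptions based on common PG file layout.
const START_MARKER = /^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$/im;
const END_MARKER = /^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$/im;
// Older files may lack the *** END *** line and close with a prose footer instead.
const OLD_FOOTER = /^End of (?:the )?Project Gutenberg(?:'s)? .*$/im;

function stripGutenbergBoilerplate(raw: string): string {
  let text = raw.replace(/\r\n/g, "\n");

  // Drop everything up to and including the start marker, if present.
  const start = START_MARKER.exec(text);
  if (start) text = text.slice(start.index + start[0].length);

  // Cut at whichever footer marker appears first.
  let cut = text.length;
  for (const re of [END_MARKER, OLD_FOOTER]) {
    const m = re.exec(text);
    if (m && m.index < cut) cut = m.index;
  }
  return text.slice(0, cut).trim();
}
```

A leak check like the one described above could then assert that no cleaned chapter body contains the string "Project Gutenberg".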
Corpus: 945 works, 9,067 chapters. New Renaissance era on the timeline (Machiavelli, Montaigne). Project Gutenberg credited on /about#sources with its trademark notice; every chapter links to its ebook page.
MCP 0.2.1. FTS search now ranks results by relevance. The first-run corpus download points at the new corpus-2026-06-11b release (945 works), cold-tested end to end.
Merging deploys falsafa.ai. After merge, npm publish in apps/mcp ships 0.2.1.
🤖 Generated with Claude Code