
Modern philosophy: the Project Gutenberg shelf, cleaned and chapterized #3

Merged
adoistic merged 3 commits into main from feat/gutenberg-philosophy on Jun 11, 2026
Conversation

@adoistic

Owner

The library held the ancients; this brings the moderns.

83 philosophy works from Project Gutenberg's philosophy shelf, deduped against the corpus (every Plato dialogue, Aristotle treatise and Boethius we already hold was dropped) and curated by an agent with an adversarial fact-check: Descartes, Spinoza, Locke, Hume, Berkeley, Kant, Schiller, Machiavelli, Mill, Marx, Nietzsche, William James, Russell, Dewey, Bergson, plus histories of Indian, Greek and modern philosophy. The German originals of Zarathustra and the Critique of Pure Reason are paired alongside.

Cleaning. A new ingester strips the PG header, license and footer boilerplate (including the older "End of Project Gutenberg's X" footers), "Produced by" and "Transcriber's Note" apparatus, HTML tags and entities, multi-paragraph footnotes, page markers, and ASCII-art table runs. It drops the table of contents and splits on each book's own headings with a numeral parser that collapses TOC entries against body headings and rejects prose cross-references mistaken for headings (these had mis-split Spinoza's Ethics). When an edition's structure is unparseable, it falls back to a single clean chapter. A two-agent test pass verified zero boilerplate or markup leaks across all works and correct openings for The Prince, Kant, Hume, Mill, Nietzsche and Spinoza.
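
As a rough illustration of the header/footer stripping described above, here is a minimal sketch. It is not the code in scripts/gutenberg/ingest.ts: the function name `stripBoilerplate` and the three regular expressions are assumptions, modelled on the common `*** START OF ... ***` / `*** END OF ... ***` markers and the older "End of Project Gutenberg's X" footer mentioned in this PR.

```typescript
// Hypothetical sketch; names and patterns are assumptions, not the PR's code.
const START_MARKER = /^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$/im;
const END_MARKER = /^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$/im;
// Older plain-text footer that has no *** END *** marker.
const OLD_FOOTER = /^End of (?:the )?Project Gutenberg(?:['’]s)? .*$/im;

function stripBoilerplate(raw: string): string {
  let text = raw.replace(/\r\n/g, "\n");

  // Drop everything up to and including the start marker line, if present.
  const start = START_MARKER.exec(text);
  if (start) text = text.slice(start.index + start[0].length);

  // Cut at whichever footer marker appears first.
  const cuts = [END_MARKER, OLD_FOOTER]
    .map((re) => re.exec(text)?.index)
    .filter((i): i is number => i !== undefined);
  if (cuts.length > 0) text = text.slice(0, Math.min(...cuts));

  return text.trim();
}
```

A real cleaner would also need the credit-block, HTML and footnote passes listed above; this sketch covers only the two boundary cuts.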

Corpus: 945 works, 9,067 chapters. New Renaissance era on the timeline (Machiavelli, Montaigne). Project Gutenberg credited on /about#sources with its trademark notice; every chapter links to its ebook page.

MCP 0.2.1. FTS search now ranks results by relevance. The first-run corpus download points at the new corpus-2026-06-11b release (945 works), cold-tested end to end.

Merging deploys falsafa.ai. After merge, npm publish in apps/mcp ships 0.2.1.

🤖 Generated with Claude Code

Adnan Abbasi and others added 3 commits June 11, 2026 19:34
The library held the ancients but not the moderns. This adds the
Project Gutenberg philosophy shelf, deduped against the corpus (every
Plato dialogue, Aristotle treatise and Boethius we already hold was
dropped) and curated by an agent with an adversarial check: Descartes,
Spinoza, Locke, Hume, Berkeley, Kant, Schiller, Machiavelli, Mill,
Marx, Nietzsche, William James, Russell, Dewey, Bergson, plus
histories of Indian, Greek and modern philosophy. 83 English works,
with the German originals of Zarathustra and the Critique of Pure
Reason paired alongside (the French Descartes and a few bundled
multi-work editions were auto-rejected by a word-count guard).

scripts/gutenberg/ingest.ts fetches the plain-text ebooks, strips the
PG header/license/footer boilerplate (verified: zero 'Project
Gutenberg' strings reach any chapter body), drops the table of
contents, and splits on the book's own headings with a numeral parser
that collapses TOC against body, ignores stray roman sub-items when
real chapter keywords exist, rejects clustered index listings, and
falls back to a single clean chapter when an edition's structure is
unparseable (Zarathustra's apparatus-heavy edition). On Liberty splits
to 5, The Prince to 26, Beyond Good and Evil to 9, Spinoza's Ethics to
5 parts. PG italics become markdown emphasis; page and footnote markers
are stripped.
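
One plausible reading of "collapses TOC against body" is sketched below: when the same chapter number appears as a heading more than once, keep only its last occurrence, on the assumption that the table of contents precedes the body. The `Chapter` type, the `HEADING` pattern and both function names are illustrative assumptions, not the ingester's actual code.

```typescript
// Hypothetical sketch of numeral-based splitting with TOC collapse.
interface Chapter {
  heading: string;
  body: string;
}

// Assumed heading shape: "CHAPTER IV" or "PART 4", optionally followed by a title.
const HEADING = /^(?:CHAPTER|PART|BOOK)\s+([IVXLC]+|\d+)\b\.?\s*(.*)$/;

function romanToInt(s: string): number {
  const values: Record<string, number> = { I: 1, V: 5, X: 10, L: 50, C: 100 };
  let total = 0;
  for (let i = 0; i < s.length; i++) {
    const cur = values[s[i]];
    const next = values[s[i + 1]] ?? 0;
    // Subtractive notation: a smaller numeral before a larger one is negative.
    total += cur < next ? -cur : cur;
  }
  return total;
}

function splitChapters(text: string): Chapter[] {
  const lines = text.split("\n");
  // Remember the LAST line on which each chapter number appears as a heading,
  // so a TOC entry is superseded by the matching body heading.
  const lastSeen = new Map<number, number>();
  lines.forEach((line, idx) => {
    const m = HEADING.exec(line.trim());
    if (!m) return;
    const n = /^\d+$/.test(m[1]) ? Number(m[1]) : romanToInt(m[1]);
    lastSeen.set(n, idx);
  });

  // Fallback: no parseable structure, so return one clean chapter.
  if (lastSeen.size === 0) return [{ heading: "", body: text.trim() }];

  const starts = [...lastSeen.values()].sort((a, b) => a - b);
  return starts.map((start, i) => ({
    heading: lines[start].trim(),
    body: lines.slice(start + 1, starts[i + 1] ?? lines.length).join("\n").trim(),
  }));
}
```

A last-occurrence rule like this is fragile when later prose mentions an earlier heading, which is why some additional heading validation is needed in practice.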

Corpus: 945 works, 9,092 chapters. New Renaissance era (Machiavelli,
Montaigne) added to the timeline. Project Gutenberg credited on
/about#sources with its trademark notice; every chapter links to its
ebook page. search.db rebuilt (764,195 paragraphs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pinoza Ethics

The text-quality audit caught what the first pass missed. Fixed in the
cleaner and re-ingested:
- older PG plain-text footers ('End of Project Gutenberg's X, by Y')
  that have no *** END *** marker (leaked into 17 works)
- 'Produced by ... HTML version by ...' credit blocks and
  '[Transcriber's Note ...]' apparatus (16 works)
- residual HTML tags (<i>1</i> in Locke), stray entities, and
  multi-paragraph and plural '[Footnotes ...]' notes
- ASCII-art table/diagram underscore runs in the logic works (Carroll's
  Symbolic Logic): both the rules and the mostly-non-letter blocks they
  sat in are dropped
- Spinoza's Ethics was opening on Part II because a 'Part I' mentioned
  inside Part V's prose hijacked the table-of-contents dedup; headings
  whose trailing text is lowercase prose rather than an uppercase title
  are now rejected, so the five parts lead in order (I Concerning God
  first, opening on the Definitions)
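
The heading-rejection rule in the last bullet can be sketched as a small predicate. This is a hedged illustration only: `isRealHeading` and the `CANDIDATE` pattern are hypothetical names, and the rule shown (accept a heading only when its trailing text contains no lowercase letters) is a strict reading of "uppercase title" that a real cleaner may relax.

```typescript
// Hypothetical filter: accept "PART I. CONCERNING GOD", reject a prose line
// such as "Part I. of this work showed ..." that merely mentions a part.
const CANDIDATE = /^(?:PART|CHAPTER|BOOK)\s+(?:[IVXLC]+|\d+)\b[.:]?\s*(.*)$/i;

function isRealHeading(line: string): boolean {
  const m = CANDIDATE.exec(line.trim());
  if (!m) return false;

  const trailing = m[1].trim();
  if (trailing === "") return true; // bare "PART V"

  // Compare only the letters, ignoring punctuation and digits.
  const letters = trailing.replace(/[^A-Za-z]/g, "");
  if (letters === "") return true;

  // Any lowercase letter in the trailing text marks it as prose.
  return letters === letters.toUpperCase();
}
```

Editions that set their titles in title case would fail this strict check, so the threshold is a design choice rather than a given.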

Verified: zero Project Gutenberg / Produced by / Transcriber / HTML /
[Footnote / underscore-run hits across all 85 works. Corpus 945 works,
9,067 chapters; search.db rebuilt (761,829 paragraphs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…hy shelf

search_corpus now orders FTS5 hits by relevance (ORDER BY rank) instead
of rowid order, so the best matches come first across the whole 945-work
corpus. The first-run download points at the corpus-2026-06-11b release
(945 works, new sha256). Cold-tested end to end from the packed 0.2.1
tarball: download, checksum, index build, then list_works and a ranked
philosophy search over live stdio.
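
For readers unfamiliar with FTS5 ranking, the change above amounts to a query of roughly the following shape. The table name `paragraphs_fts`, the column `body` and the helper `toFtsQuery` are assumptions for illustration; only `MATCH` and `ORDER BY rank` are taken from the commit message and the SQLite FTS5 documentation.

```typescript
// Hypothetical query text; table and column names are assumed.
// In FTS5, ORDER BY rank returns the best matches first
// (by default rank is the bm25 score, where lower is better).
const SEARCH_SQL = `
  SELECT rowid, body
  FROM paragraphs_fts
  WHERE paragraphs_fts MATCH ?
  ORDER BY rank
  LIMIT ?`;

// Quote each whitespace-separated term so user input is treated as plain
// terms rather than FTS5 query syntax; embedded double quotes are doubled.
function toFtsQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map((t) => `"${t.replace(/"/g, '""')}"`)
    .join(" ");
}
```

The statement would be prepared once and run with the quoted query and a limit as its two parameters, using whichever SQLite binding the MCP server already depends on.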

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 11, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| falsafa-site | Ready | Preview, Comment | Jun 11, 2026 2:39pm |

@adoistic adoistic merged commit 2f34feb into main Jun 11, 2026
3 checks passed