LoreGraph

knowledge graphs that quote the page they came from

LoreGraph turns a novel, play, screenplay, or libretto into a queryable knowledge graph — characters, objects, events and concepts, the typed relations between them, and the facts each implies — where every single claim is anchored to a literal span of the source text.

No hallucinated edges. No "trust me." Click any relationship and you land on the exact sentence it came from.

Apache-2.0 · Python 3.11+ · 8-pass pipeline · multi-LLM · DeepSeek V4 Pro default · multilingual · Alpha

English · 简体中文

Why · What you get · Pipeline · Quick start · Architecture · Corpus · Roadmap

flowchart LR
    SRC([book.txt]):::io
    subgraph EXTRACT["EXTRACTION · every claim is anchored to a literal source span"]
        direction LR
        P1["1 · Chunk"]:::det
        P2["2 · Entity"]:::llm
        P3["3 · Resolve"]:::llm
        P4["4 · Coref"]:::det
        P5["5 · Relations"]:::llm
        P6["6 · GLUCOSE"]:::llm
        P7["7 · Verify"]:::gate
        P8["8 · Note"]:::llm
        P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> P7 --> P8
    end
    GRAPH[("knowledge graph<br/>Postgres · pgvector")]:::store
    SRC --> P1
    P8 --> GRAPH

    classDef io fill:#f3ecdd,stroke:#8a6d3b,color:#3a2d18
    classDef det fill:#e3d3ad,stroke:#8a6d3b,color:#3a2d18
    classDef llm fill:#c9a253,stroke:#6b5126,color:#2a2008
    classDef gate fill:#b34a2f,stroke:#6e2616,color:#ffffff
    classDef store fill:#33414f,stroke:#1a232c,color:#eef2ff

1, 4 are deterministic · 2, 3, 5, 6, 8 are LLM passes · 7 is the ≥95% literal-match verification gate.

Why LoreGraph

Most "knowledge graph from text" tools extract triples and ask you to trust them. For fiction that's fatal: a graph that quietly invents a relationship is worse than no graph. LoreGraph is built on one non-negotiable rule:

Every extracted claim carries an evidence_span — a literal substring of the source — and a chain-of-verification pass rejects any claim whose span isn't a ≥95% literal match.

It is also closed-world: the model may only use the text in front of it. It is told, explicitly, to forget what it knows about the real "Elizabeth Bennet" or "孫悟空" and report only what this book says.

And it is multilingual end-to-end: the 85-work reference corpus spans English, 中文, Русский, Deutsch, Français, Italiano, 日本語, Ελληνικά, and more. Source text stays in its original script; entity resolution runs on a multilingual embedding model so that "林黛玉" / "颦儿" or "the Dark Lord" / "Voldemort" merge into one node even with zero string overlap.

What you get

For every book, a web reading-room with five linked views — all driven by the same evidence-anchored graph:

View	What it shows
📖 Reader	The original text, with every entity mention highlighted and clickable.
🕸 Graph	A force-directed character/object/event/concept network. Hover any edge → the source line.
⏳ Timeline	The story's events in reading order (the graph carries story-time on every fact).
📇 Index	Searchable entity catalogue with per-entity profile cards.
💬 Ask	Question-answering grounded in the graph — every answer cites its evidence.

Each entity gets a structured Hybrid Note: [CONTEXT] [FACTS] [INFERENCES] [BEHAVIOR_PATTERN] [GAPS] [EVIDENCE] — facts kept strictly separate from inferences, every inference tagged with a confidence level.

The pipeline

A book flows through eight passes (diagram above). Chunking is deterministic; the rest are LLM calls that all route through one hardened client.

#	Pass	What it does
1	Chunk	Deterministic, chapter-aware splitter (English and CJK "第N回" headers). Stamps each chunk with a global story-time position.
2	Entity	Extracts typed mentions (Agent / Object / Event / Concept) with a literal evidence span. Uses gleaning (a "what did you miss?" retry) for recall.
3	Resolve	Production entity resolution: lexical + embedding-kNN blocking → batched LLM matching → connected components → a black-hole-prevention sanity pass. Merges aliases across scripts.
4	Coref	Links every mention to its canonical entity.
5	Relations	Five typed relations (STRUCTURAL / INTERACTS / ASSERTS / INFLUENCES / PREDICTS) + a predicate, weight and sentiment, each with evidence.
6	GLUCOSE	Implicit commonsense facts (cause / emotion / location / possession / attribute) about each entity.
7	Verify	Chain-of-verification: drops any claim whose evidence span isn't a literal match. Hard ≥95% gate.
8	Note	Synthesises the per-entity Hybrid Note, assigns a subtype and an importance tier.

Production engineering (researched against Splink / ComEM / GraphRAG and the Anthropic + OpenRouter docs):

Prompt caching on the stable system prompt — on Anthropic models (direct or anthropic/* via OpenRouter) repeat calls reuse the cached prefix (≈10× cheaper input); other providers fall back to their own server-side caching.
Bounded-parallel per-chunk LLM calls (≈10× faster than sequential) with retry + backoff + jitter.
Per-pass commits + idempotent re-runs — a failed pass resumes with --from N; nothing double-writes.
Provider-agnostic client: DeepSeek V4 Pro via OpenRouter by default — one key, any mainstream model — swappable to 15+ backends (Anthropic, OpenAI, DeepSeek, Gemini, local, …).

Quick start

# 1. Install (uv)
uv sync

# 2. Postgres 16+ with pgvector
createdb loregraph && psql loregraph -c "CREATE EXTENSION IF NOT EXISTS vector;"
uv run alembic upgrade head

# 3. Configure (.env)
cp .env.example .env        # set LOREGRAPH_LLM_PROVIDER + your API key
                            # default: openrouter + deepseek/deepseek-v4-pro

# 4. Ingest a public-domain text and extract its graph
uv run loregraph ingest path/to/book.txt --title "Pride and Prejudice" --author "Jane Austen" --language en
uv run loregraph extract --book-id 1        # runs passes 1–8
uv run loregraph status --book-id 1         # pass-by-pass progress, tokens, cost

# 5. See it
uv run loregraph view                        # optional local FastAPI dev server (the public site is static)

Cost & speed. Every call's tokens and cost land in pass_runs.stats. A mid-size novel runs in minutes, not hours, thanks to concurrency + caching. A per-book budget ceiling (LOREGRAPH_COST_CEILING_USD, default $20) is enforced between passes.

Architecture

flowchart TB
    SRC([book.txt]):::io

    subgraph PIPE["loregraph.pipeline · 8 passes wired by orchestrator.py"]
        EX["chunk · entity · resolve · coref<br/>relations · GLUCOSE · verify · note"]:::llm
    end

    subgraph SVC["loregraph.llm · the only path to a model"]
        direction LR
        CLIENT["hardened client<br/>prompt-cache · retries · 15+ providers"]:::accent
        EMBED["multilingual embedder<br/>e5-large · 1024-dim · local"]:::accent
    end

    DB[("loregraph.db · Postgres + pgvector<br/>entities · edges · facts · notes · chunks")]:::store
    EXP["scripts/export_book.py"]:::det
    JSON[/"data/exports/&lt;book&gt;.json"/]:::io
    WEB["loregraph.web · reading-room<br/>Reader · Graph · Timeline · Index · Ask"]:::web

    SRC --> PIPE
    PIPE <--> SVC
    PIPE --> DB
    DB --> EXP --> JSON --> WEB

    classDef io fill:#f3ecdd,stroke:#8a6d3b,color:#3a2d18
    classDef det fill:#e3d3ad,stroke:#8a6d3b,color:#3a2d18
    classDef llm fill:#c9a253,stroke:#6b5126,color:#2a2008
    classDef accent fill:#d9b978,stroke:#6b5126,color:#2a2008
    classDef store fill:#33414f,stroke:#1a232c,color:#eef2ff
    classDef web fill:#7a5230,stroke:#3a2614,color:#ffffff

src/loregraph/pipeline/ — the passes, one module each, wired by orchestrator.py.
src/loregraph/llm/ — the single LLM client (caching, retries, multi-provider) + the local multilingual embedder.
src/loregraph/db/ — SQLAlchemy 2.0 schema + async repository. Migrations under migrations/.
src/loregraph/web/ — FastAPI API + the landing / reading-room front-end.
Full design in docs/architecture.md.

The corpus

LoreGraph ships with a reference set of 85 canonical works — novels, plays, operas and early films across 11 languages (Pride and Prejudice · 西游记 · Crime and Punishment · Faust · Les Misérables · …).

Copyright is respected, strictly. Source texts are never committed — data/books/ is git-ignored. Only derived metadata (the graph, short fair-use evidence spans, profile notes) is published, and full reading text is embedded only for public-domain works. In-copyright works are processed locally and surfaced as graph + analysis only.

Configuration

Variable	Default	Notes
`LOREGRAPH_LLM_PROVIDER`	`openrouter`	`anthropic`, `openai`, `deepseek`, `ollama`, … (15+)
`LOREGRAPH_LLM_MODEL`	preset per provider	OpenRouter preset = `deepseek/deepseek-v4-pro`
`LOREGRAPH_EMBED_MODEL`	`intfloat/multilingual-e5-large`	local, 1024-dim, multilingual
`DATABASE_URL`	local Postgres	must use the async `asyncpg` driver
`LOREGRAPH_COST_CEILING_USD`	`20`	per-book ceiling, enforced between passes (`0` disables)
`LOREGRAPH_PRICE_INPUT_PER_MTOK` · `…_OUTPUT_PER_MTOK`	DeepSeek V4 Pro	token prices for the cost estimate

Status & roadmap

Alpha. The extraction engine is production-hardened and the reference corpus is being processed.

8-pass evidence-anchored extraction pipeline
Production entity resolution (embedding blocking + batched matching + transitivity guard)
Multilingual (source text + embeddings + UI), prompt caching, concurrency, resumable runs
Web reading-room: Reader · Graph · Timeline · Index · Ask
Narrative-time graph — watch relationships evolve across the story (a slider for "as of chapter N")
Community / faction layer — auto-detected families, factions, subplots
Cross-book meta-graph — archetypes & influence linking all 85 works
Grounded character chat — "talk to" a character; answers cited + spoiler-aware
Quality scoring — per-book confidence / coverage metrics beyond the literal-match gate

Development

uv run --extra dev ruff format && uv run --extra dev ruff check   # lint + format
uv run --extra dev python -m mypy src                             # types
uv run --extra dev python -m pytest -m unit                       # fast unit tests
uv run --extra dev python -m pytest -m integration                # Postgres testcontainer + mocked LLM

Conventions live in CLAUDE.md; the per-pass spec is in docs/8-pass-pipeline.md.

License

Apache 2.0. Source texts are not part of this repository; the reference corpus is assembled locally from public sources (Project Gutenberg, Wikisource, …) at ingest time.

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
.claude		.claude
.github		.github
data/exports		data/exports
docs		docs
examples		examples
migrations		migrations
scripts		scripts
src/loregraph		src/loregraph
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
alembic.ini		alembic.ini
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
render.yaml		render.yaml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LoreGraph

knowledge graphs that quote the page they came from

Why LoreGraph

What you get

The pipeline

Quick start

Architecture

The corpus

Configuration

Status & roadmap

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LoreGraph

knowledge graphs that quote the page they came from

Why LoreGraph

What you get

The pipeline

Quick start

Architecture

The corpus

Configuration

Status & roadmap

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages