Hybrid RAG with LLM Wiki Foundation

A three-level retrieval system over an LLM-curated knowledge pipeline.

Architecture · Pipeline · Retrieval Levels · Data · Evaluation · Demo notebook · Reproduction

Why this project

Most RAG systems retrieve from raw document chunks. This one retrieves from a substrate that has already been curated.

The substrate is an LLM wiki applied at system scale. The pattern follows Andrej Karpathy's LLM wiki concept (April 2026): an agent that ingests raw sources, synthesizes a structured knowledge base, and maintains it via lint operations. Karpathy described it for personal notes. This project applies it to corporate-scale IT support tickets — noisy, multi-author input where the harder engineering problems live.

On top of the substrate, three retrieval layers serve different query shapes:

L1 — Raw vector retrieval. Returns ticket snippets for thematic queries.
L2 — LLM Wiki. Returns curated extracts with citations for canonical-procedure queries.
L3 — Knowledge Graph. Returns answers from graph traversal for connected-reasoning queries.

An LLM router classifies each query and routes to the right layer. Source labels make the routing visible.

Architecture

Three independent retrieval layers. Each has its own storage. Each has its own response format. Each has its own privacy posture.

Layer	Storage	Returns	Access
L1 — Raw vector	pgvector + BM25	Ticket snippets with metadata	Query UI only
L2 — LLM Wiki	Markdown pages, hierarchical by domain	Wiki extract with citation	Query UI or read wiki
L3 — Knowledge Graph	Property graph store — 5 entities, 4 relations	Natural-language answer with wiki citations	Query UI only

Layers collaborate at build time. They serve in parallel at query time. The UI is a privacy boundary, not just a convenience surface.

→ Full detail: docs/architecture.md

Pipeline

Five LLM-driven operations across the three layers. Build flow cascades. Query flow fans out via the router.

Build flow. L1 embeds tickets. L2 ingest pulls new tickets in batches, pre-grouped by L1 similarity. L2 lint aggregates, deduplicates, and resolves conflicts into hierarchical wiki pages. L3 graph-build extracts entities and relationships from the linted wiki.

Query flow. User query → LLM router → routes to L1, L2, or L3 → labeled result to UI.
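The query flow above can be sketched as a classify-then-dispatch step that attaches a source label to every result. This is a minimal sketch, not the project's router: `keyword_router` is a hypothetical keyword heuristic standing in for the LLM classifier, and `LabeledResult`, `answer`, and `HANDLERS` are illustrative names that do not come from the repo.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LabeledResult:
    source: str   # "L1", "L2", or "L3"; surfaced in the UI as the source label
    payload: str  # whatever the chosen layer returned


def keyword_router(query: str) -> str:
    """Hypothetical stand-in for the LLM router: a crude keyword heuristic."""
    q = query.lower()
    if "failed" in q or "passed" in q:
        return "L3"  # observations about checks suggest diagnostic reasoning
    if "should" in q:
        return "L2"  # asks for canonical guidance
    return "L1"      # default: thematic search over raw tickets


def answer(
    query: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
) -> LabeledResult:
    """Route the query to exactly one layer and label the result with it."""
    layer = classify(query)
    if layer not in handlers:
        raise ValueError(f"router returned unknown layer: {layer!r}")
    return LabeledResult(source=layer, payload=handlers[layer](query))


# Placeholder handlers; real ones would call the L1, L2, and L3 retrievers.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "L1": lambda q: "ticket snippets for: " + q,
    "L2": lambda q: "wiki extract for: " + q,
    "L3": lambda q: "graph answer for: " + q,
}

result = answer("What should support check for AD lockout?", keyword_router, HANDLERS)
print(f"[{result.source}] {result.payload}")
```

Passing the classifier in as a function keeps the dispatch testable without an LLM call; swapping in the real router would only change the `classify` argument.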

The hardest operation is lint. It turns 745 noisy tickets into coherent wiki pages. Two threshold guardrails control LLM cost and prevent category proliferation:

Pre-grouping similarity threshold. Tickets must clear a similarity bar (via L1 embeddings) to enter the same LLM-judged batch. Cheap vector similarity narrows the search space before expensive LLM judgment.
Category-creation similarity floor. When the LLM proposes a new wiki category, it must be sufficiently dissimilar from existing categories. Prevents LLM category proliferation, a known failure mode.
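A rough sketch of how the two guardrails could work, assuming cosine similarity over L1 embeddings. The greedy seed-based grouping, the function names, and the threshold values are all assumptions made for illustration; the actual lint implementation may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pregroup(embeddings: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Guardrail 1 (assumed greedy variant): a ticket joins the first group
    whose seed ticket it matches at or above `threshold`; otherwise it starts
    a new group. Each group would become one LLM-judged batch."""
    groups: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def allow_new_category(
    candidate: Sequence[float],
    existing: Sequence[Sequence[float]],
    similarity_limit: float,
) -> bool:
    """Guardrail 2: accept a proposed category only if it stays below
    `similarity_limit` against every existing category embedding."""
    return all(cosine(candidate, e) < similarity_limit for e in existing)


# Toy 2-d embeddings; real ones would be 1536-dim vectors from L1.
tickets = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(pregroup(tickets, threshold=0.8))
print(allow_new_category([0.9, 0.1], existing=[[1.0, 0.0]], similarity_limit=0.85))
```

Both checks are plain vector arithmetic, so they run before any LLM call and cap how many batches and categories the lint judge ever sees.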

When the lint judge can't reconcile a conflict, the cluster routes to a pending_review.jsonl queue for human review. Human action is binary: accept or reject the LLM's recommendation. No general editing.
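The review queue could look something like the sketch below. Only the file name `pending_review.jsonl` and the binary accept/reject action come from the description above; the record fields (`cluster_id`, `recommendation`, `decision`) and both function names are assumptions.

```python
import json
from pathlib import Path
from typing import Dict, List


def enqueue_conflict(queue_path: str, cluster_id: str, recommendation: str) -> None:
    """Append one unresolved cluster to the JSONL queue (one JSON object per line)."""
    record = {"cluster_id": cluster_id, "recommendation": recommendation, "decision": None}
    with Path(queue_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def review(queue_path: str, cluster_id: str, accept: bool) -> List[Dict]:
    """Record a binary human decision. The reviewer cannot edit the
    recommendation itself, only accept or reject it."""
    path = Path(queue_path)
    records = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    for rec in records:
        if rec["cluster_id"] == cluster_id:
            rec["decision"] = "accept" if accept else "reject"
    # Rewrite the whole file; fine for a small queue, not for concurrent writers.
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
    return records
```

Taking `accept` as a boolean makes the "no general editing" rule structural: there is no code path through which a reviewer can change the recommendation text.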

→ Full detail: docs/pipeline.md

Retrieval levels

Each layer answers a different shape of question. The router decides which.

Query shape	Example	Routes to	Returns
Thematic "what" over recent tickets	"What VPN problems in the last 30 days?"	L1	Ticket snippets with metadata
Canonical procedure or guidance	"What should support check for AD lockout?"	L2	Wiki extract with citation
Diagnostic reasoning from observations	"Lockout check failed but token check passed — what's wrong?"	L3	Answer from graph traversal
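To make the L3 row concrete, here is a toy sketch of diagnostic reasoning as graph traversal: each observation points to the root causes it is consistent with, and the answer is the set that every observation agrees on. The `INDICATES` edges, the check names, and the root-cause labels are invented for illustration; the project's actual five entities and four relations are not specified in this README.

```python
from typing import Dict, List, Set, Tuple

Observation = Tuple[str, str]  # (check name, status), e.g. ("token_check", "passed")

# Hypothetical edges: observation -> root causes consistent with it.
INDICATES: Dict[Observation, Set[str]] = {
    ("lockout_check", "failed"): {"stale_cached_credential", "expired_password"},
    ("token_check", "passed"): {"stale_cached_credential", "vpn_profile_mismatch"},
}


def diagnose(observations: List[Observation]) -> Set[str]:
    """Return the root causes consistent with every observation.
    An observation with no edges yields an empty result, not a guess."""
    candidate_sets = [INDICATES.get(obs, set()) for obs in observations]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


print(diagnose([("lockout_check", "failed"), ("token_check", "passed")]))
```

Intersecting candidate sets is the simplest possible traversal; a real property-graph query would also carry the wiki citations attached to each edge.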

→ Full detail: docs/retrieval_levels.md

Data

The corpus is a synthetic ITSM dataset: 745 records across 14 indexed issue families. The eval design reserves a held-out family for the NEG-001 negative-case contract (planned, not yet designated in the corpus generator). The data files themselves live on Hugging Face and Kaggle (not in this repo) — pull them with datasets.load_dataset(...) at runtime.

A representative record is checked in at data/example_record.json so eval contracts can be read alongside concrete data without spinning up the dataset loader. For corpus stats, issue family glossary, and pull instructions, see data/README.md.

For a guided walkthrough of the dataset's design properties — phrasing variation, status tuples as observation fingerprints, and root cause clustering — see notebooks/dataset_exploration.ipynb.

Evaluation

Most RAG evaluations fall into a trap: an LLM invents the question and the golden answer, then another LLM grades the response. The corpus never enters the comparison. Failure modes hide behind stylistic agreement.

This project takes a different path. The corpus is the ground truth. A design-time field mapping fixes which corpus fields encode the eval-expectation input and which encode the eval-answer. The LLM is constrained to produce corpus-shaped output and is judged against actual corpus content. Same field on both sides.
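A minimal sketch of the "same field on both sides" idea, assuming a record is a flat mapping. The field names `symptom` and `root_cause` are hypothetical, and normalized exact match is a deliberately simple stand-in for the project's LLM judge.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FieldMapping:
    expectation_field: str  # corpus field that supplies the eval input
    answer_field: str       # corpus field the model output is judged against


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not fail a case."""
    return " ".join(text.lower().split())


def judge(record: Mapping[str, str], model_output: Mapping[str, str], mapping: FieldMapping) -> bool:
    """Compare the model's value for the answer field against the corpus value
    for the same field. The golden answer is never LLM-generated."""
    expected = normalize(str(record[mapping.answer_field]))
    actual = normalize(str(model_output.get(mapping.answer_field, "")))
    return expected == actual


# Hypothetical record and mapping, fixed at design time.
mapping = FieldMapping(expectation_field="symptom", answer_field="root_cause")
record = {"symptom": "User locked out after password change", "root_cause": "Stale cached credential"}
print(judge(record, {"root_cause": "stale  cached credential"}, mapping))
```

Because the expected value is read straight from the corpus record, a fluent but wrong answer cannot pass on style alone.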

Eleven contracts across five issue families: three per layer (L1, L2, L3), plus one router contract and one negative case. Each contract targets a distinct capability, such as broad similarity, filter respect, aggregation under fragmentation, branching causal reasoning, or cross-family generalization. Volume comes from rerunning each contract, not from adding more contracts.
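Rerunning a contract for volume might look like the sketch below: draw a fresh sample from the corpus on each run and report the pass rate. `rerun_contract` and its parameters are assumptions, and the contract is modeled as a plain callable that returns pass or fail for a sample.

```python
import random
from typing import Callable, Sequence


def rerun_contract(
    contract: Callable[[Sequence], bool],
    corpus: Sequence,
    runs: int,
    sample_size: int,
    seed: int,
) -> float:
    """Run one contract `runs` times on fresh samples and return the pass rate.
    A fixed seed makes the sequence of samples reproducible."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(runs):
        sample = rng.sample(list(corpus), sample_size)  # sampled without replacement
        if contract(sample):
            passes += 1
    return passes / runs


# Toy corpus of record ids and a trivial contract that always passes.
toy_corpus = list(range(20))
print(rerun_contract(lambda sample: len(sample) == 5, toy_corpus, runs=10, sample_size=5, seed=0))
```

Using a local `random.Random(seed)` rather than the module-level generator keeps each contract's sampling independent of anything else that touches `random`.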

→ Full methodology: eval/

Reproduction

Not yet runnable end to end as of v0.9. The repo is in the design phase. The setup block below works today; the python -m commands describe the intended interface and will become runnable as the L1, L2, L3, and eval modules are implemented. See Project status for what's wired up.

Environment setup (works today)

uv is the canonical workflow. A requirements.txt is exported for pip users.

git clone https://github.com/ameau01/llm-wiki-knowledge-engine.git
cd llm-wiki-knowledge-engine

# Option A — uv (recommended). Creates .venv and installs deps from pyproject.toml + uv.lock.
uv sync                              # runtime deps only
uv sync --extra dev --extra eval     # add dev tools + eval extras

# Option B — pip fallback. Install from the exported requirements.txt.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configuration
cp .env.example .env                 # then fill in OPENAI_API_KEY, LANGSMITH_API_KEY

# Local services (pgvector now; Neo4j is optional and commented out in the compose file)
docker compose up -d

To regenerate requirements.txt after editing pyproject.toml:

uv pip compile pyproject.toml -o requirements.txt

Build and run (planned interface)

The commands below describe the target CLI surface. They will become runnable as each module lands.

# Pull corpus from Hugging Face and build L1 index
uv run python -m src.l1_raw.embed_and_index

# Build L2 wiki (ingest → lint)
uv run python -m src.l2_wiki.ingest
uv run python -m src.l2_wiki.lint

# Build L3 graph from linted wiki
uv run python -m src.l3_graph.build

# Run the eval pipeline
uv run python -m eval.src.run

# Launch the query UI
uv run uvicorn src.ui.app:app --reload

Reproducibility design: each run will write a manifest with model versions, seeds, prompt hashes, and the dataset snapshot used. The runtime that produces these manifests is part of the eval pipeline work in progress.
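One way such a manifest could be assembled, as a sketch only: the key names and the `build_manifest` helper are assumptions, and the dataset snapshot is passed in as an opaque string because the README does not say how snapshots are identified.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, Mapping


def prompt_hash(prompt: str) -> str:
    """SHA-256 of the prompt text, so prompt changes are detectable without storing prompts."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def build_manifest(
    models: Mapping[str, str],
    seed: int,
    prompts: Mapping[str, str],
    dataset_snapshot: str,
) -> Dict:
    """Collect the four items named above into one JSON-serializable dict."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "models": dict(models),
        "seed": seed,
        "prompt_hashes": {name: prompt_hash(text) for name, text in prompts.items()},
        "dataset_snapshot": dataset_snapshot,
    }


manifest = build_manifest(
    models={"llm": "gpt-4o-mini", "embeddings": "text-embedding-3-small"},
    seed=7,
    prompts={"router": "Classify the query as L1, L2, or L3."},  # placeholder prompt
    dataset_snapshot="snapshot-id-placeholder",                  # placeholder value
)
print(json.dumps(manifest, indent=2))
```

Hashing prompts rather than embedding them keeps the manifest small while still making two runs with different prompts distinguishable.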

Stack

Component	Choice
Orchestration	LlamaIndex
Embeddings	OpenAI `text-embedding-3-small` (1536-dim)
LLM (lint judge, router, graph extraction)	GPT-4o-mini
Vector store	pgvector
Keyword search	BM25, fused via LlamaIndex hybrid retriever
Graph store	Property graph (Neo4j or in-process)
Observability	LangSmith
UI	FastAPI + lightweight static frontend
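For readers unfamiliar with hybrid retrieval, the sketch below shows reciprocal rank fusion (RRF), one common way to merge a vector ranking with a BM25 ranking. It is written in plain Python for illustration and is not a claim about which fusion mode the project configures in LlamaIndex; the function name and the ticket ids are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids.
    Each list contributes 1 / (k + rank) per document, with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["t1", "t2", "t3"]  # hypothetical ids from the vector search
bm25_hits = ["t2", "t4", "t1"]    # hypothetical ids from BM25
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.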

Project status

Honest snapshot as of late May 2026:

Component	Status
Synthetic corpus (745 records, 14 indexed issue families)	✅ Published to Hugging Face + Kaggle
Eval contracts (11 cases)	✅ Locked
Eval pipeline design (design-time / runtime)	✅ Specified
L1 — embed-and-index	🔨 In progress
L2 — ingest / lint / query	🔨 In progress
L3 — graph-build	⏳ Pending
Eval pipeline runner	⏳ Pending
UI with LLM router	⏳ Pending
Held-out issue family (for NEG-001)	⏳ Blocked on corpus generator update

Schedule slipped ~7 days due to illness. Ship target: May 31, 2026.

Honest limitations

Single domain, synthetic corpus. Findings will not transfer cleanly to other domains or to real ticket data without re-evaluation.
Eleven contracts define the structural scope, not statistical power. Volume is meant to come from rerunning each contract with fresh samples, which is not yet wired up.
NEG-001 is BLOCKED. A held-out issue family needs to be designated in the corpus generator before graceful-abstention behavior can be evaluated end-to-end. Disclosed openly rather than worked around.
GPT-4o-mini is used for both generation and judging. Single-judge evaluation is a known weakness, and a different judge model would likely change scores.

What this project is not

Not an agentic system. No tool use, no action harnesses. Those belong to a separate project that could consume this wiki as its knowledge source.
Not a frontier-model showcase. The architecture is the point, not the model.
Not a heavyweight knowledge graph. Five entities, four relations. Domain-scoped, not ontology engineering.
Not multi-source. ITSM ticket data only. Multi-source ingest is future work.

Documentation

Design documentation: architecture.md, readme.md

License

MIT. See LICENSE. The synthetic corpus is also MIT-licensed on Hugging Face and Kaggle.

Citation

@misc{llm_wiki_knowledge_engine_2026,
  title  = {LLM Wiki Knowledge Engine},
  author = {Alexander Meau},
  year   = {2026},
  version = {0.1.0}
}
