A three-level retrieval system over an LLM-curated knowledge pipeline.
Most RAG systems retrieve from raw document chunks. This one retrieves from a substrate that has already been curated.
The substrate is an LLM wiki applied at system scale. The pattern follows Andrej Karpathy's LLM wiki concept (April 2026): an agent that ingests raw sources, synthesizes a structured knowledge base, and maintains it via lint operations. Karpathy described it for personal notes. This project applies it to corporate-scale IT support tickets — noisy, multi-author input where the harder engineering problems live.
On top of the substrate, three retrieval layers serve different query shapes:
- L1 — Raw vector retrieval. Returns ticket snippets for thematic queries.
- L2 — LLM Wiki. Returns curated extracts with citations for canonical-procedure queries.
- L3 — Knowledge Graph. Returns answers from graph traversal for connected-reasoning queries.
An LLM router classifies each query and routes to the right layer. Source labels make the routing visible.
Three independent retrieval layers. Each has its own storage. Each has its own response format. Each has its own privacy posture.
| Layer | Storage | Returns | Access |
|---|---|---|---|
| L1 — Raw vector | pgvector + BM25 | Ticket snippets with metadata | Query UI only |
| L2 — LLM Wiki | Markdown pages, hierarchical by domain | Wiki extract with citation | Query UI or read wiki |
| L3 — Knowledge Graph | Property graph store (5 entities, 4 relations) | Natural-language answer with wiki citations | Query UI only |
Layers collaborate at build time. They serve in parallel at query time. The UI is a privacy boundary, not just a convenience surface.
→ Full detail: docs/architecture.md
Five LLM-driven operations across the three layers. Build flow cascades. Query flow fans out via the router.
Build flow. L1 embeds tickets. L2 ingest pulls new tickets in batches, pre-grouped by L1 similarity. L2 lint aggregates, deduplicates, and resolves conflicts into hierarchical wiki pages. L3 graph-build extracts entities and relationships from the linted wiki.
Query flow. User query → LLM router → routes to L1, L2, or L3 → labeled result to UI.
The hardest operation is lint. It turns 745 noisy tickets into coherent wiki pages. Two threshold guardrails control LLM cost and prevent category proliferation:
- Pre-grouping similarity threshold. Tickets must clear a similarity bar (via L1 embeddings) to enter the same LLM-judged batch. Cheap vector similarity narrows the search space before expensive LLM judgment.
- Category-creation similarity floor. When the LLM proposes a new wiki category, it must be sufficiently dissimilar from existing categories. Prevents LLM category proliferation, a known failure mode.
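
The two guardrails above can be sketched as plain cosine-similarity checks. This is a minimal illustration, not the project's lint implementation: the greedy seed-based grouping, the function names, and the threshold values used in the test are all assumptions made for the example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pregroup(embeddings: list[Vector], threshold: float) -> list[list[int]]:
    """Guardrail 1 (assumed greedy variant): a ticket joins the first group
    whose seed ticket it matches at or above `threshold`; otherwise it
    starts a new group. Only tickets in the same group would be sent to
    the LLM judge together."""
    groups: list[list[int]] = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def may_create_category(candidate: Vector, existing: list[Vector], threshold: float) -> bool:
    """Guardrail 2: allow a proposed category only if its similarity to
    every existing category stays below `threshold`."""
    return all(cosine(candidate, e) < threshold for e in existing)
```

Both checks run on embeddings that L1 has already computed, so they add no LLM calls. The appropriate threshold values would have to be tuned against the corpus.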
When the lint judge can't reconcile a conflict, the cluster routes to a pending_review.jsonl queue for human review. Human action is binary: accept or reject the LLM's recommendation. No general editing.
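
A sketch of how the review queue could work, assuming one JSON object per line. The record fields (`cluster_id`, `recommendation`, `reason`, `decision`) and both function names are hypothetical; the source only specifies the file name and the binary accept/reject action.

```python
import json
from pathlib import Path


def enqueue_for_review(queue_path: Path, cluster_id: str,
                       recommendation: str, reason: str) -> None:
    """Append one unresolved cluster to the JSONL queue (hypothetical schema)."""
    record = {
        "cluster_id": cluster_id,
        "recommendation": recommendation,  # what the lint judge proposes
        "reason": reason,                  # why it could not reconcile
        "decision": None,                  # filled in by the human reviewer
    }
    with queue_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def record_decision(queue_path: Path, cluster_id: str, accept: bool) -> int:
    """Mark undecided records for `cluster_id` as accept or reject.

    The reviewer cannot edit the recommendation, only approve or decline it.
    Returns the number of records updated.
    """
    updated = 0
    out = []
    for line in queue_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["cluster_id"] == cluster_id and rec["decision"] is None:
            rec["decision"] = "accept" if accept else "reject"
            updated += 1
        out.append(json.dumps(rec))
    queue_path.write_text("\n".join(out) + ("\n" if out else ""), encoding="utf-8")
    return updated
```

Keeping the decision to a single boolean means the queue stays auditable: every wiki change is either the LLM's recommendation as written or nothing.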
→ Full detail: docs/pipeline.md
Each layer answers a different shape of question. The router decides which.
| Query shape | Example | Routes to | Returns |
|---|---|---|---|
| Thematic "what" over recent tickets | "What VPN problems in the last 30 days?" | L1 | Ticket snippets with metadata |
| Canonical procedure or guidance | "What should support check for AD lockout?" | L2 | Wiki extract with citation |
| Diagnostic reasoning from observations | "Lockout check failed but token check passed — what's wrong?" | L3 | Answer from graph traversal |
→ Full detail: docs/retrieval_levels.md
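
The routing table above can be sketched as a dispatch step with a pluggable classifier. In the project the classifier is an LLM; the `keyword_classifier` below is only a hypothetical stand-in so the example runs offline, and the handler functions and the `RoutedResult` type are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RoutedResult:
    source: str   # layer label shown in the UI: "L1", "L2", or "L3"
    payload: str  # whatever that layer returned


def keyword_classifier(query: str) -> str:
    """Hypothetical stand-in for the LLM router. Crude keyword rules only."""
    q = query.lower()
    if " but " in q or "what's wrong" in q:
        return "L3"  # diagnostic reasoning from observations
    if "should" in q or "how do" in q:
        return "L2"  # canonical procedure or guidance
    return "L1"      # thematic query over tickets


def route(query: str,
          classify: Callable[[str], str],
          handlers: dict[str, Callable[[str], str]]) -> RoutedResult:
    """Classify the query, call the matching layer, and label the result."""
    layer = classify(query)
    if layer not in handlers:
        raise ValueError(f"classifier returned unknown layer: {layer!r}")
    return RoutedResult(source=layer, payload=handlers[layer](query))
```

Because `classify` is just a callable, an LLM-backed classifier can replace the keyword stub without touching `route`, and the `source` field carries the label that makes the routing visible.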
The corpus is a synthetic ITSM dataset: 745 records across 14 indexed issue families. The eval design reserves a held-out family for the NEG-001 negative-case contract (planned, not yet designated in the corpus generator). The data files themselves live on Hugging Face and Kaggle (not in this repo) — pull them with datasets.load_dataset(...) at runtime.
A representative record is checked in at data/example_record.json so eval contracts can be read alongside concrete data without spinning up the dataset loader. For corpus stats, issue family glossary, and pull instructions, see data/README.md.
For a guided walkthrough of the dataset's design properties — phrasing variation, status tuples as observation fingerprints, and root cause clustering — see notebooks/dataset_exploration.ipynb.
Most RAG evaluations fall into a trap: an LLM invents the question and the golden answer, then another LLM grades the response. The corpus never enters the comparison. Failure modes hide behind stylistic agreement.
This project takes a different path. The corpus is the ground truth. A design-time field mapping fixes which corpus fields encode the eval-expectation input and which encode the eval-answer. The LLM is constrained to produce corpus-shaped output and is judged against actual corpus content. Same field on both sides.
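
One way to picture the field mapping: a fixed declaration of which record fields form the question and which field holds the expected answer, with scoring done on that same field. The field names (`description`, `root_cause`) and the normalized exact-match scorer are assumptions for illustration; the actual mapping and scoring rules live in the eval design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    input_fields: tuple[str, ...]  # corpus fields that form the eval input
    answer_field: str              # corpus field that holds the expected answer


def _norm(value: object) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(str(value).lower().split())


def build_case(record: dict, mapping: FieldMapping) -> tuple[dict, str]:
    """Derive (inputs, expected answer) from a corpus record, with no LLM involved."""
    inputs = {f: record[f] for f in mapping.input_fields}
    return inputs, record[mapping.answer_field]


def score(prediction: dict, record: dict, mapping: FieldMapping) -> bool:
    """Compare the model's corpus-shaped output to the record on the same field."""
    predicted = prediction.get(mapping.answer_field, "")
    return _norm(predicted) == _norm(record[mapping.answer_field])
```

The point of the shape is that neither the question nor the golden answer is written by an LLM: both come straight from the record.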
Eleven contracts. Five issue families. Three per layer (L1, L2, L3) plus router and negative case. Each contract targets a distinct capability — broad similarity, filter respect, aggregation under fragmentation, branching causal reasoning, cross-family generalization. Volume comes from rerunning each contract, not from adding more.
→ Full methodology: eval/
Not yet runnable as of v0.9. The repo is in the design phase. The setup block below works today; the python -m commands describe the intended interface and will land as the L1 / L2 / L3 / eval modules are implemented. See Project status for what's wired up.
uv is the canonical workflow. A requirements.txt is exported for pip users.
```bash
git clone https://github.com/ameau01/llm-wiki-knowledge-engine.git
cd llm-wiki-knowledge-engine

# Option A — uv (recommended). Creates .venv and installs deps from pyproject.toml + uv.lock.
uv sync                             # runtime deps only
uv sync --extra dev --extra eval    # add dev tools + eval extras

# Option B — pip fallback. Install from the exported requirements.txt.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configuration
cp .env.example .env                # then fill in OPENAI_API_KEY, LANGSMITH_API_KEY

# Local services (pgvector now; Neo4j optional, commented in compose file)
docker compose up -d
```

To regenerate requirements.txt after editing pyproject.toml:
```bash
uv pip compile pyproject.toml -o requirements.txt
```

The commands below describe the target CLI surface. They will become runnable as each module lands.
```bash
# Pull corpus from Hugging Face and build L1 index
uv run python -m src.l1_raw.embed_and_index

# Build L2 wiki (ingest → lint)
uv run python -m src.l2_wiki.ingest
uv run python -m src.l2_wiki.lint

# Build L3 graph from linted wiki
uv run python -m src.l3_graph.build

# Run the eval pipeline
uv run python -m eval.src.run

# Launch the query UI
uv run uvicorn src.ui.app:app --reload
```

Reproducibility design: each run will write a manifest with model versions, seeds, prompt hashes, and the dataset snapshot used. The runtime that produces these manifests is part of the eval pipeline work in progress.
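
Since the manifest runtime is not implemented yet, the following is only a sketch of what such a manifest might contain. The key names, the SHA-256 choice for prompt hashes, and the `build_manifest` signature are assumptions, not the project's schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_hash(prompt: str) -> str:
    """Stable fingerprint of a prompt template (SHA-256 is an assumed choice)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def build_manifest(models: dict[str, str], seed: int,
                   prompts: dict[str, str], dataset_snapshot: str) -> dict:
    """Collect everything needed to reproduce a run into one JSON-serializable dict."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "models": dict(models),
        "seed": seed,
        "prompt_hashes": {name: prompt_hash(text)
                          for name, text in sorted(prompts.items())},
        "dataset_snapshot": dataset_snapshot,  # e.g. a dataset revision identifier
    }


def write_manifest(manifest: dict, path: str) -> None:
    """Persist the manifest next to the run's outputs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

Hashing prompts rather than storing them keeps the manifest small while still revealing when a prompt changed between two runs.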
| Component | Choice |
|---|---|
| Orchestration | LlamaIndex |
| Embeddings | OpenAI text-embedding-3-small (1536-dim) |
| LLM (lint judge, router, graph extraction) | GPT-4o-mini |
| Vector store | pgvector |
| Keyword search | BM25, fused via LlamaIndex hybrid retriever |
| Graph store | Property graph (Neo4j or in-process) |
| Observability | LangSmith |
| UI | FastAPI + lightweight static frontend |
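
To illustrate what fusing vector and BM25 results means, here is a small reciprocal rank fusion (RRF) sketch. RRF is one common fusion technique and is used here purely as an example; it is not taken from this project and is not a reproduction of LlamaIndex's hybrid retriever. The ticket IDs and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); scores are summed across lists. Ties are broken by ID so the
    output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["t1", "t2", "t3"]  # hypothetical pgvector ranking
bm25_hits = ["t2", "t4", "t1"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that rank well in both lists rise to the top, which is the behavior a hybrid retriever is meant to provide for queries that mix exact terms with paraphrase.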
Honest snapshot as of late May 2026:
| Component | Status |
|---|---|
| Synthetic corpus (745 records, 14 indexed issue families) | ✅ Published to Hugging Face + Kaggle |
| Eval contracts (11 cases) | ✅ Locked |
| Eval pipeline design (design-time / runtime) | ✅ Specified |
| L1 — embed-and-index | 🔨 In progress |
| L2 — ingest / lint / query | 🔨 In progress |
| L3 — graph-build | ⏳ Pending |
| Eval pipeline runner | ⏳ Pending |
| UI with LLM router | ⏳ Pending |
| Held-out issue family (for NEG-001) | ⏳ Blocked on corpus generator update |
Schedule slipped ~7 days due to illness. Ship target: May 31, 2026.
- Single domain, synthetic corpus. Findings will not transfer cleanly to other domains or to real ticket data without re-evaluation.
- Eleven contracts define the structural scope, not a statistical sample. Volume will come from rerunning each contract with fresh samples, which is not yet wired up.
- NEG-001 is BLOCKED. A held-out issue family needs to be designated in the corpus generator before graceful-abstention behavior can be evaluated end-to-end. Disclosed openly rather than worked around.
- GPT-4o-mini for both generation and judging. Single-judge eval is a known weakness. A different judge model would likely change scores.
- Not an agentic system. No tool use, no action harnesses. Those belong to a separate project that could consume this wiki as its knowledge source.
- Not a frontier-model showcase. The architecture is the point, not the model.
- Not a heavyweight knowledge graph. Five entities, four relations. Domain-scoped, not ontology engineering.
- Not multi-source. ITSM ticket data only. Multi-source ingest is future work.
- Design documentation: architecture.md, readme.md
MIT. See LICENSE.
The synthetic corpus is also MIT-licensed on Hugging Face and Kaggle.
```bibtex
@misc{llm_wiki_knowledge_engine_2026,
  title = {LLM Wiki Knowledge Engine},
  author = {Alexander Meau},
  year = {2026},
  version = {0.1.0}
}
```

