An agentic dungeon crawler demonstrating four distinct LLM roles orchestrated against a deterministic game engine. The LLM is the dungeon master. This version stops at 1st floor.
Project 1 showed where not to trust LLMs (constrained narration only). Project 1 link. Project 2 shows where you can — through tool-calling, RAG, and persona dialog.
Project 1 demonstrated how to make LLM output reliable by treating the model as a service behind a strict API contract. The same discipline applies here — every LLM call still goes through a typed contract with validation and fallback. But this framing is no longer the end goal of the project.
We want to give the LLM real agency in a small, controlled way. Instead of one role (narration), the system orchestrates four: a tool-calling agent that decides what populates each room, a constrained narrator, a persona-driven NPC dialog system, and a retrieval-augmented oracle for world lore. Each role expands what the LLM is allowed to do while a different mechanism keeps it bounded — a tool catalog, a Pydantic schema, a knowledge scope, or a vector store. The engine remains the source of truth, but the LLM is no longer purely a presentation layer. It can act, decide, and reason — within deliberately drawn limits.
- Tool-calling agent — DM agent decides what populates each room via Python tools, never by mutating state directly
- RAG-grounded lore oracle — answers world questions from a local vector store, refuses gracefully when knowledge isn't there
- Persona-driven NPC dialog — two NPCs with distinct voices and bounded knowledge scopes
- Constrained narrator — every action gets atmospheric prose, validated against constraints (HP leak detection, length, format)
- Multi-role orchestration — four LLMs with one job each, no role bleed
- Full observability — every prompt, tool call, raw response, and eval result is traceable
- Local ML pipeline — ChromaDB + sentence-transformers, no cloud dependency for retrieval
Player action → Engine resolves deterministically
↓
┌──────────┼──────────┐
↓ ↓ ↓
Narrator DM Agent NPC dialog
(prose) (tools) (persona)
↓
RAG oracle
(lore queries)
| LLM Role | Pattern | Purpose |
|---|---|---|
| DM Agent | Tool-calling | Populates rooms via spawn_enemy, add_item_to_room, describe_room |
| Narrator | Structured JSON output | 2-3 sentence prose per action, with eval |
| NPC Dialog | Persona injection + history | In-character conversations within knowledge bounds |
| Lore Oracle | RAG | Answers world questions from local lore docs |
The engine never asks the LLM what should happen. The LLM only requests changes through tools, narrates outcomes, voices characters, or retrieves lore.
dnd_agentic_llm/
│
├── engine/ # Pure Python — no LLM
│ ├── state.py # All Pydantic models
│ ├── combat.py # Graduated d20 combat
│ ├── actions.py # Move, use, cast, flee
│ ├── map.py # 3x3 dungeon grid
│ └── inventory.py # Items + equipment
│
├── llm/
│ ├── dm_agent.py # Tool-calling room populator
│ ├── narrator.py # Constrained JSON narration
│ ├── npc.py # Persona-driven dialog
│ ├── rag.py # ChromaDB + sentence-transformers
│ ├── memory.py # Rolling summarization
│ └── evaluator.py # Output constraint checks
│
├── ui/
│ └── app.py # Streamlit CRPG-style interface
│
├── game/
│ └── loop.py # CLI fallback
│
├── data/
│ ├── lore/ # Markdown source documents
│ │ ├── world_lore.md
│ │ ├── bestiary.md
│ │ ├── item_glossary.md
│ │ └── npcs_and_rooms.md
│ └── chroma_db/ # Auto-generated vector store
│
├── tests/ # 30+ unit tests
└── cli.py # CLI entry point
git clone https://github.com/yourname/dnd-agentic-llm.git
cd dnd-agentic-llm
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -r requirements.txtSet your API key in .env:
ANTHROPIC_API_KEY=your_key_here
Get one at console.anthropic.com.
Run the UI:
streamlit run ui/app.pyRun the CLI fallback:
python cli.pyFirst run will download the embedding model (~80MB) and build the vector index from the lore files. Takes ~30 seconds.
pytest tests/ -vTests cover the deterministic engine, tool execution, RAG chunking, evaluator, and NPC personas. LLM calls are not unit tested — they're validated via the in-app debug panel.
DM Agent populates rooms when the player first enters them. It cannot mutate state directly — only request changes through tools:
spawn_enemy(room_id="1_1", enemy_type="goblin", count=2)
add_item_to_room(room_id="1_1", item_id="rusty_sword")
describe_room(room_id="1_1", description="Dust motes drift through...")Narrator turns engine output into prose with strict constraints:
{
"narration": "The blade finds its mark. The goblin staggers, wounded.",
"tone": "tense",
"success": true
}NPCs speak from injected personas with bounded knowledge. Aldric the merchant talks like a nervous trader; Garron the veteran guard speaks in clipped military phrasing. Both refuse to discuss things outside their knows_about scope, in character.
Lore Oracle answers world questions via RAG. If the answer isn't in the retrieved chunks: "The chronicles do not speak of this." No hallucination. The LLM is constrained with prompt engineering (no hard mechanism). In this case, the LLM still has access to its knowledge but is required to ONLY use the retrieved lore chunks.
Every LLM call is logged to a debug panel:
- Which prompt version produced what output
- Which tools the DM agent called
- Whether each narration passed evaluation (HP leak, sentence count, format)
- Token usage and conversation summaries
API failures retry with exponential backoff. If retries exhaust, deterministic fallback narration fires. The game never crashes on a bad LLM response.
- LLM never modifies directly state — only via tool calls, narration, or dialog
- Engine resolves all mechanics — dice rolls, damage, movement
- Narrator never invents facts — describes only what the engine produced
- NPCs stay in scope — refuse cleanly when asked outside their knowledge
- RAG oracle stays grounded — answers only from retrieved chunks (is constrained to lore)
- Player owns agency — movement, item use, casting, talking, fighting are always player choices
- DM agent owns the world — what's in rooms, who NPCs are, what events trigger
- Middle and deeper layer content (already sketched in lore)
- Trading system so Aldric can actually sell items
- More NPCs (Mira the Sage, Lethiel of the Shards, Thimble the Fae)
- Enemy AI variation (some flee, some call reinforcements)
- Save/load runs
- Multi-session memory across dungeon runs
Using claude-haiku-4-5-20251001:
| Activity | Approximate cost |
|---|---|
| One full dungeon run (20-30 actions) | < $0.05 |
| Full evening of testing | < $0.50 |
Local RAG (ChromaDB + sentence-transformers) costs nothing per query.
anthropic
streamlit
pydantic
python-dotenv
chromadb
sentence-transformers
pytest
Generate from your env:
pip freeze > requirements.txtThe hard part wasn't the LLM, it was working around it. Most of the work went into state controls, schemas, and the evaluator — not prompt engineering. The LLM ends up behaving like any other unreliable service, and you can treat it like one.
Splitting work across multiple specialized LLMs beat one general-purpose call. Four narrow roles produced better output than one prompt trying to do everything, and the prompts themselves got simpler.
Prompt-only grounding worked, but I wouldn't trust it for higher stakes. Telling the oracle to "only use this context" was effective in practice for this project. I didn't stress-test it adversarially though, and for anything correctness-critical I'd want an enforceable mechanism, not a polite instruction. Observability paid for itself fast. Logging every prompt, tool call, and eval verdict turned "that response was weird" into reproducible bugs. Day-one work on any future LLM project.
Rethink the UI from scratch, not improve Streamlit. Most of my testing happened in the CLI, and it honestly felt better than the Streamlit interface. Streamlit is great for inspecting evaluator output and debug data — but a real frontend for this kind of game isn't a repurposed dashboard, it's a proper game UI. That's a different project.
Author the lore corpus for retrieval, not for humans. Right now it reads as natural prose, then gets chunked. Writing it as self-contained passages with consistent entity naming would likely improve grounding without touching any prompts.
Reach for a real agent framework on a v3. Hand making the orchestration was the right call for learning. But features like persistent memory across sessions, self-correction and more would justify using something more sophisticated.