⚔ DnD Agentic LLM — The Dungeon of Ludele

An agentic dungeon crawler demonstrating four distinct LLM roles orchestrated against a deterministic game engine. The LLM is the dungeon master. This version stops at 1st floor.

Project 1 showed where not to trust LLMs (constrained narration only). Project 1 link. Project 2 shows where you can — through tool-calling, RAG, and persona dialog.

Project 1 demonstrated how to make LLM output reliable by treating the model as a service behind a strict API contract. The same discipline applies here — every LLM call still goes through a typed contract with validation and fallback. But this framing is no longer the end goal of the project.

We want to give the LLM real agency in a small, controlled way. Instead of one role (narration), the system orchestrates four: a tool-calling agent that decides what populates each room, a constrained narrator, a persona-driven NPC dialog system, and a retrieval-augmented oracle for world lore. Each role expands what the LLM is allowed to do while a different mechanism keeps it bounded — a tool catalog, a Pydantic schema, a knowledge scope, or a vector store. The engine remains the source of truth, but the LLM is no longer purely a presentation layer. It can act, decide, and reason — within deliberately drawn limits.

What this project demonstrates

Tool-calling agent — DM agent decides what populates each room via Python tools, never by mutating state directly
RAG-grounded lore oracle — answers world questions from a local vector store, refuses gracefully when knowledge isn't there
Persona-driven NPC dialog — two NPCs with distinct voices and bounded knowledge scopes
Constrained narrator — every action gets atmospheric prose, validated against constraints (HP leak detection, length, format)
Multi-role orchestration — four LLMs with one job each, no role bleed
Full observability — every prompt, tool call, raw response, and eval result is traceable
Local ML pipeline — ChromaDB + sentence-transformers, no cloud dependency for retrieval

Architecture

Player action → Engine resolves deterministically
                    ↓
         ┌──────────┼──────────┐
         ↓          ↓          ↓
    Narrator   DM Agent    NPC dialog
    (prose)    (tools)     (persona)
                    ↓
              RAG oracle
              (lore queries)

LLM Role	Pattern	Purpose
DM Agent	Tool-calling	Populates rooms via `spawn_enemy`, `add_item_to_room`, `describe_room`
Narrator	Structured JSON output	2-3 sentence prose per action, with eval
NPC Dialog	Persona injection + history	In-character conversations within knowledge bounds
Lore Oracle	RAG	Answers world questions from local lore docs

The engine never asks the LLM what should happen. The LLM only requests changes through tools, narrates outcomes, voices characters, or retrieves lore.

Project Structure

dnd_agentic_llm/
│
├── engine/                  # Pure Python — no LLM
│   ├── state.py             # All Pydantic models
│   ├── combat.py            # Graduated d20 combat
│   ├── actions.py           # Move, use, cast, flee
│   ├── map.py               # 3x3 dungeon grid
│   └── inventory.py         # Items + equipment
│
├── llm/
│   ├── dm_agent.py          # Tool-calling room populator
│   ├── narrator.py          # Constrained JSON narration
│   ├── npc.py               # Persona-driven dialog
│   ├── rag.py               # ChromaDB + sentence-transformers
│   ├── memory.py            # Rolling summarization
│   └── evaluator.py         # Output constraint checks
│
├── ui/
│   └── app.py               # Streamlit CRPG-style interface
│
├── game/
│   └── loop.py              # CLI fallback
│
├── data/
│   ├── lore/                # Markdown source documents
│   │   ├── world_lore.md
│   │   ├── bestiary.md
│   │   ├── item_glossary.md
│   │   └── npcs_and_rooms.md
│   └── chroma_db/           # Auto-generated vector store
│
├── tests/                   # 30+ unit tests
└── cli.py                  # CLI entry point

Quickstart

git clone https://github.com/yourname/dnd-agentic-llm.git
cd dnd-agentic-llm

python -m venv .venv
.venv\Scripts\activate           # Windows
# source .venv/bin/activate      # macOS / Linux

pip install -r requirements.txt

Set your API key in .env:

ANTHROPIC_API_KEY=your_key_here

Get one at console.anthropic.com.

Run the UI:

streamlit run ui/app.py

Run the CLI fallback:

python cli.py

First run will download the embedding model (~80MB) and build the vector index from the lore files. Takes ~30 seconds.

Running Tests

pytest tests/ -v

Tests cover the deterministic engine, tool execution, RAG chunking, evaluator, and NPC personas. LLM calls are not unit tested — they're validated via the in-app debug panel.

Key Features

The four LLM roles in action

DM Agent populates rooms when the player first enters them. It cannot mutate state directly — only request changes through tools:

spawn_enemy(room_id="1_1", enemy_type="goblin", count=2)
add_item_to_room(room_id="1_1", item_id="rusty_sword")
describe_room(room_id="1_1", description="Dust motes drift through...")

Narrator turns engine output into prose with strict constraints:

{
  "narration": "The blade finds its mark. The goblin staggers, wounded.",
  "tone": "tense",
  "success": true
}

NPCs speak from injected personas with bounded knowledge. Aldric the merchant talks like a nervous trader; Garron the veteran guard speaks in clipped military phrasing. Both refuse to discuss things outside their knows_about scope, in character.

Lore Oracle answers world questions via RAG. If the answer isn't in the retrieved chunks: "The chronicles do not speak of this." No hallucination. The LLM is constrained with prompt engineering (no hard mechanism). In this case, the LLM still has access to its knowledge but is required to ONLY use the retrieved lore chunks.

Observability

Every LLM call is logged to a debug panel:

Which prompt version produced what output
Which tools the DM agent called
Whether each narration passed evaluation (HP leak, sentence count, format)
Token usage and conversation summaries

Graceful degradation

API failures retry with exponential backoff. If retries exhaust, deterministic fallback narration fires. The game never crashes on a bad LLM response.

Design Rules (non-negotiable)

LLM never modifies directly state — only via tool calls, narration, or dialog
Engine resolves all mechanics — dice rolls, damage, movement
Narrator never invents facts — describes only what the engine produced
NPCs stay in scope — refuse cleanly when asked outside their knowledge
RAG oracle stays grounded — answers only from retrieved chunks (is constrained to lore)
Player owns agency — movement, item use, casting, talking, fighting are always player choices
DM agent owns the world — what's in rooms, who NPCs are, what events trigger

Roadmap

Middle and deeper layer content (already sketched in lore)
Trading system so Aldric can actually sell items
More NPCs (Mira the Sage, Lethiel of the Shards, Thimble the Fae)
Enemy AI variation (some flee, some call reinforcements)
Save/load runs
Multi-session memory across dungeon runs

Estimated API Cost

Using claude-haiku-4-5-20251001:

Activity	Approximate cost
One full dungeon run (20-30 actions)	< $0.05
Full evening of testing	< $0.50

Local RAG (ChromaDB + sentence-transformers) costs nothing per query.

Requirements

anthropic
streamlit
pydantic
python-dotenv
chromadb
sentence-transformers
pytest

Generate from your env:

pip freeze > requirements.txt

What I learned

The hard part wasn't the LLM, it was working around it. Most of the work went into state controls, schemas, and the evaluator — not prompt engineering. The LLM ends up behaving like any other unreliable service, and you can treat it like one.

Splitting work across multiple specialized LLMs beat one general-purpose call. Four narrow roles produced better output than one prompt trying to do everything, and the prompts themselves got simpler.

Prompt-only grounding worked, but I wouldn't trust it for higher stakes. Telling the oracle to "only use this context" was effective in practice for this project. I didn't stress-test it adversarially though, and for anything correctness-critical I'd want an enforceable mechanism, not a polite instruction. Observability paid for itself fast. Logging every prompt, tool call, and eval verdict turned "that response was weird" into reproducible bugs. Day-one work on any future LLM project.

What I'd do differently

Rethink the UI from scratch, not improve Streamlit. Most of my testing happened in the CLI, and it honestly felt better than the Streamlit interface. Streamlit is great for inspecting evaluator output and debug data — but a real frontend for this kind of game isn't a repurposed dashboard, it's a proper game UI. That's a different project.

Author the lore corpus for retrieval, not for humans. Right now it reads as natural prose, then gets chunked. Writing it as self-contained passages with consistent entity naming would likely improve grounding without touching any prompts.

Reach for a real agent framework on a v3. Hand making the orchestration was the right call for learning. But features like persistent memory across sessions, self-correction and more would justify using something more sophisticated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚔ DnD Agentic LLM — The Dungeon of Ludele

What this project demonstrates

Architecture

Project Structure

Quickstart

Running Tests

Key Features

The four LLM roles in action

Observability

Graceful degradation

Design Rules (non-negotiable)

Roadmap

Estimated API Cost

Requirements

What I learned

What I'd do differently

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
engine		engine
game		game
llm		llm
tests		tests
ui		ui
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
cli.py		cli.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

⚔ DnD Agentic LLM — The Dungeon of Ludele

What this project demonstrates

Architecture

Project Structure

Quickstart

Running Tests

Key Features

The four LLM roles in action

Observability

Graceful degradation

Design Rules (non-negotiable)

Roadmap

Estimated API Cost

Requirements

What I learned

What I'd do differently

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages