⚔ LLM Dungeon & Dragon

A dungeon crawler where game logic is fully deterministic and an LLM acts purely as a narrator.

Core idea: the engine decides truth. The LLM explains it.

LLM output is probabilistic and unreliable by default. It can hallucinate facts, drift from instructions, return malformed structure, or fail entirely. None of these are bugs; they are inherent properties of the model, and any system built on top of an LLM has to assume them as a baseline.

This project demonstrates how to engineer the controls around an LLM so that its output becomes dependable. The approach is to treat the LLM as a microservice behind a strict API contract: prompts define the request shape, Pydantic models define the response shape, validation enforces the contract on every call, and deterministic fallbacks fire when validation fails. The same discipline you would apply to any unreliable external service applies here; the LLM just happens to be the unreliable service.

What this project demonstrates

Clean separation between deterministic game logic and AI narration
Controlled, constrained use of LLMs — the model never touches game state
Structured JSON output with Pydantic validation and graceful fallback (the LLM is instructed to return JSON only)
Per-character voice profiles — prompt engineering for stylistically distinct narration
Prompt versioning — prompts treated as versioned artifacts, not hardcoded strings
Automatic evaluation layer — every narration checked against its own constraints
Bounded context management via rolling memory summarization
Token budget tracking — context usage estimated and logged per call
Production-grade reliability — exponential backoff retry on API failure
Full observability — every prompt, raw response, eval result, and token count is traceable
Session export — full debug data downloadable as JSON for offline analysis
Modular, extensible architecture — multiple enemies, defend, flee, all added without touching the LLM layer

Architecture

Button click → engine step → LLM narration (retry + memory + voice) → eval → UI refresh

Layer	Role	Tech
`engine/`	Dice rolls, combat, actions, state	Pure Python + Pydantic
`llm/`	Narration, prompts, memory, eval, registry	Anthropic API
`ui/`	Interactive interface	Streamlit
`tests/`	Engine + evaluator + prompt registry tests	pytest

The LLM never modifies state. State is the single source of truth.

Project Structure

dnd_llm/
│
├── engine/
│   ├── state.py        # GameState, Fighter, ActionLog, NarrationResult,
│   │                   # DebugEntry, NarrationEval
│   ├── combat.py       # Dice rolls, player/enemy attack (multi-enemy)
│   └── actions.py      # Defend and flee resolvers
│
├── llm/
│   ├── narrator.py     # narrate() — retry logic, returns (result, raw, eval, context, version)
│   ├── prompts.py      # Templates, state serializer, voice profiles, token budget
│   ├── prompt_registry.py  # Versioned prompt store (v1.0, v1.1)
│   ├── evaluator.py    # Automatic constraint checking on every narration
│   └── memory.py       # Rolling summarization, context management
│
├── ui/
│   └── app.py          # Streamlit UI — scenario selector, eval bar,
│                       # debug panel, streaming toggle, session export
│
├── game/
│   └── loop.py         # CLI fallback for engine debugging
│
├── tests/
│   ├── test_combat.py
│   ├── test_prompts.py
│   ├── test_evaluator.py
│   └── test_prompt_registry.py
│
├── main.py             # CLI entry point
├── .env                # API key (not committed)
└── README.md

Quickstart

1. Clone and install

git clone https://github.com/yourname/dnd_llm.git
cd dnd_llm

python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS / Linux

pip install -r requirements.txt

2. Set your API key

Create a .env file at the project root:

ANTHROPIC_API_KEY=your_key_here

Get a key at console.anthropic.com.

3. Run the UI

streamlit run ui/app.py

4. Or run the CLI

python main.py

Running Tests

pytest tests/ -v

28 tests covering the deterministic engine, prompt parsing, evaluation layer, and prompt registry. LLM calls are not unit tested — they are validated via the CLI smoke test, the in-app eval summary, and session export.

Features

Scenario selector

Choose your enemy before the fight begins. Each scenario has distinct stats, enemy count, and narration voice:

Scenario	Enemies	Voice
Goblin Ambush	Goblin + Goblin Scout	Chaotic, crude, desperate
Orc Warlord	Orc Warlord	Brutal, proud, honour-focused
Skeleton Guard	Skeleton + Skeleton Archer	Cold, mechanical, ancient
Ancient Dragon	Ancient Dragon	Contemptuous, grand, slow

Multiple enemies

All living enemies attack each turn. Victory requires defeating every enemy. The engine loops over the enemy list — the LLM layer needed zero changes to support this.
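
A minimal sketch of what looping over the enemy list might look like. The `Fighter` dataclass, `enemy_phase`, and `is_victory` are hypothetical stand-ins rather than the project's actual engine code, and the flat d6 damage with no to-hit roll is an assumption made to keep the example short.

```python
import random
from dataclasses import dataclass


@dataclass
class Fighter:
    """Hypothetical stand-in for the engine's Fighter model."""
    name: str
    hp: int

    @property
    def alive(self) -> bool:
        return self.hp > 0


def enemy_phase(player: Fighter, enemies: list[Fighter], rng: random.Random) -> list[dict]:
    """Let every living enemy attack once and return one log entry per attack."""
    log = []
    for enemy in enemies:
        if not enemy.alive:
            continue  # defeated enemies skip their turn
        damage = rng.randint(1, 6)  # assumed d6 damage, no to-hit roll in this sketch
        player.hp = max(0, player.hp - damage)
        log.append({"actor": enemy.name, "damage": damage})
    return log


def is_victory(enemies: list[Fighter]) -> bool:
    """Victory requires every enemy to be defeated."""
    return all(not enemy.alive for enemy in enemies)
```

Because the function returns plain log entries, the narration layer only ever sees what already happened, which is why adding enemies needs no change on the LLM side.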

Combat actions

Action	Effect
⚔ Attack	Player attacks all living enemies. Each enemy counterattacks.
🛡 Defend	Incoming damage halved this turn. All enemies still attack.
🏃 Flee	Roll d20 ≥ 10 to escape. Failure = free enemy attack.
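
The defend and flee rules in the table can be sketched as two small resolvers. The function names are hypothetical, rounding halved damage down is an assumption, and the random generator is passed in so that tests can seed it.

```python
import random

FLEE_DC = 10  # from the table above: a d20 roll of 10 or more escapes


def roll_d20(rng: random.Random) -> int:
    """Roll one twenty-sided die using the supplied generator."""
    return rng.randint(1, 20)


def resolve_flee(rng: random.Random) -> dict:
    """Return the roll and whether the player escaped."""
    roll = roll_d20(rng)
    return {"roll": roll, "escaped": roll >= FLEE_DC}


def apply_defend(incoming_damage: int) -> int:
    """Halve incoming damage while defending (rounding down is an assumption)."""
    return incoming_damage // 2
```

Seeding the generator, for example `resolve_flee(random.Random(42))`, gives the same roll on every run, which keeps engine tests reproducible.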

Prompt versioning

Prompts are versioned artifacts stored in llm/prompt_registry.py:

Version	What changed
v1.0	Basic constraints only — no voice injection
v1.1	Per-character voice profile + few-shot example injected at runtime

Select the active version from the UI before the fight. Each debug entry logs which version produced it.
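
One way a versioned prompt store could be laid out is shown below. This is a sketch, not the contents of llm/prompt_registry.py: `PromptVersion`, `get_prompt`, `render`, and the template wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    template: str
    uses_voice: bool  # whether a voice profile is injected at runtime


# Template wording here is illustrative, not the project's actual prompt text.
_REGISTRY = {
    "v1.0": PromptVersion(
        version="v1.0",
        template="Narrate the outcome in 2-3 sentences. Respond with JSON only.",
        uses_voice=False,
    ),
    "v1.1": PromptVersion(
        version="v1.1",
        template="Narrate the outcome in 2-3 sentences. Respond with JSON only.\n{voice_block}",
        uses_voice=True,
    ),
}


def get_prompt(version: str) -> PromptVersion:
    """Look up a prompt by version, failing loudly on unknown versions."""
    if version not in _REGISTRY:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown prompt version {version!r}; known versions: {known}")
    return _REGISTRY[version]


def render(version: str, voice_block: str = "") -> str:
    """Fill the template; versions without voice injection ignore voice_block."""
    prompt = get_prompt(version)
    if prompt.uses_voice:
        return prompt.template.format(voice_block=voice_block)
    return prompt.template
```

Keeping the version string on the record itself makes it easy to copy into each debug entry.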

Tone-colored game log

Narration entries are colored by the LLM-returned tone field:

🟠 tense — the fight is balanced
🟢 victorious — a decisive blow
🔴 grim — taking heavy damage
⚫ neutral — a miss or uneventful turn

Automatic evaluation layer

Every narration is checked against the constraints defined in the system prompt:

Check	What it catches
`hp_mentioned`	LLM leaking raw HP numbers into prose
`sentence_count`	Output outside the 2-3 sentence constraint
`format_valid`	JSON response that couldn't be parsed
`fallback_used`	API failure or total parse failure

A live eval summary bar shows total turns, passes, and pass rate across the session.
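
The four checks could be implemented roughly as follows. This is a sketch under assumptions: the regular expression for HP leaks, the sentence splitter, and the `evaluate` signature are illustrative, not the project's evaluator.

```python
import re

# Matches "12 HP", "3 hit points", or a bare "hp" (assumed definition of a leak).
HP_PATTERN = re.compile(r"\b\d+\s*(?:hp|hit\s+points?)\b|\bhp\b", re.IGNORECASE)


def count_sentences(text: str) -> int:
    """Count sentences by splitting on whitespace that follows ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def evaluate(narration: str, format_valid: bool, fallback_used: bool) -> dict:
    """Return overall pass/fail plus the names of the checks that failed."""
    failed = []
    if HP_PATTERN.search(narration):
        failed.append("hp_mentioned")
    if not 2 <= count_sentences(narration) <= 3:
        failed.append("sentence_count")
    if not format_valid:
        failed.append("format_valid")
    if fallback_used:
        failed.append("fallback_used")
    return {"passed": not failed, "failed": failed}
```

A session pass rate is then the number of results with `passed` set to true divided by the number of turns.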

Retry logic with exponential backoff

A failed API call is retried up to 3 times with exponential backoff (waits of 1s, 2s, and 4s) before the narrator falls back to a deterministic narration. The game never crashes on a transient API error.
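
A minimal sketch of the retry pattern, reading the description as one initial attempt followed by up to three retries. `call_with_retry` and its parameters are hypothetical names; the `sleep` argument is injectable so tests do not have to wait, and real code would catch the API client's specific error types rather than a bare `Exception`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try call(); on failure wait 1s, 2s, 4s between retries, then use fallback()."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:  # sketch only: narrow this to the client's error types
            if attempt == retries:
                break  # out of retries, do not sleep again
            sleep(base_delay * 2 ** attempt)
    return fallback()
```

The fallback is a callable rather than a value so that the deterministic narration is only built when it is actually needed.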

Token budget tracking

Every narration call estimates token usage (system prompt + user prompt) and logs it against the context limit. Visible in the debug panel per turn.
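
Token estimation without a tokenizer can be approximated with a character-count heuristic. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not the project's documented method; `estimate_tokens` and `token_usage` are hypothetical names, and the context limit is left as a required argument rather than guessed.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def token_usage(system_prompt: str, user_prompt: str, context_limit: int) -> dict:
    """Estimate tokens for one call and report them against the context limit."""
    used = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
    return {
        "used": used,
        "limit": context_limit,
        "fraction": used / context_limit,
    }
```

Logging the returned dictionary with each debug entry is enough to show usage against budget per turn.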

Observability debug panel

Expand the debug panel at any point to inspect every turn:

Pass/fail status with per-check breakdown
Prompt version used
Token usage vs context budget
Exact state snapshot sent (semantic labels, not raw numbers)
Full prompt sent to the LLM
Raw JSON string the model returned
Parsed and validated NarrationResult

Session export

Download the complete session as a structured JSON file — including all prompts, raw responses, eval results, and token counts.

Design Rules

These are non-negotiable constraints that keep the architecture clean:

LLM never modifies state — narration is read-only
Dice rolls belong to the engine: Python random only, never the LLM
State is the single source of truth — Pydantic model, no duplication
UI is stateless — renders current state only, no logic
Prompts describe outcomes — the LLM is never asked to decide what happens
State is serialized to semantic labels before reaching the LLM — hp=3 becomes "bloodied", not a raw number
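
The last rule can be illustrated with a small serializer. The thresholds and the labels "healthy", "wounded", and "defeated" are assumptions chosen so that hp=3 out of an assumed maximum of 10 maps to "bloodied"; `hp_label` and `serialize_fighter` are hypothetical names.

```python
def hp_label(hp: int, max_hp: int) -> str:
    """Map raw hit points to a semantic label (thresholds are illustrative)."""
    if max_hp <= 0:
        raise ValueError("max_hp must be positive")
    if hp <= 0:
        return "defeated"
    ratio = hp / max_hp
    if ratio > 0.75:
        return "healthy"
    if ratio > 0.5:
        return "wounded"
    if ratio > 0.25:
        return "bloodied"
    return "critically wounded"


def serialize_fighter(name: str, hp: int, max_hp: int) -> dict:
    """Build the snapshot sent to the LLM: a label only, never the raw number."""
    return {"name": name, "condition": hp_label(hp, max_hp)}
```

Since the model never sees a number, it has nothing numeric to leak, which makes the `hp_mentioned` check a backstop rather than the first line of defence.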

LLM Prompt Strategy

The narrator receives a structured prompt built from three components:

Memory block — a rolling summary of older turns + the last 3 turns verbatim
Game state — semantic labels (e.g. "critically wounded") not raw numbers
Action result — what the engine computed (roll, damage, actor)
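
The memory block described in the first component can be sketched as below. The `summarize` callable is injected and stands in for however the project actually condenses older turns; `roll_memory`, `build_memory_block`, and the "Earlier:" prefix are hypothetical.

```python
from typing import Callable


def roll_memory(
    summary: str,
    turns: list[str],
    summarize: Callable[[str, list[str]], str],
    keep_last: int = 3,
) -> tuple[str, list[str]]:
    """Fold turns older than the last keep_last into the running summary."""
    if len(turns) <= keep_last:
        return summary, list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return summarize(summary, older), recent


def build_memory_block(summary: str, recent: list[str]) -> str:
    """Render the summary (if any) followed by the recent turns verbatim."""
    lines = []
    if summary:
        lines.append(f"Earlier: {summary}")
    lines.extend(f"- {turn}" for turn in recent)
    return "\n".join(lines)
```

Because only the summary and a fixed number of turns are kept, the prompt size stays bounded no matter how long the fight runs.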

The system prompt is generated dynamically per enemy type and prompt version, injecting a voice profile and a few-shot example:

The enemy in this scene is an orc warlord.
Narrate their actions in a brutal and proud voice.
Vocabulary guidance: heavy, deliberate, honour-focused.
Example of correct tone: "The orc advances without flinching, each blow a statement of dominance."
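
A sketch of how the voice block above might be assembled from a profile table. `VOICE_PROFILES` and `build_voice_block` are hypothetical names; only the orc warlord entry is taken from the example, and returning an empty string for an unknown enemy is an assumption.

```python
VOICE_PROFILES = {
    "orc warlord": {
        "voice": "brutal and proud",
        "vocabulary": "heavy, deliberate, honour-focused",
        "example": "The orc advances without flinching, each blow a statement of dominance.",
    },
}


def indefinite_article(noun: str) -> str:
    """Pick 'a' or 'an' from the first letter (good enough for this sketch)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def build_voice_block(enemy_type: str) -> str:
    """Render the voice section of the system prompt, or '' if no profile exists."""
    profile = VOICE_PROFILES.get(enemy_type)
    if profile is None:
        return ""  # assumed behaviour: fall back to a prompt without voice injection
    example = profile["example"]
    lines = [
        f"The enemy in this scene is {indefinite_article(enemy_type)} {enemy_type}.",
        f"Narrate their actions in a {profile['voice']} voice.",
        f"Vocabulary guidance: {profile['vocabulary']}.",
        f'Example of correct tone: "{example}"',
    ]
    return "\n".join(lines)
```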

Output is constrained to valid JSON:

{
  "narration": "2-3 sentence description",
  "tone": "tense" | "victorious" | "grim" | "neutral",
  "hit": true | false
}

If the model returns invalid JSON after retries, a deterministic fallback fires — the game never crashes on a bad LLM response.
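
The validate-then-fallback step can be sketched as follows. The project uses Pydantic for this; the example swaps in a standard-library dataclass with hand-written checks so that it runs without dependencies. `parse_narration`, `fallback_narration`, `narrate_or_fallback`, and the fallback wording are hypothetical.

```python
import json
from dataclasses import dataclass

ALLOWED_TONES = {"tense", "victorious", "grim", "neutral"}


@dataclass(frozen=True)
class NarrationResult:
    narration: str
    tone: str
    hit: bool


def parse_narration(raw: str) -> NarrationResult:
    """Parse and validate the model's raw reply; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    narration, tone, hit = data.get("narration"), data.get("tone"), data.get("hit")
    if not isinstance(narration, str) or not narration.strip():
        raise ValueError("narration must be a non-empty string")
    if not isinstance(tone, str) or tone not in ALLOWED_TONES:
        raise ValueError(f"unexpected tone: {tone!r}")
    if not isinstance(hit, bool):
        raise ValueError("hit must be a boolean")
    return NarrationResult(narration=narration, tone=tone, hit=hit)


def fallback_narration(actor: str, hit: bool) -> NarrationResult:
    """Deterministic narration built only from what the engine computed."""
    if hit:
        text = f"{actor} strikes true. The blow lands."
    else:
        text = f"{actor} swings and misses. The fight goes on."
    return NarrationResult(narration=text, tone="neutral", hit=hit)


def narrate_or_fallback(raw: str, actor: str, hit: bool) -> tuple[NarrationResult, bool]:
    """Return (result, fallback_used); never raises on a bad model reply."""
    try:
        return parse_narration(raw), False
    except ValueError:
        return fallback_narration(actor, hit), True
```

The fallback takes `hit` from the engine, not from the model, so even a degraded turn reports the true outcome.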

Estimated API Cost

Using claude-haiku-4-5-20251001 (~2 calls per turn):

Session length	Estimated cost
10 turns	< $0.01
Full evening of testing	< $0.10

Roadmap

True streaming narration (custom text format, parse after stream closes)
Prompt v1.2 — A/B comparison mode in UI
Inventory and item system
Spells and abilities
Procedural dungeon rooms
Save / load game state
Project 2 — agentic LLM system with tool-calling and RAG (separate repo)

Requirements

anthropic
streamlit
pydantic
python-dotenv
pytest

To regenerate requirements.txt after changing dependencies:

pip freeze > requirements.txt
