Skip to content

ldele/multi-agent-llm-system

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

⚔ DnD Agentic LLM — The Dungeon of Ludele

An agentic dungeon crawler demonstrating four distinct LLM roles orchestrated against a deterministic game engine. The LLM is the dungeon master. This version stops at 1st floor.

Project 1 showed where not to trust LLMs (constrained narration only). Project 1 link. Project 2 shows where you can — through tool-calling, RAG, and persona dialog.

Project 1 demonstrated how to make LLM output reliable by treating the model as a service behind a strict API contract. The same discipline applies here — every LLM call still goes through a typed contract with validation and fallback. But this framing is no longer the end goal of the project.

We want to give the LLM real agency in a small, controlled way. Instead of one role (narration), the system orchestrates four: a tool-calling agent that decides what populates each room, a constrained narrator, a persona-driven NPC dialog system, and a retrieval-augmented oracle for world lore. Each role expands what the LLM is allowed to do while a different mechanism keeps it bounded — a tool catalog, a Pydantic schema, a knowledge scope, or a vector store. The engine remains the source of truth, but the LLM is no longer purely a presentation layer. It can act, decide, and reason — within deliberately drawn limits.


What this project demonstrates

  • Tool-calling agent — DM agent decides what populates each room via Python tools, never by mutating state directly
  • RAG-grounded lore oracle — answers world questions from a local vector store, refuses gracefully when knowledge isn't there
  • Persona-driven NPC dialog — two NPCs with distinct voices and bounded knowledge scopes
  • Constrained narrator — every action gets atmospheric prose, validated against constraints (HP leak detection, length, format)
  • Multi-role orchestration — four LLMs with one job each, no role bleed
  • Full observability — every prompt, tool call, raw response, and eval result is traceable
  • Local ML pipeline — ChromaDB + sentence-transformers, no cloud dependency for retrieval

Architecture

Player action → Engine resolves deterministically
                    ↓
         ┌──────────┼──────────┐
         ↓          ↓          ↓
    Narrator   DM Agent    NPC dialog
    (prose)    (tools)     (persona)
                    ↓
              RAG oracle
              (lore queries)
LLM Role Pattern Purpose
DM Agent Tool-calling Populates rooms via spawn_enemy, add_item_to_room, describe_room
Narrator Structured JSON output 2-3 sentence prose per action, with eval
NPC Dialog Persona injection + history In-character conversations within knowledge bounds
Lore Oracle RAG Answers world questions from local lore docs

The engine never asks the LLM what should happen. The LLM only requests changes through tools, narrates outcomes, voices characters, or retrieves lore.


Project Structure

dnd_agentic_llm/
│
├── engine/                  # Pure Python — no LLM
│   ├── state.py             # All Pydantic models
│   ├── combat.py            # Graduated d20 combat
│   ├── actions.py           # Move, use, cast, flee
│   ├── map.py               # 3x3 dungeon grid
│   └── inventory.py         # Items + equipment
│
├── llm/
│   ├── dm_agent.py          # Tool-calling room populator
│   ├── narrator.py          # Constrained JSON narration
│   ├── npc.py               # Persona-driven dialog
│   ├── rag.py               # ChromaDB + sentence-transformers
│   ├── memory.py            # Rolling summarization
│   └── evaluator.py         # Output constraint checks
│
├── ui/
│   └── app.py               # Streamlit CRPG-style interface
│
├── game/
│   └── loop.py              # CLI fallback
│
├── data/
│   ├── lore/                # Markdown source documents
│   │   ├── world_lore.md
│   │   ├── bestiary.md
│   │   ├── item_glossary.md
│   │   └── npcs_and_rooms.md
│   └── chroma_db/           # Auto-generated vector store
│
├── tests/                   # 30+ unit tests
└── cli.py                  # CLI entry point

Quickstart

git clone https://github.com/yourname/dnd-agentic-llm.git
cd dnd-agentic-llm

python -m venv .venv
.venv\Scripts\activate           # Windows
# source .venv/bin/activate      # macOS / Linux

pip install -r requirements.txt

Set your API key in .env:

ANTHROPIC_API_KEY=your_key_here

Get one at console.anthropic.com.

Run the UI:

streamlit run ui/app.py

Run the CLI fallback:

python cli.py

First run will download the embedding model (~80MB) and build the vector index from the lore files. Takes ~30 seconds.


Running Tests

pytest tests/ -v

Tests cover the deterministic engine, tool execution, RAG chunking, evaluator, and NPC personas. LLM calls are not unit tested — they're validated via the in-app debug panel.


Key Features

The four LLM roles in action

DM Agent populates rooms when the player first enters them. It cannot mutate state directly — only request changes through tools:

spawn_enemy(room_id="1_1", enemy_type="goblin", count=2)
add_item_to_room(room_id="1_1", item_id="rusty_sword")
describe_room(room_id="1_1", description="Dust motes drift through...")

Narrator turns engine output into prose with strict constraints:

{
  "narration": "The blade finds its mark. The goblin staggers, wounded.",
  "tone": "tense",
  "success": true
}

NPCs speak from injected personas with bounded knowledge. Aldric the merchant talks like a nervous trader; Garron the veteran guard speaks in clipped military phrasing. Both refuse to discuss things outside their knows_about scope, in character.

Lore Oracle answers world questions via RAG. If the answer isn't in the retrieved chunks: "The chronicles do not speak of this." No hallucination. The LLM is constrained with prompt engineering (no hard mechanism). In this case, the LLM still has access to its knowledge but is required to ONLY use the retrieved lore chunks.

Observability

Every LLM call is logged to a debug panel:

  • Which prompt version produced what output
  • Which tools the DM agent called
  • Whether each narration passed evaluation (HP leak, sentence count, format)
  • Token usage and conversation summaries

Graceful degradation

API failures retry with exponential backoff. If retries exhaust, deterministic fallback narration fires. The game never crashes on a bad LLM response.


Design Rules (non-negotiable)

  1. LLM never modifies directly state — only via tool calls, narration, or dialog
  2. Engine resolves all mechanics — dice rolls, damage, movement
  3. Narrator never invents facts — describes only what the engine produced
  4. NPCs stay in scope — refuse cleanly when asked outside their knowledge
  5. RAG oracle stays grounded — answers only from retrieved chunks (is constrained to lore)
  6. Player owns agency — movement, item use, casting, talking, fighting are always player choices
  7. DM agent owns the world — what's in rooms, who NPCs are, what events trigger

Roadmap

  • Middle and deeper layer content (already sketched in lore)
  • Trading system so Aldric can actually sell items
  • More NPCs (Mira the Sage, Lethiel of the Shards, Thimble the Fae)
  • Enemy AI variation (some flee, some call reinforcements)
  • Save/load runs
  • Multi-session memory across dungeon runs

Estimated API Cost

Using claude-haiku-4-5-20251001:

Activity Approximate cost
One full dungeon run (20-30 actions) < $0.05
Full evening of testing < $0.50

Local RAG (ChromaDB + sentence-transformers) costs nothing per query.


Requirements

anthropic
streamlit
pydantic
python-dotenv
chromadb
sentence-transformers
pytest

Generate from your env:

pip freeze > requirements.txt

What I learned

The hard part wasn't the LLM, it was working around it. Most of the work went into state controls, schemas, and the evaluator — not prompt engineering. The LLM ends up behaving like any other unreliable service, and you can treat it like one.

Splitting work across multiple specialized LLMs beat one general-purpose call. Four narrow roles produced better output than one prompt trying to do everything, and the prompts themselves got simpler.

Prompt-only grounding worked, but I wouldn't trust it for higher stakes. Telling the oracle to "only use this context" was effective in practice for this project. I didn't stress-test it adversarially though, and for anything correctness-critical I'd want an enforceable mechanism, not a polite instruction. Observability paid for itself fast. Logging every prompt, tool call, and eval verdict turned "that response was weird" into reproducible bugs. Day-one work on any future LLM project.

What I'd do differently

Rethink the UI from scratch, not improve Streamlit. Most of my testing happened in the CLI, and it honestly felt better than the Streamlit interface. Streamlit is great for inspecting evaluator output and debug data — but a real frontend for this kind of game isn't a repurposed dashboard, it's a proper game UI. That's a different project.

Author the lore corpus for retrieval, not for humans. Right now it reads as natural prose, then gets chunked. Writing it as self-contained passages with consistent entity naming would likely improve grounding without touching any prompts.

Reach for a real agent framework on a v3. Hand making the orchestration was the right call for learning. But features like persistent memory across sessions, self-correction and more would justify using something more sophisticated.

About

Working in a fun way with LLMs agents and RAG.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages