agenteval-live

A QA workbench for agentic AI. It turns live agent traces into a workflow QA teams already know — triage → reproduce → mutate → fix → regression — and puts statistical quality gates in front of every release.

What is this?

agenteval-live is an open-source QA tool built specifically for testing AI agents — the kind that plan, call tools, hold memory, and act on the outside world. It treats traces and sessions as test evidence, not just observability data, and gives QA, QE, and SDET teams a workflow they recognize: triage → reproduce → mutate → fix → regression.

It organizes agent QA into five domains, so coverage is something you can see and reason about rather than guess at:

#	Domain	Question
1	Policy Boundary	Can the system do this?
2	Reasoning & Orchestration	Did it choose a reasonable path?
3	Memory & Context	Did it use the right context?
4	Action Layer	Did it act correctly on the world?
5	Telemetry & Continuous Eval	How do we know what really happened?

For each domain you get risks, evidence, oracles, roles, and tools — turned into actual screens, not slideware.

What you can do with it

📡 Stream agent traces in real time via OpenTelemetry — point your agents at the collector and watch them run.
🔍 Inspect every trajectory span-by-span, with domain-tagged annotations and timing budgets.
🔄 Mutate and replay traces — edit user input and re-run the judge to see how verdicts change without re-invoking the agent.
🛡️ Run adversarial datasets against any deployed agent — jailbreaks, prompt injection, PII leakage suites.
🧪 Contract-test tool calls with Tool Call Accuracy / F1 over N runs.
📏 Define and manage journey invariants from the UI — required spans, forbidden spans, timing budgets — and dry-run them against any trace before saving.
🩺 Find root causes across domains with Why Judge — trace a failure back to the domain and span that caused it.
🔁 Convert any failed trace into a regression test with one click.
📊 Compare two versions of an agent (prompt / model / tool def) over N runs with statistical thresholds, not single-run asserts.
✅ Wire quality gates into CI/CD via the included CLI.
🎭 Test for fairness across 14 synthetic-user personas spanning language, age, tech-literacy, and intent.
📋 Score your team's coverage of all five domains in a single view.
⚙️ Configure evaluators per project — toggle built-in evaluators on/off and write declarative custom rules (span presence, attribute matching, thresholds, ordering).
✏️ Customize the LLM judge prompt per project — override the system prompt and reset to default from the Settings page.
🗂️ Organize by project — filter every view (Live Console, Policy Lab, Traces) by project. Rename or delete projects and manage pass-rate alert webhooks from Settings.
🏠 Project dashboard — land on a per-project overview when you switch projects: trace count, pass rate, active evaluators, invariant count, last 5 traces, and quick links to every page.
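
To make the journey-invariant item in the list above concrete, here is a minimal sketch of a dry run that checks required spans, forbidden spans, and timing budgets against one trace. The `Span` and `Invariant` shapes, the `dry_run` function, and the span names are hypothetical illustrations, not the project's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float


@dataclass
class Invariant:
    # Hypothetical rule shape; the project's real invariant schema may differ.
    required_spans: set[str] = field(default_factory=set)
    forbidden_spans: set[str] = field(default_factory=set)
    budget_ms: dict[str, float] = field(default_factory=dict)


def dry_run(invariant: Invariant, trace: list[Span]) -> list[str]:
    """Return readable violations; an empty list means the trace passes."""
    seen = {span.name for span in trace}
    violations = [
        f"missing required span: {name}"
        for name in sorted(invariant.required_spans - seen)
    ]
    violations += [
        f"forbidden span present: {name}"
        for name in sorted(invariant.forbidden_spans & seen)
    ]
    for span in trace:
        budget = invariant.budget_ms.get(span.name)
        if budget is not None and span.duration_ms > budget:
            violations.append(
                f"{span.name} took {span.duration_ms:.0f} ms (budget {budget:.0f} ms)"
            )
    return violations


# Example trace and rule (illustrative names only).
trace = [Span("guardrail.check", 40), Span("tool_call.issue_refund", 900)]
rule = Invariant(
    required_spans={"guardrail.check", "plan.route"},
    forbidden_spans={"tool_call.delete_account"},
    budget_ms={"tool_call.issue_refund": 500},
)
for violation in dry_run(rule, trace):
    print(violation)
```

Returning a list of violations rather than a boolean keeps the dry run useful for editing a rule: you see every reason a trace would fail before saving.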

Five runnable mock agents ship out of the box, so you can start testing without writing an agent first.

Quickstart

Docker (recommended)

git clone https://github.com/ylaufer/agenteval-live.git
cd agenteval-live

# 1. Configure — add your ANTHROPIC_API_KEY for LLM-as-judge
cp .env.example .env
# edit .env

# 2. Build the frontend, then start all services
pnpm install
cd apps/web && pnpm build && cd ../..
docker compose build
docker compose up -d

# 3. Open the UI
open http://localhost:3000        # macOS
# xdg-open http://localhost:3000  # Linux
# start http://localhost:3000     # Windows

The OTLP collector listens on localhost:4318 (HTTP) and 4317 (gRPC). Point any OpenTelemetry-instrumented agent there and you're done.

Self-hosted deployment security: By default the ingest endpoint (POST /v1/traces) accepts any request. That is acceptable for local development, where the API binds to 127.0.0.1. To expose it on the network you must do one of three things: set INGEST_SECRET=<strong-random-value> (the API then requires X-Ingest-Secret: <value> on every write), set HOSTED_MODE=true (API key auth), or explicitly opt in via ALLOW_UNAUTH_NETWORK=true (only safe when an upstream proxy handles auth). The startup validator refuses to launch if BIND_HOST is non-loopback and none of the three is configured.
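
The startup rule described above can be sketched as a small pure function. This is an illustration of the documented behavior only, not the project's validator: the function names, the `127.0.0.1` default for `BIND_HOST`, and the treatment of unresolvable hostnames as network-exposed are assumptions.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """True for localhost and loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Assumption: any other hostname is treated as network-exposed.
        return False


def ingest_config_ok(env: dict[str, str]) -> bool:
    """True if the API may start with this environment (sketch of the rule)."""
    if is_loopback(env.get("BIND_HOST", "127.0.0.1")):
        return True
    return bool(
        env.get("INGEST_SECRET")
        or env.get("HOSTED_MODE", "").lower() == "true"
        or env.get("ALLOW_UNAUTH_NETWORK", "").lower() == "true"
    )


print(ingest_config_ok({"BIND_HOST": "0.0.0.0"}))
print(ingest_config_ok({"BIND_HOST": "0.0.0.0", "INGEST_SECRET": "s3cr3t"}))
```

The first call reports `False` (network bind with no protection configured) and the second `True`.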

Why the explicit frontend build? The web Dockerfile copies pre-built .next/ artifacts rather than running next build itself. Re-run the build step (below) whenever you change web code or add a frontend dependency.

Updating after frontend changes

pnpm install                          # sync lockfile if package.json changed
cd apps/web && pnpm build && cd ../.. # rebuild .next/ artifacts
docker compose build web              # bake artifacts into the image
docker compose up -d web              # restart the container

Local development (no Docker)

# 1. Create and activate a virtualenv
python -m venv .venv
source .venv/bin/activate           # macOS / Linux / Git Bash
# .venv\Scripts\activate            # Windows PowerShell / cmd

# 2. Install Python dependencies
pip install --upgrade pip
pip install -e ".[dev]"

# 3. Install frontend deps
cd apps/web && pnpm install && cd ../..

# 4. Apply migrations and run dev servers
make migrate
make dev

Running the bundled mocks

Five deterministic mock agents ship with realistic data — products, order IDs, medical terminology, and multi-turn scenarios spanning all five QA domains:

source .venv/bin/activate

# Domain 1: Refund Bot (policy boundary + guardrails)
python -m mocks.refund_bot.agent --case happy              # ✅ pass
python -m mocks.refund_bot.agent --case cross_tenant       # ✅ pass (with recovery)
python -m mocks.refund_bot.agent --case jailbreak_basic    # ✅ pass (blocked cleanly, injection_attempt_v1 → pass)
python -m mocks.refund_bot.agent --all-happy               # 8 cases

# Domain 1: Healthcare Triage (policy boundary + scope)
python -m mocks.healthcare_triage.agent --case intake_clean_es         # ✅ pass (Spanish)
python -m mocks.healthcare_triage.agent --case oos_diagnosis           # ✅ pass (denied, in scope)
python -m mocks.healthcare_triage.agent --case injection_role_override # ❌ fail (blocked)
python -m mocks.healthcare_triage.agent --all                          # 10 cases

# Domain 3: Memory Probe (multi-turn, freshness, cross-user)
python -m mocks.memory_probe.agent --case mt-recall-001   # ✅ pass (multi-turn recall)
python -m mocks.memory_probe.agent --case stale-001       # ❌ fail (doc 72h old)
python -m mocks.memory_probe.agent --case leak-001        # ❌ fail (cross-user retrieval)
python -m mocks.memory_probe.agent --all-pass             # 3 pass cases

# Domain 2: Research Crew (reasoning, multi-agent coordination)
python -m mocks.research_crew.agent --case happy_market_analysis   # ✅ pass
python -m mocks.research_crew.agent --case shallow_analysis_trend  # ❌ fail (incomplete analysis)
python -m mocks.research_crew.agent --all-happy                    # 5 happy + recovery cases

# Fairness Testing: Synthetic Users (14 personas across language/age/tech/intent)
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --all-personas
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --genuine-only
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --adversarial-only

Then watch the Live Console at http://localhost:3000/live. After ~5 seconds, each trace is scored by the LLM judge and verdicts appear in Policy Lab (/policy-lab) and in the Trajectory Inspector (/trace/{id}).

Bring your own agent

The collector speaks standard OTLP. Any agent already instrumented with OpenTelemetry will work — Python, TypeScript, Go, Java, anything.
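
For a quick smoke test without any instrumentation library, a single span can be hand-built in the JSON encoding of OTLP/HTTP and posted to the standard `/v1/traces` path. The sketch below only constructs the request. It assumes the collector accepts the JSON encoding (not confirmed for this deployment), and `build_otlp_request` and the scope name are hypothetical.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_request(service_name, span_name, endpoint="http://localhost:4318"):
    """Build (but do not send) an OTLP/HTTP JSON export containing one span."""
    now = time.time_ns()
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 50_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }
    return urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_otlp_request("my-support-agent", "plan.route")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```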

If you're starting from zero, the included Python SDK is the lowest-friction path:

from agenteval_live.sdk import instrument

instrument(
    service_name="my-support-agent",
    endpoint="http://localhost:4318",
    domain_hints={
        "tool_call.*": "action",
        "guardrail.*": "policy",
        "retrieve.*":  "memory",
        "plan.*":      "reasoning",
    },
)

# Your agent code stays unchanged. Spans are emitted automatically.
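
One way to read the `domain_hints` mapping above is as glob patterns matched against span names. The sketch below assumes shell-style matching via the standard library's `fnmatch` and first-match-wins ordering. The SDK's real matching rules are not documented here, and `resolve_domain` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

DOMAIN_HINTS = {
    "tool_call.*": "action",
    "guardrail.*": "policy",
    "retrieve.*": "memory",
    "plan.*": "reasoning",
}


def resolve_domain(span_name, hints=DOMAIN_HINTS, default=None):
    """Return the domain of the first pattern that matches the span name."""
    for pattern, domain in hints.items():
        if fnmatchcase(span_name, pattern):
            return domain
    return default


for name in ("tool_call.issue_refund", "retrieve.kb_search", "llm.generate"):
    print(name, "->", resolve_domain(name))
```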

Adapters are provided for: LangChain · LangGraph · CrewAI · OpenAI Agents SDK · Anthropic SDK · MCP servers.

See docs/bring-your-own-agent.md for full integration guides.

See the REST API reference for the endpoint guide and OpenAPI spec.

The screens

Fourteen screens across the five domains:

Screen	Path	Status	What lives here
Dashboard	`/dashboard`	✅ v0.3	Project overview — trace count, pass rate, active evaluators, invariant count, last 5 traces, and quick links to every page. Navigated to automatically when switching projects.
Live Console	`/live`	✅ v0.1	Real-time stream of incoming traces. When a project is selected, only that project's traces appear (filtered client-side via `service_id` without reconnecting). Click any trace to open the Trajectory Inspector.
Trajectory Inspector	`/trace/[id]`	✅ v0.1	Span-by-span breakdown of a single run. Domain tags, latency budget, judge scores (pass/warn/fail). Mutate input and replay to compare verdicts. Save any trace as a regression test.
Policy Lab	`/policy-lab`	✅ v0.1	Adversarial dataset results grid. Verdict badges show judge verdicts for each trace. Click to inspect.
Action Sandbox	`/action-sandbox`	✅ v0.1	Contract tests for tool calls. Tool Call Accuracy / F1 scoring, precision, recall, TP/FP/FN counts.
Scorecard	`/scorecard`	✅ v0.1	Five-domain coverage view for your system. Per-domain pass rates and avg scores with per-evaluator stacked bars per service.
Runs	`/runs`	✅ v0.1	Compare baseline vs current run over N executions. Statistical deltas, regression/improvement counts, pass rate drift trend chart.
Invariants	`/invariants`	✅ v0.3	Full CRUD for journey invariants — create, edit, and delete rules from the UI. Required spans, forbidden spans, timing budgets. Dry-run against any trace before saving. Live violation feed.
Memory Probe	`/memory-probe`	✅ v0.2	Multi-turn session explorer. Test multi-turn recall, stale context (with age-in-hours), cross-user leakage. Deterministic evaluators, session timeline with per-turn memory eval badges.
Synthetic Users	`/synthetic-users`	✅ v0.2	Fairness testing across 14 personas (language, age, tech literacy, intent). Per-run agent decisions visible. Pass rate breakdown by intent, language, and tech literacy. Bias alerts.
Regressions	`/regressions`	✅ v0.2	Saved regression test cases captured from real traces. Run checks (span presence, duration, per-domain score gates), view check history, delete stale cases.
Evaluators	`/evaluators`	✅ v0.3	Toggle built-in evaluators on/off per project. Write declarative custom rules (span presence, attribute equality/regex, numeric thresholds, ordering, count) as JSON. No code execution.
Connect	`/connect`	✅ v0.3	Record OTLP/SDK connection metadata for a project. Displays setup instructions for each connection type.
Settings	`/settings`	✅ v0.3	API key management (hosted mode). Per-project LLM judge system prompt editor — override and reset to default. Projects panel — rename, delete, and configure pass-rate alert webhooks per project.
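
The Action Sandbox row above mentions Tool Call Accuracy / F1 with precision, recall, and TP/FP/FN counts. A minimal sketch of that arithmetic follows, assuming tool calls are compared as unordered multisets of tool names. The project's actual matching (for example, whether arguments or ordering count) may differ, and the function and tool names are hypothetical.

```python
from collections import Counter


def tool_call_f1(expected, actual):
    """Score one run: compare expected vs. actual tool calls as multisets."""
    exp, act = Counter(expected), Counter(actual)
    tp = sum((exp & act).values())   # calls that were expected and made
    fp = sum((act - exp).values())   # calls made but not expected
    fn = sum((exp - act).values())   # calls expected but not made
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def mean_f1(runs):
    """Average F1 over N (expected, actual) runs."""
    scores = [tool_call_f1(expected, actual)["f1"] for expected, actual in runs]
    return sum(scores) / len(scores) if scores else 0.0


expected = ["lookup_order", "check_policy", "issue_refund"]
actual = ["lookup_order", "issue_refund", "send_email"]
print(tool_call_f1(expected, actual))
```

Here two of three expected calls were made and one unexpected call appeared, so precision, recall, and F1 all come out to 2/3.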

CLI

# CI/CD quality gate — compare baseline vs current run (exits 1 on regression)
agenteval-live gate --baseline production --current $GITHUB_SHA

# Project pass-rate check — assert a project is above a pass-rate threshold
# exits 0 (pass), 1 (fail), or 2 (API error / project not found)
agenteval-live check --project refund_bot --threshold 90
agenteval-live check --project refund_bot --threshold 95 --last 100

Both commands are designed for CI/CD — the exit code reflects the result so you can gate merges or deployments on agent quality.
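
In a Python-based pipeline step, the `check` command could be wrapped as sketched below. The flags and exit codes are the ones shown above. The helper names are hypothetical, and `run_check` assumes the `agenteval-live` executable is on `PATH`.

```python
import subprocess

# Exit codes documented for `agenteval-live check`.
EXIT_MEANING = {0: "pass", 1: "fail", 2: "error"}


def build_check_command(project, threshold, last=None):
    """Assemble the argument list for a project pass-rate check."""
    cmd = ["agenteval-live", "check",
           "--project", project, "--threshold", str(threshold)]
    if last is not None:
        cmd += ["--last", str(last)]
    return cmd


def interpret(returncode):
    """Map a process exit code to a short status label."""
    return EXIT_MEANING.get(returncode, "unknown")


def run_check(project, threshold, last=None):
    """Run the CLI and return 'pass', 'fail', 'error', or 'unknown'."""
    result = subprocess.run(build_check_command(project, threshold, last))
    return interpret(result.returncode)


print(build_check_command("refund_bot", 95, last=100))
# In CI: status = run_check("refund_bot", 95, last=100)
```

Keeping "fail" and "error" distinct lets a pipeline block a release on a real quality drop while retrying or alerting on an unreachable API.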

Architecture

Your agents              agenteval-live
────────────             ──────────────────
┌──────────┐ OTLP/HTTP   ┌──────────────┐
│ agent A  │────────────▶│  Collector   │──┐
└──────────┘             │  + Bridge    │  │
┌──────────┐ OTLP/HTTP   └──────────────┘  │
│ agent B  │────────────────────────────────┤
└──────────┘                                ▼
                                    ┌──────────────┐
                                    │  WebSocket   │
                                    │     Hub      │
                                    └──────┬───────┘
                                           │
┌──────────┐    ┌──────────────────────────▼───────────┐
│ Postgres │◀───┤              FastAPI                  │
└──────────┘    │  ┌─────────┬──────────┬───────────┐   │
                │  │ Policy  │  Action  │  Memory   │   │
┌──────────┐    │  │  Eval   │   Eval   │   Eval    │   │
│   LLM    │◀───┤  └─────────┴──────────┴───────────┘   │
│  judge   │    │  ┌─────────────────────────────────┐  │
└──────────┘    │  │  Invariants · Scorecard · Diff  │  │
                │  └─────────────────────────────────┘  │
                └────────────────────┬──────────────────┘
                                     │ REST + WS
                                     ▼
                           ┌──────────────────┐
                           │  Next.js UI      │
                           │  (live + replay) │
                           └──────────────────┘

See docs/architecture.md for the full design.

Roadmap

v0.1 (complete)
✅ Live ingest, trajectory inspector with domain tags, policy lab with adversarial cases, LLM-judge with verdicts, invariants engine, action sandbox, scorecard, run comparison, CLI gate, trace-to-test workflow, drift detection, Playwright smoke tests, visual polish.
v0.2 (complete)
✅ Trace mutation & replay engine.
✅ Memory Probe (multi-turn recall, stale context, cross-user leakage).
✅ Research crew mock (Domain 2 reasoning & orchestration).
✅ Why Judge (cross-domain root cause analyzer).
✅ Hosted mode with API key auth.
✅ Synthetic users (14 personas, fairness breakdown, bias alerts).
✅ Per-evaluator scorecard breakdown.
✅ Regressions (save, check, check history, list UI).
✅ Help page with full evaluator reference.
v0.3 (in progress)
✅ Invariants CRUD from the UI with dry-run.
✅ Custom declarative evaluators (6 rule types, per-project toggle).
✅ Editable LLM judge prompts per project.
✅ Agent connection registry.
✅ Project scoping: filter all views by project, rename/delete projects, pass-rate alert webhooks, CLI check subcommand.
✅ Project dashboard: per-project overview page with stats, recent traces, and quick links; project switcher navigates there on change.
✅ OTLP multi-trace grouping fix: exports with N distinct trace IDs now create N Trace rows with correctly partitioned spans, parent-child multi-turn links resolved in one pass, idempotent re-delivery.
✅ Judge budget & sampling: configurable sample rate (error/slow traces always judged), bounded 8-worker queue, per-(tenant, service) rolling-window call and token caps, Trace.judged / Trace.judge_skipped_reason visibility in Scorecard. Default behavior unchanged.
Remaining: Anthropic SDK adapter (patch()/unpatch()), export/reporting (PDF/CSV scorecard), per-evaluator CI gates.
v1.0 — RBAC, multi-org billing, advanced audit logging.
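
The OTLP multi-trace grouping behavior listed under v0.3 can be pictured with a short sketch: spans from one export are partitioned by trace ID, and a span delivered twice is kept once. The dictionary keys and the `group_spans` function are hypothetical and stand in for whatever the ingest path actually uses.

```python
from collections import defaultdict


def group_spans(spans):
    """Partition spans by trace_id, keeping one copy per (trace_id, span_id)."""
    traces = defaultdict(dict)
    for span in spans:
        # setdefault keeps the first copy, so re-delivered spans are ignored.
        traces[span["trace_id"]].setdefault(span["span_id"], span)
    return {trace_id: list(by_id.values()) for trace_id, by_id in traces.items()}


export = [
    {"trace_id": "t1", "span_id": "a", "name": "plan.route"},
    {"trace_id": "t2", "span_id": "b", "name": "retrieve.kb_search"},
    {"trace_id": "t1", "span_id": "c", "name": "tool_call.issue_refund"},
    {"trace_id": "t1", "span_id": "a", "name": "plan.route"},  # re-delivered
]
grouped = group_spans(export)
print({trace_id: len(spans) for trace_id, spans in grouped.items()})
```

The export above contains two distinct trace IDs, so it yields two groups: `t1` with two spans and `t2` with one.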

The architecture is multi-tenant by design from day one (every record carries a tenant_id), so the path from self-hosted to hosted is incremental, not a rewrite.

Project status

Latest (2026-06-03): Two changes shipped. First, the OTLP multi-trace grouping fix: exports with N trace IDs now create N Trace rows, parent-child links resolve in one pass, and re-delivery is idempotent. Second, judge budget & sampling: a configurable sample rate, an 8-worker bounded queue, per-tenant/service rolling-window call and token caps, and Trace.judged / Trace.judge_skipped_reason visibility in Scorecard. Default behavior is unchanged: all traces are still judged unless configured otherwise. 227/227 backend tests passing.
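
A sampling decision of the kind described above might look like the following sketch: error and slow traces are always judged, and the rest are sampled deterministically by hashing the trace ID. The function name, the `slow_ms` threshold, and the `"sampled_out"` reason string are assumptions for illustration, not the project's actual values.

```python
import hashlib


def should_judge(trace_id, has_error, duration_ms, sample_rate=1.0, slow_ms=5000):
    """Return (judged, skipped_reason) for one trace."""
    # Error and slow traces bypass sampling entirely.
    if has_error or duration_ms >= slow_ms:
        return True, None
    # Hash the trace ID into [0, 1) so the same trace always gets the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < sample_rate:
        return True, None
    return False, "sampled_out"  # hypothetical reason value


print(should_judge("trace-001", has_error=False, duration_ms=120))
print(should_judge("trace-001", has_error=False, duration_ms=120, sample_rate=0.0))
print(should_judge("trace-001", has_error=True, duration_ms=120, sample_rate=0.0))
```

With the default rate of 1.0 every trace is judged, which matches the unchanged default behavior. Hashing rather than calling a random generator keeps re-delivered traces from flipping between judged and skipped.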

Contributing

Issues and PRs welcome. See CONTRIBUTING.md.
