agenteval-live

A QA workbench for agentic AI. It turns live agent traces into a workflow QA teams already know — triage → reproduce → mutate → fix → regression — and puts statistical quality gates in front of every release.

License: MIT · Status: alpha


What is this?

agenteval-live is an open-source QA tool built specifically for testing AI agents — the kind that plan, call tools, hold memory, and act on the outside world. It treats traces and sessions as test evidence, not just observability data, and gives QA, QE, and SDET teams a workflow they recognize: triage → reproduce → mutate → fix → regression.

It organizes agent QA into five domains, so coverage is something you can see and reason about rather than guess at:

| # | Domain | Question |
|---|--------|----------|
| 1 | Policy Boundary | Can the system do this? |
| 2 | Reasoning & Orchestration | Did it choose a reasonable path? |
| 3 | Memory & Context | Did it use the right context? |
| 4 | Action Layer | Did it act correctly on the world? |
| 5 | Telemetry & Continuous Eval | How do we know what really happened? |

For each domain you get risks, evidence, oracles, roles, and tools — turned into actual screens, not slideware.


What you can do with it

  • 📡 Stream agent traces in real time via OpenTelemetry — point your agents at the collector and watch them run.
  • 🔍 Inspect every trajectory span-by-span, with domain-tagged annotations and timing budgets.
  • 🔄 Mutate and replay traces — edit user input and re-run the judge to see how verdicts change without re-invoking the agent.
  • 🛡️ Run adversarial datasets against any deployed agent — jailbreaks, prompt injection, PII leakage suites.
  • 🧪 Contract-test tool calls with Tool Call Accuracy / F1 over N runs.
  • 📏 Define and manage journey invariants from the UI — required spans, forbidden spans, timing budgets — and dry-run them against any trace before saving.
  • 🩺 Find root causes across domains with Why Judge — trace a failure back to the domain and span that caused it.
  • 🔁 Convert any failed trace into a regression test with one click.
  • 📊 Compare two versions of an agent (prompt / model / tool def) over N runs with statistical thresholds, not single-run asserts.
  • Wire quality gates into CI/CD via the included CLI.
  • 🎭 Test for fairness across 14 synthetic-user personas spanning language, age, tech-literacy, and intent.
  • 📋 Score your team's coverage of all five domains in a single view.
  • ⚙️ Configure evaluators per project — toggle built-in evaluators on/off and write declarative custom rules (span presence, attribute matching, thresholds, ordering).
  • ✏️ Customize the LLM judge prompt per project — override the system prompt and reset to default from the Settings page.
  • 🗂️ Organize by project — filter every view (Live Console, Policy Lab, Traces) by project. Rename or delete projects and manage pass-rate alert webhooks from Settings.
  • 🏠 Project dashboard — land on a per-project overview when you switch projects: trace count, pass rate, active evaluators, invariant count, last 5 traces, and quick links to every page.
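
To make the Tool Call Accuracy / F1 item above concrete, here is a minimal sketch of how precision, recall, and F1 could be computed for one run. It assumes a tool call is identified by a hashable tuple (name plus arguments) and that expected and actual calls are compared as multisets. The helper name `tool_call_f1` and that matching rule are illustrative assumptions, not the project's evaluator.

```python
from collections import Counter

def tool_call_f1(expected, actual):
    """Score actual tool calls against expected ones (hypothetical helper).

    Each call is a hashable tuple such as (tool_name, order_id).
    """
    exp, act = Counter(expected), Counter(actual)
    tp = sum((exp & act).values())  # calls present in both lists
    fp = sum((act - exp).values())  # extra calls the agent made
    fn = sum((exp - act).values())  # expected calls the agent missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Illustrative data: one correct call, one missed call, one unexpected call.
expected = [("lookup_order", "A1"), ("issue_refund", "A1")]
actual = [("lookup_order", "A1"), ("send_email", "A1")]
scores = tool_call_f1(expected, actual)
print(scores)
```

Averaging these per-run scores over N runs would give the kind of aggregate the Action Sandbox reports, though the exact aggregation used there is not specified here.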

Five runnable mock agents ship out of the box, so you can start testing without writing an agent first.


Quickstart

Docker (recommended)

git clone https://github.com/ylaufer/agenteval-live.git
cd agenteval-live

# 1. Configure — add your ANTHROPIC_API_KEY for LLM-as-judge
cp .env.example .env
# edit .env

# 2. Build the frontend, then start all services
pnpm install
cd apps/web && pnpm build && cd ../..
docker compose build
docker compose up -d

# 3. Open the UI
open http://localhost:3000        # macOS
# xdg-open http://localhost:3000  # Linux
# start http://localhost:3000     # Windows

The OTLP collector listens on localhost:4318 (HTTP) and 4317 (gRPC). Point any OpenTelemetry-instrumented agent there and you're done.
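
For an agent whose OpenTelemetry SDK reads the standard exporter environment variables, pointing it at the collector can be as small as setting three of them. The sketch below builds that environment in Python. `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_PROTOCOL`, and `OTEL_EXPORTER_OTLP_ENDPOINT` are standard OpenTelemetry variables; the helper `otlp_env` and the script name `my_agent.py` are hypothetical, and whether your agent honors these variables depends on how its exporter is configured.

```python
import os

# Ports taken from the paragraph above: 4318 for OTLP/HTTP, 4317 for OTLP/gRPC.
PORTS = {"http/protobuf": 4318, "grpc": 4317}

def otlp_env(service_name, protocol="http/protobuf", host="localhost"):
    """Return env vars that direct an OTel exporter at the local collector."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{PORTS[protocol]}",
    }

# Merge with the current environment before launching the agent process.
env = {**os.environ, **otlp_env("my-support-agent")}
# Example launch (hypothetical script name):
# subprocess.run(["python", "my_agent.py"], env=env, check=True)
```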

Self-hosted deployment security: By default the ingest endpoint (POST /v1/traces) accepts any request. This is fine for local development, where the API binds to 127.0.0.1. To expose it on the network you must configure one of the following:

  • INGEST_SECRET=<strong-random-value>: the API then requires X-Ingest-Secret: <value> on every write.
  • HOSTED_MODE=true: enables API key auth.
  • ALLOW_UNAUTH_NETWORK=true: an explicit opt-in that is only safe when an upstream proxy handles auth.

The startup validator refuses to launch if BIND_HOST is non-loopback and none of the three is configured.
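
The startup rule described above can be sketched as a small pure function. This is an illustration of the stated rule, not the project's validator: the function names are hypothetical, and it assumes BIND_HOST defaults to 127.0.0.1 and that the boolean settings are spelled `true`.

```python
import ipaddress

def is_loopback(host):
    """True for localhost and loopback IPs such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # other hostnames are treated as network-exposed

def validate_bind(env):
    """Return None if startup may proceed, otherwise an error message."""
    host = env.get("BIND_HOST", "127.0.0.1")  # assumed default
    if is_loopback(host):
        return None
    if env.get("INGEST_SECRET"):
        return None  # writes must carry X-Ingest-Secret
    if env.get("HOSTED_MODE", "").lower() == "true":
        return None  # API key auth
    if env.get("ALLOW_UNAUTH_NETWORK", "").lower() == "true":
        return None  # explicit opt-in, auth handled upstream
    return f"BIND_HOST={host} is not loopback and no ingest auth option is set"

print(validate_bind({"BIND_HOST": "0.0.0.0"}))
print(validate_bind({"BIND_HOST": "0.0.0.0", "INGEST_SECRET": "s3cret"}))
```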

Why the explicit frontend build? The web Dockerfile copies pre-built .next/ artifacts rather than running next build itself. Re-run the build step (below) whenever you change web code or add a frontend dependency.

Updating after frontend changes

pnpm install                          # sync lockfile if package.json changed
cd apps/web && pnpm build && cd ../.. # rebuild .next/ artifacts
docker compose build web              # bake artifacts into the image
docker compose up -d web              # restart the container

Local development (no Docker)

# 1. Create and activate a virtualenv
python -m venv .venv
source .venv/bin/activate           # macOS / Linux / Git Bash
# .venv\Scripts\activate            # Windows PowerShell / cmd

# 2. Install Python dependencies
pip install --upgrade pip
pip install -e ".[dev]"

# 3. Install frontend deps
cd apps/web && pnpm install && cd ../..

# 4. Apply migrations and run dev servers
make migrate
make dev

Running the bundled mocks

Five deterministic mock agents ship with realistic data — products, order IDs, medical terminology, and multi-turn scenarios spanning all five QA domains:

source .venv/bin/activate

# Domain 1: Refund Bot (policy boundary + guardrails)
python -m mocks.refund_bot.agent --case happy              # ✅ pass
python -m mocks.refund_bot.agent --case cross_tenant       # ✅ pass (with recovery)
python -m mocks.refund_bot.agent --case jailbreak_basic    # ✅ pass (blocked cleanly, injection_attempt_v1 → pass)
python -m mocks.refund_bot.agent --all-happy               # 8 cases

# Domain 1: Healthcare Triage (policy boundary + scope)
python -m mocks.healthcare_triage.agent --case intake_clean_es         # ✅ pass (Spanish)
python -m mocks.healthcare_triage.agent --case oos_diagnosis           # ✅ pass (denied, in scope)
python -m mocks.healthcare_triage.agent --case injection_role_override # ❌ fail (blocked)
python -m mocks.healthcare_triage.agent --all                          # 10 cases

# Domain 3: Memory Probe (multi-turn, freshness, cross-user)
python -m mocks.memory_probe.agent --case mt-recall-001   # ✅ pass (multi-turn recall)
python -m mocks.memory_probe.agent --case stale-001       # ❌ fail (doc 72h old)
python -m mocks.memory_probe.agent --case leak-001        # ❌ fail (cross-user retrieval)
python -m mocks.memory_probe.agent --all-pass             # 3 pass cases

# Domain 2: Research Crew (reasoning, multi-agent coordination)
python -m mocks.research_crew.agent --case happy_market_analysis   # ✅ pass
python -m mocks.research_crew.agent --case shallow_analysis_trend  # ❌ fail (incomplete analysis)
python -m mocks.research_crew.agent --all-happy                    # 5 happy + recovery cases

# Fairness Testing: Synthetic Users (14 personas across language/age/tech/intent)
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --all-personas
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --genuine-only
python -m mocks.synthetic_users.agent --service refund_bot --scenario refund_basic --adversarial-only

Then watch the Live Console at http://localhost:3000/live. After ~5 seconds, each trace is scored by the LLM judge and verdicts appear in Policy Lab (/policy-lab) and in the Trajectory Inspector (/trace/{id}).


Bring your own agent

The collector speaks standard OTLP. Any agent already instrumented with OpenTelemetry will work — Python, TypeScript, Go, Java, anything.

If you're starting from zero, the included Python SDK is the lowest-friction path:

from agenteval_live.sdk import instrument

instrument(
    service_name="my-support-agent",
    endpoint="http://localhost:4318",
    domain_hints={
        "tool_call.*": "action",
        "guardrail.*": "policy",
        "retrieve.*":  "memory",
        "plan.*":      "reasoning",
    },
)

# Your agent code stays unchanged. Spans are emitted automatically.
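
To show what a domain_hints mapping like the one above might do, here is a minimal sketch that resolves a span name to a domain. It assumes the patterns are shell-style globs matched in insertion order. The SDK's actual matching semantics are not documented here, so treat `domain_for` and the glob interpretation as assumptions.

```python
from fnmatch import fnmatchcase

DOMAIN_HINTS = {
    "tool_call.*": "action",
    "guardrail.*": "policy",
    "retrieve.*": "memory",
    "plan.*": "reasoning",
}

def domain_for(span_name, hints=DOMAIN_HINTS, default=None):
    """Return the domain of the first glob pattern that matches span_name."""
    for pattern, domain in hints.items():
        if fnmatchcase(span_name, pattern):
            return domain
    return default

# Span names below are made up for illustration.
print(domain_for("tool_call.issue_refund"))
print(domain_for("guardrail.pii_filter"))
print(domain_for("llm.chat"))
```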

Adapters are provided for: LangChain · LangGraph · CrewAI · OpenAI Agents SDK · Anthropic SDK · MCP servers.

See docs/bring-your-own-agent.md for full integration guides.


The screens

Fourteen screens across the five domains:

| Screen | Path | Status | What lives here |
|--------|------|--------|-----------------|
| Dashboard | /dashboard | ✅ v0.3 | Project overview: trace count, pass rate, active evaluators, invariant count, last 5 traces, and quick links to every page. Navigated to automatically when switching projects. |
| Live Console | /live | ✅ v0.1 | Real-time stream of incoming traces. When a project is selected, only that project's traces appear (filtered client-side via service_id without reconnecting). Click any trace to open the Trajectory Inspector. |
| Trajectory Inspector | /trace/[id] | ✅ v0.1 | Span-by-span breakdown of a single run. Domain tags, latency budget, judge scores (pass/warn/fail). Mutate input and replay to compare verdicts. Save any trace as a regression test. |
| Policy Lab | /policy-lab | ✅ v0.1 | Adversarial dataset results grid. Verdict badges show judge verdicts for each trace. Click to inspect. |
| Action Sandbox | /action-sandbox | ✅ v0.1 | Contract tests for tool calls. Tool Call Accuracy / F1 scoring, precision, recall, TP/FP/FN counts. |
| Scorecard | /scorecard | ✅ v0.1 | Five-domain coverage view for your system. Per-domain pass rates and avg scores with per-evaluator stacked bars per service. |
| Runs | /runs | ✅ v0.1 | Compare baseline vs current run over N executions. Statistical deltas, regression/improvement counts, pass rate drift trend chart. |
| Invariants | /invariants | ✅ v0.3 | Full CRUD for journey invariants: create, edit, and delete rules from the UI. Required spans, forbidden spans, timing budgets. Dry-run against any trace before saving. Live violation feed. |
| Memory Probe | /memory-probe | ✅ v0.2 | Multi-turn session explorer. Test multi-turn recall, stale context (with age-in-hours), cross-user leakage. Deterministic evaluators, session timeline with per-turn memory eval badges. |
| Synthetic Users | /synthetic-users | ✅ v0.2 | Fairness testing across 14 personas (language, age, tech literacy, intent). Per-run agent decisions visible. Pass rate breakdown by intent, language, and tech literacy. Bias alerts. |
| Regressions | /regressions | ✅ v0.2 | Saved regression test cases captured from real traces. Run checks (span presence, duration, per-domain score gates), view check history, delete stale cases. |
| Evaluators | /evaluators | ✅ v0.3 | Toggle built-in evaluators on/off per project. Write declarative custom rules (span presence, attribute equality/regex, numeric thresholds, ordering, count) as JSON. No code execution. |
| Connect | /connect | ✅ v0.3 | Record OTLP/SDK connection metadata for a project. Displays setup instructions for each connection type. |
| Settings | /settings | ✅ v0.3 | API key management (hosted mode). Per-project LLM judge system prompt editor: override and reset to default. Projects panel: rename, delete, and configure pass-rate alert webhooks per project. |
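
The three invariant kinds listed for the Invariants screen (required spans, forbidden spans, timing budgets) can be illustrated with a short dry-run sketch. The span dictionary shape (`name`, `duration_ms`), the function `check_invariant`, and the per-span budget format are assumptions made for the example. The project's stored rule schema may differ.

```python
def check_invariant(spans, required=(), forbidden=(), budgets_ms=None):
    """Dry-run one invariant against a trace and return a list of violations."""
    names = {s["name"] for s in spans}
    violations = []
    for name in required:
        if name not in names:
            violations.append(f"missing required span: {name}")
    for name in forbidden:
        if name in names:
            violations.append(f"forbidden span present: {name}")
    for s in spans:
        budget = (budgets_ms or {}).get(s["name"])
        if budget is not None and s["duration_ms"] > budget:
            violations.append(
                f"timing budget exceeded: {s['name']} took "
                f"{s['duration_ms']}ms (budget {budget}ms)"
            )
    return violations

# Illustrative trace: the guardrail ran, but the refund call was slow.
trace = [
    {"name": "guardrail.check", "duration_ms": 12},
    {"name": "tool_call.issue_refund", "duration_ms": 950},
]
violations = check_invariant(
    trace,
    required=["guardrail.check"],
    forbidden=["tool_call.delete_account"],
    budgets_ms={"tool_call.issue_refund": 500},
)
print(violations)
```

An empty list means the trace satisfies the invariant, which is the outcome a dry-run would surface before the rule is saved.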

CLI

# CI/CD quality gate — compare baseline vs current run (exits 1 on regression)
agenteval-live gate --baseline production --current $GITHUB_SHA

# Project pass-rate check — assert a project is above a pass-rate threshold
# exits 0 (pass), 1 (fail), or 2 (API error / project not found)
agenteval-live check --project refund_bot --threshold 90
agenteval-live check --project refund_bot --threshold 95 --last 100

Both commands are designed for CI/CD — the exit code reflects the result so you can gate merges or deployments on agent quality.
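
The exit-code contract above can be sketched in a few lines. This is not the CLI's implementation. The pass-rate check assumes "above the threshold" means greater than or equal, and that an empty window counts as a failure. The regression gate uses a one-sided two-proportion z-test with alpha 0.05, which is one reasonable choice of statistical threshold rather than the test the tool is known to use. All function names are hypothetical.

```python
import math

def check_exit_code(verdicts, threshold_pct, last=None):
    """verdicts: booleans (True = pass), oldest first; None = project not found."""
    if verdicts is None:
        return 2  # API error / project not found
    window = verdicts[-last:] if last else verdicts
    if not window:
        return 1  # assumption: no traces cannot satisfy the threshold
    rate = 100.0 * sum(window) / len(window)
    return 0 if rate >= threshold_pct else 1

def pass_rate_drop_p_value(base_pass, base_n, cur_pass, cur_n):
    """One-sided p-value for 'current pass rate is lower than baseline'."""
    pooled = (base_pass + cur_pass) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    if se == 0:
        return 1.0  # both runs all-pass or all-fail: no evidence of a drop
    z = (base_pass / base_n - cur_pass / cur_n) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z), standard normal

def gate_exit_code(base_pass, base_n, cur_pass, cur_n, alpha=0.05):
    """Return 1 when the drop is statistically significant, else 0."""
    p = pass_rate_drop_p_value(base_pass, base_n, cur_pass, cur_n)
    return 1 if p < alpha else 0

print(check_exit_code([True] * 9 + [False], 90))  # 90% pass rate vs 90 threshold
print(gate_exit_code(95, 100, 80, 100))           # 95% -> 80% over 100 runs each
print(gate_exit_code(95, 100, 94, 100))           # 95% -> 94% over 100 runs each
```

Gating on a significance test rather than a raw difference keeps a one-run wobble from failing the build while still catching a real drop.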


Architecture

Your agents              agenteval-live
────────────             ──────────────────
┌──────────┐ OTLP/HTTP   ┌──────────────┐
│ agent A  │────────────▶│  Collector   │──┐
└──────────┘             │  + Bridge    │  │
┌──────────┐ OTLP/HTTP   └──────────────┘  │
│ agent B  │────────────────────────────────┤
└──────────┘                                ▼
                                    ┌──────────────┐
                                    │  WebSocket   │
                                    │     Hub      │
                                    └──────┬───────┘
                                           │
┌──────────┐    ┌──────────────────────────▼───────────┐
│ Postgres │◀───┤              FastAPI                  │
└──────────┘    │  ┌─────────┬──────────┬───────────┐   │
                │  │ Policy  │  Action  │  Memory   │   │
┌──────────┐    │  │  Eval   │   Eval   │   Eval    │   │
│   LLM    │◀───┤  └─────────┴──────────┴───────────┘   │
│  judge   │    │  ┌─────────────────────────────────┐  │
└──────────┘    │  │  Invariants · Scorecard · Diff  │  │
                │  └─────────────────────────────────┘  │
                └────────────────────┬──────────────────┘
                                     │ REST + WS
                                     ▼
                           ┌──────────────────┐
                           │  Next.js UI      │
                           │  (live + replay) │
                           └──────────────────┘

See docs/architecture.md for the full design.


Roadmap

  • v0.1 (complete) — ✅ Live ingest, trajectory inspector with domain tags, policy lab with adversarial cases, LLM-judge with verdicts, invariants engine, action sandbox, scorecard, run comparison, CLI gate, trace-to-test workflow, drift detection, Playwright smoke tests, visual polish.
  • v0.2 (complete) — ✅ Trace mutation & replay engine. ✅ Memory Probe (multi-turn recall, stale context, cross-user leakage). ✅ Research crew mock (Domain 2 reasoning & orchestration). ✅ Why Judge (cross-domain root cause analyzer). ✅ Hosted mode with API key auth. ✅ Synthetic users (14 personas, fairness breakdown, bias alerts). ✅ Per-evaluator scorecard breakdown. ✅ Regressions (save, check, check history, list UI). ✅ Help page with full evaluator reference.
  • v0.3 (in progress) — ✅ Invariants CRUD from the UI with dry-run. ✅ Custom declarative evaluators (6 rule types, per-project toggle). ✅ Editable LLM judge prompts per project. ✅ Agent connection registry. ✅ Project scoping — filter all views by project, rename/delete projects, pass-rate alert webhooks, CLI check subcommand. ✅ Project dashboard — per-project overview page with stats, recent traces, and quick links; project switcher navigates there on change. ✅ OTLP multi-trace grouping fix — exports with N distinct trace IDs now create N Trace rows with correctly partitioned spans, parent-child multi-turn links resolved in one pass, idempotent re-delivery. ✅ Judge budget & sampling — configurable sample rate (error/slow traces always judged), bounded 8-worker queue, per-(tenant, service) rolling-window call and token caps, Trace.judged / Trace.judge_skipped_reason visibility in Scorecard. Default behavior unchanged. Remaining: Anthropic SDK adapter (patch()/unpatch()), export/reporting (PDF/CSV scorecard), per-evaluator CI gates.
  • v1.0 — RBAC, multi-org billing, advanced audit logging.
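
The multi-trace grouping behavior named in the v0.3 item (N distinct trace IDs produce N traces, and re-delivery is idempotent) can be illustrated with a minimal in-memory sketch. The span dictionaries and the function `group_spans_by_trace` are hypothetical. The real ingest path writes Trace rows to Postgres and is not shown here.

```python
from collections import defaultdict

def group_spans_by_trace(spans):
    """Partition a flat export into {trace_id: [spans]}.

    Keying by span_id inside each trace means a re-delivered span
    overwrites its earlier copy instead of being duplicated.
    """
    grouped = defaultdict(dict)
    for span in spans:
        grouped[span["trace_id"]][span["span_id"]] = span
    return {tid: list(by_id.values()) for tid, by_id in grouped.items()}

# One export carrying two traces, with span "s1" delivered twice.
export = [
    {"trace_id": "t1", "span_id": "s1", "name": "plan.route"},
    {"trace_id": "t2", "span_id": "s3", "name": "retrieve.docs"},
    {"trace_id": "t1", "span_id": "s2", "name": "tool_call.issue_refund"},
    {"trace_id": "t1", "span_id": "s1", "name": "plan.route"},
]
traces = group_spans_by_trace(export)
print({tid: len(spans) for tid, spans in traces.items()})
```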

The architecture is multi-tenant by design from day one (every record carries a tenant_id), so the path from self-hosted to hosted is incremental, not a rewrite.


Project status

Latest (2026-06-03): two changes shipped. The OTLP multi-trace grouping fix makes exports with N trace IDs create N Trace rows, resolves parent-child links in one pass, and makes re-delivery idempotent. Judge budget & sampling adds a configurable sample rate, an 8-worker bounded queue, per-tenant/service rolling-window call and token caps, and Trace.judged / Trace.judge_skipped_reason visibility in Scorecard. Default behavior is unchanged: all traces are still judged unless configured otherwise. 227/227 backend tests passing.
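
As a rough illustration of the sampling rule described above, the sketch below decides whether a trace is sent to the judge. Error and slow traces are always judged and the rest are sampled at a configurable rate. The trace fields, the `slow_ms` threshold, and the `"sampled_out"` reason string are assumptions made for the example, and the budget caps and worker queue are left out.

```python
import random

def should_judge(trace, sample_rate, slow_ms, rng=random.random):
    """Return (judged, skipped_reason) for one trace."""
    if trace.get("error"):
        return True, None  # error traces are always judged
    if trace.get("duration_ms", 0) >= slow_ms:
        return True, None  # slow traces are always judged
    if rng() < sample_rate:
        return True, None  # sampled in
    return False, "sampled_out"  # hypothetical reason value

# With sample_rate=1.0 every trace is judged, since rng() is always below 1.0.
print(should_judge({"duration_ms": 40}, sample_rate=1.0, slow_ms=5000))
print(should_judge({"error": True}, sample_rate=0.0, slow_ms=5000))
```

Passing `rng` as a parameter keeps the decision deterministic in tests.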


Contributing

Issues and PRs welcome. See CONTRIBUTING.md.


License

MIT · Copyright (c) 2026 Yanina Laufer

About

AgentEval Live: a QA control plane for testing, evaluating, and governing AI agents from trace to release gate.
