agent-eval helps engineers and reviewers turn local coding-agent transcripts into deterministic, public-safe HTML reports without uploading raw session logs or relying on anecdotal scoring.
This screenshot shows the generated static HTML dashboard: navigation pills, metric cards, and verification status styling rendered from the same deterministic artifacts that the CLI writes to disk. The same report continues with SVG charts, exemplar sessions, review queue, attribution, and methodology sections. The preview image was generated from checked-in synthetic Codex fixtures, not from private local transcripts.
The HTML report helps readers see:
- how many sessions and write sessions were analyzed
- how often write sessions ended verified or unverified
- provider, harness, tool-family, and attribution mixes
- exemplar sessions worth learning from
- review-queue sessions worth inspecting
It does not claim anything about your private transcripts until you run `inspect` or `eval` against your own local agent home.
- Engineering leads adopting coding agents who need repeatable usage evidence.
- Developers who want to review their own Codex, Claude Code, pi, or opencode sessions.
- Evaluators and portfolio reviewers who want proof beyond screenshots or informal impressions.
- Maintainers who need static artifacts that can be inspected without a hosted service.
Developer-agent work leaves useful evidence in local transcripts, but that evidence is hard to compare:
- each provider stores sessions differently
- raw logs are noisy and often sensitive
- anecdotes do not show verification, friction, or review-worthy patterns
- dashboards are not trustworthy if they are detached from auditable artifacts
agent-eval keeps transcript/session artifacts as the canonical input, normalizes supported providers into one model, and emits deterministic JSON/JSONL plus static Markdown/HTML/SVG reports.
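One way to picture the "one normalized model" idea is a provider-neutral session shape. The field names below are illustrative, not the actual contract, which lives in docs/schema-v3.md:

```typescript
// Hypothetical sketch of a provider-neutral session model; the real
// schema in docs/schema-v3.md may differ in names and shape.
interface NormalizedTurn {
  role: "user" | "assistant" | "tool";
  toolFamily?: string; // e.g. "shell" or "edit" -- illustrative values
  verified?: boolean;  // whether this turn's work ended verified
}

interface NormalizedSession {
  sessionId: string;
  provider: "codex" | "claude" | "pi" | "opencode";
  startedAt: string; // ISO-8601 timestamp
  turns: NormalizedTurn[];
}

// Illustrative definition: a write session contains at least one tool turn.
function isWriteSession(session: NormalizedSession): boolean {
  return session.turns.some((t) => t.role === "tool");
}
```

Once every provider converges on a shape like this, metrics such as "write sessions that ended verified" can be computed once instead of per provider.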
| Problem | Capability | Proof |
|---|---|---|
| Agent homes are provider-specific | Discovers Codex, Claude Code, pi, and opencode transcript stores through source-aware adapters | `pnpm inspect --source opencode --home ~/.local/share/opencode` |
| Raw transcripts are too noisy to review directly | Normalizes sessions, turns, tools, labels, compliance, attribution, and session facts | `docs/schema-v3.md` |
| Teams need a quick read of usage patterns | Generates a static dashboard with overview, what worked, needs review, and why it happened | `docs/report-v3.md` |
| Public sharing can leak too much context | Uses redacted, truncated previews and local public-surface scan commands | `pnpm scan:repo` and `pnpm scan:artifacts <path>` |
| Rule changes need regression protection | Runs a synthetic calibration benchmark across expected labels, incidents, attribution, and surfaced sessions | `pnpm benchmark` |
Requirements:
- Node.js 22.16+
- pnpm 10+
This repository is GitHub-first and is not currently published to npm.
```sh
git clone https://github.com/fitchmultz/agent-eval.git
cd agent-eval
pnpm install
```

```sh
rm -rf artifacts/demo-home artifacts/demo-report
mkdir -p artifacts/demo-home/sessions
cp src/calibration/fixtures/codex-*.jsonl artifacts/demo-home/sessions/
pnpm eval --source codex --home artifacts/demo-home --output-dir artifacts/demo-report --summary-only
```

Expected result: a generated static report at `artifacts/demo-report/report.html`, plus the canonical JSON/JSONL/SVG artifact bundle in the same directory. Open that HTML file in your browser to inspect the dashboard, charts, review queue, and methodology sections.
For a faster calibration-only check that does not generate the HTML report, run `pnpm benchmark`.
Start with discovery. It inventories canonical transcripts plus optional enrichment stores when present.
```sh
pnpm inspect --source codex --home ~/.codex
pnpm inspect --source claude --home ~/.claude
pnpm inspect --source pi --home ~/.pi
pnpm inspect --source opencode --home ~/.local/share/opencode
```

Run the full deterministic pipeline and open the static report from `artifacts/report.html`:
```sh
pnpm eval --source pi --home ~/.pi --output-dir artifacts --summary-only
```

Use date filters and a time bucket when you want a bounded corpus:
```sh
pnpm eval --source pi --home ~/.pi \
  --output-dir artifacts \
  --summary-only \
  --start-date 2026-03-01 \
  --end-date 2026-03-31 \
  --time-bucket day
```

By default, CLI evaluation uses a bounded window of the 100 most recent discovered sessions for fast local feedback. Use `--all` for full-history runs; large full-history corpora use a multi-pass, memory-bounded path that preserves corpus-relative template detection without retaining every parsed transcript at once. When `--session-limit` is set, the limit applies to the most recent discovered sessions after date filtering.
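The window and date-filter behavior described above can be sketched as a pure selection step. The types and function names here are illustrative, not the project's actual code:

```typescript
interface DiscoveredSession {
  path: string;
  startedAt: string; // ISO-8601 timestamp, e.g. "2026-03-05T10:00:00Z"
}

// Illustrative sketch of the documented window: apply optional date
// bounds first, then keep the N most recent sessions (default 100).
// The real CLI's selection logic may differ in detail.
function selectSessions(
  sessions: DiscoveredSession[],
  opts: { startDate?: string; endDate?: string; sessionLimit?: number } = {},
): DiscoveredSession[] {
  const limit = opts.sessionLimit ?? 100;
  const day = (iso: string) => iso.slice(0, 10); // compare by calendar date
  return sessions
    .filter((s) => !opts.startDate || day(s.startedAt) >= opts.startDate)
    .filter((s) => !opts.endDate || day(s.startedAt) <= opts.endDate)
    .sort((a, b) => b.startedAt.localeCompare(a.startedAt)) // newest first
    .slice(0, limit);
}
```

Filtering before limiting matches the documented order: the session limit applies to the most recent sessions that survive date filtering.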
- `codex`: canonical transcripts under `~/.codex/sessions/**/*.jsonl`
- `claude`: canonical transcripts under `~/.claude/projects/**/*.jsonl`
- `pi`: canonical transcripts under `~/.pi/agent/sessions/**/*.jsonl`
- `opencode`: canonical session metadata under `~/.local/share/opencode/storage/session/**/*.json`, joined with related `storage/message` and `storage/part` records
Optional enrichment stores such as history, SQLite, shell snapshots, opencode databases/logs, and session environment files are inventoried when present, but transcript/session artifacts remain the canonical input.
```sh
pnpm inspect --source pi --home ~/.pi
pnpm inspect --source opencode --home ~/.local/share/opencode
pnpm parse --source codex --home ~/.codex --output-dir artifacts
pnpm eval --source claude --home ~/.claude --output-dir artifacts
pnpm report --source codex --home ~/.codex --output-dir artifacts
pnpm benchmark
```

Use `parse` when you only want normalized turn reconstruction; it writes `raw-turns.jsonl` and `parse-metrics.json` without scoring or report generation. Use `report` when you want the Markdown report on stdout while still writing the full evaluation artifact bundle.
Example local config:

```sh
cp .agent-evalrc.example .agent-evalrc
```

Built binary smoke path:

```sh
pnpm build
node dist/cli.js inspect --source pi --home ~/.pi
```

`eval` and `report` always emit the canonical v3 bundle:
- `metrics.json`
- `summary.json`
- `session-facts.jsonl`
- `release-manifest.json`
Every machine-readable output includes `engineVersion` and `schemaVersion`. `release-manifest.json` also records release provenance such as the current git revision (when available), dirty-worktree state, config fingerprint, evaluation parameters, and the emitted artifact inventory.
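A rough picture of the manifest fields described above. Only `engineVersion` and `schemaVersion` are named by this README; the remaining field names are guesses at how the documented provenance could look, and the authoritative contract is docs/schema-v3.md:

```typescript
// Illustrative shape only: field names beyond engineVersion and
// schemaVersion are assumptions, not the project's actual schema.
interface ReleaseManifest {
  engineVersion: string;
  schemaVersion: string;
  gitRevision?: string;                 // current git revision, when available
  dirtyWorktree?: boolean;              // uncommitted changes at evaluation time
  configFingerprint?: string;           // fingerprint of the effective config
  parameters?: Record<string, unknown>; // evaluation parameters
  artifacts?: string[];                 // emitted artifact inventory
}

// Sample value with illustrative versions.
const exampleManifest: ReleaseManifest = {
  engineVersion: "0.0.0",
  schemaVersion: "3",
  dirtyWorktree: false,
};
```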
Presentation outputs:
- `report.md`
- `report.html`
- `favicon.ico`
- `favicon.svg`
- `sessions-over-time.svg`
- `provider-share.svg`
- `harness-share.svg`
- `tool-family-share.svg`
- `attribution-mix.svg`
Full mode also emits heavier drilldown files:
- `raw-turns.jsonl`
- `incidents.jsonl`
`--summary-only` is the preferred mode for large corpora. It skips the heavier raw-turn and incident JSONL files while preserving the dashboard, session facts, release manifest, and static reports.
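The artifact split implied by `--summary-only` can be sketched as a selection over the filenames listed in this README (the selection function itself is illustrative, not the project's code):

```typescript
// Filenames are taken from this README; full mode adds the heavier
// drilldown files on top of the canonical bundle.
const CANONICAL = [
  "metrics.json",
  "summary.json",
  "session-facts.jsonl",
  "release-manifest.json",
];
const DRILLDOWN = ["raw-turns.jsonl", "incidents.jsonl"];

// Illustrative sketch of which machine-readable files each mode emits.
function emittedArtifacts(summaryOnly: boolean): string[] {
  return summaryOnly ? [...CANONICAL] : [...CANONICAL, ...DRILLDOWN];
}
```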
Read the v3 report top to bottom:
- Overview Dashboard — corpus metrics and diagnostic context.
- What Worked — exemplar sessions and learning surfaces.
- Needs Review — ranked sessions worth deeper inspection.
- Why This Happened — attribution and template-substrate context.
- Comparative Slices — deterministic slice comparisons.
- Methodology And Limitations — caveats and scope.
- Inventory — discovered local inputs.
If a presentation artifact ever disagrees with the JSON artifacts, treat the JSON artifacts as canonical.
```
source home
  -> discovery inventory
  -> source-specific parser
  -> normalized sessions + turns
  -> labels + incidents + compliance
  -> metrics + summary artifact + session facts
  -> markdown/html/svg reports
```
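The stages above compose as a chain of transforms. This skeleton uses invented types and stand-in bodies; the real orchestration lives in src/evaluator.ts:

```typescript
// Invented placeholder types: each stage consumes the previous stage's output.
type Inventory = { paths: string[] };
type Session = { id: string };
type Labeled = { session: Session; labels: string[] };
type Metrics = { sessionCount: number };

// Stand-in stage implementations, illustrative only.
const discover = (home: string): Inventory => ({ paths: [home] });
const parse = (inv: Inventory): Session[] =>
  inv.paths.map((p, i) => ({ id: `${p}#${i}` }));
const label = (sessions: Session[]): Labeled[] =>
  sessions.map((s) => ({ session: s, labels: [] }));
const aggregate = (labeled: Labeled[]): Metrics => ({
  sessionCount: labeled.length,
});

// The pipeline is a straight composition of deterministic stages.
function evaluate(home: string): Metrics {
  return aggregate(label(parse(discover(home))));
}
```

Because every stage is a deterministic function of its input, rerunning the pipeline on the same home yields byte-identical artifacts, which is what makes the reports auditable.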
Key implementation choices:
- Transcript-first: canonical analytics starts from transcript/session artifacts, not optional side stores.
- Source-aware adapters: Codex, Claude Code, pi, and opencode use separate discovery/parsing logic, then converge on one normalized session model.
- Deterministic scoring: labels, clustering, compliance scoring, summaries, and presentation artifacts are rule-based.
- Redacted previews: generated reports prefer redacted, truncated previews over full transcript bodies.
- Static export: HTML, Markdown, JSON, JSONL, and SVG outputs stay portable and dependency-light.
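The redacted-preview idea can be sketched as a small helper. The patterns and length cap below are assumptions for illustration; the project's real redaction rules are its own:

```typescript
// Illustrative preview builder: mask likely secrets, then truncate.
// The actual redaction logic in the project is richer than this sketch.
function preview(text: string, maxLen = 120): string {
  const redacted = text
    .replace(/sk-[A-Za-z0-9]{8,}/g, "[REDACTED]")       // API-key-like tokens
    .replace(/(?:password|token)=\S+/gi, "[REDACTED]"); // key=value secrets
  return redacted.length > maxLen
    ? redacted.slice(0, maxLen - 1) + "…"
    : redacted;
}
```

Redacting before truncating matters: truncation alone could still leave a full secret inside the preview window.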
| Path | Purpose |
|---|---|
| `src/cli/` | CLI command wiring and option handling |
| `src/discovery.ts` and `src/sources.ts` | Source discovery and supported-provider definitions |
| `src/transcript/` | Transcript event normalization helpers |
| `src/evaluator.ts` | Evaluation orchestration from discovery through artifact generation |
| `src/metrics-aggregation.ts` | Corpus metrics aggregation |
| `src/summary*.ts` and `src/summary/` | Canonical summary and presentation section derivation |
| `src/html-report/`, `src/report.ts`, `src/svg-charts.ts` | Static HTML, Markdown, and SVG presentation outputs |
| `src/calibration/` | Synthetic benchmark fixtures and calibration runner |
| `tests/` | Synthetic fixture-based regression tests |
| `docs/schema-v3.md` | Canonical artifact contract |
| `docs/report-v3.md` | Report product model |
| `docs/case-study.md` | Architecture and design rationale |
Baseline local gate:

```sh
make ci
```

Equivalent explicit command:

```sh
pnpm lint && pnpm typecheck && pnpm test && pnpm benchmark && pnpm scan:artifacts artifacts/benchmark && pnpm scan:repo && pnpm build && pnpm smoke:dist
```

Useful focused commands:

```sh
pnpm test
pnpm benchmark
pnpm scan:artifacts artifacts/benchmark
pnpm scan:repo
```

`make ci` does not mutate formatting, but it does generate benchmark and dist smoke artifacts as part of validation. Use `make bootstrap` for first-time dependency setup and `make fix` when you want formatting rewrites.
- Tests use synthetic fixtures only; no private transcript corpora are committed.
- Local evaluation outputs, local agent homes, and temporary analysis material stay untracked.
- Generated report previews are redacted and truncated, but they are not a substitute for full secret scanning.
- Use `pnpm scan:artifacts <path...>` before sharing generated `.json`, `.jsonl`, `.md`, `.html`, or `.svg` artifacts.
- Use `pnpm scan:repo` before publishing repo changes.
- Visual QA screenshots should stay local under ignored `notes/**/verification/` paths unless intentionally curated.
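Conceptually, the scan commands look for high-risk patterns in text artifacts before anything is shared. A minimal stand-in (not the project's actual scanner, whose rules are more thorough) might be:

```typescript
// Minimal illustrative scanner: report pattern hits with their offsets.
// The rule names and regexes here are assumptions for illustration.
const RISKY_PATTERNS: Array<[string, RegExp]> = [
  ["aws-access-key", /AKIA[0-9A-Z]{16}/g],
  ["private-key-block", /-----BEGIN [A-Z ]*PRIVATE KEY-----/g],
  ["bearer-token", /Bearer [A-Za-z0-9._-]{16,}/g],
];

function scanText(text: string): Array<{ rule: string; index: number }> {
  const hits: Array<{ rule: string; index: number }> = [];
  for (const [rule, pattern] of RISKY_PATTERNS) {
    for (const m of text.matchAll(pattern)) {
      hits.push({ rule, index: m.index ?? 0 });
    }
  }
  return hits;
}
```

A pattern scanner like this catches only known shapes; that is why the README treats redacted previews as helpful but "not a substitute for full secret scanning."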
Real today:
- local CLI for Codex, Claude Code, pi, and opencode transcript/session artifacts
- deterministic parsing, labeling, clustering, scoring, and attribution
- static HTML/Markdown/SVG report generation
- canonical v3 machine-readable artifacts
- synthetic fixture coverage and calibration benchmark
Not in scope yet:
- semantic proof that repository state at the end of a session is correct
- deep optional-store joins beyond transcript-first enrichment
- model-graded interpretation of transcript quality
- hosted dashboards or live multi-user service behavior
- npm package publication
- `docs/schema-v3.md` — canonical artifact contract
- `docs/report-v3.md` — report/product model
- `docs/case-study.md` — architecture and design context
For a final public-release pass, regenerate the provider QA bundles locally under `artifacts/`, review them, then run:

```sh
make release-check
```

`make release-check` extends the baseline gate by requiring a clean `main` worktree that matches upstream, validating final QA manifests, scanning benchmark and final QA artifacts, and rerunning the clean/upstream check after validation.
Generate the synthetic HTML preview first, then run `inspect` against the local agent home you actually use, and open `artifacts/report.html` after `eval`.
