Evaluation infrastructure for agent products.
Use it to wrap the real workflow your users run, record what happened, verify the result, turn feedback into replay data, compare variants, and ship only when the evidence improves.
```
product task
  -> observe state
  -> validate with deterministic gates first
  -> act through the real product adapter
  -> trace + feedback trajectory
  -> replay / optimize / release gate
```

agent-eval does not own product state, credentials, UI, storage, model
routing, browser drivers, sandbox policy, or deployment. Products own those.
This package owns eval contracts, loop mechanics, traces, statistics,
optimization inputs, and release evidence.
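That ownership split implies a thin adapter that the product supplies and the eval loop consumes. A hypothetical sketch (the `ProductAdapter` interface and `memoryAdapter` helper are illustrative names, not part of the package; methods are sync for brevity):

```ts
// Illustrative boundary: the product owns state and execution; agent-eval
// only sees what the adapter exposes. Interface/helper names are mine.
interface ProductAdapter<S, A> {
  readState(taskId: string): S
  runAgentStep(taskId: string, action: A): void
}

type BuildState = { build: { exitCode: number } }
type RepairAction = { type: 'repair'; failed: string[] }

// A toy in-memory adapter, enough to drive a loop in tests.
function memoryAdapter(initialExitCode: number): ProductAdapter<BuildState, RepairAction> {
  let state: BuildState = { build: { exitCode: initialExitCode } }
  return {
    readState: () => state,
    runAgentStep: (_taskId, action) => {
      // Pretend the repair action fixes the build.
      if (action.type === 'repair') state = { build: { exitCode: 0 } }
    },
  }
}
```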
```sh
pnpm add @tangle-network/agent-eval
```

```ts
import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval/control'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
  observe() {
    return product.readState(task.id)
  },
  validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },
  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    if (failed.length === 0) {
      return { type: 'stop', pass: true, reason: 'all gates passed' }
    }
    return {
      type: 'continue',
      action: { type: 'repair', failed: failed.map((e) => e.id) },
      reason: 'repair failed gates',
    }
  },
  act(action) {
    return product.runAgentStep(task.id, action)
  },
})

await product.storeEvalResult(task.id, result)
```

That loop should be the same shape in production, replay, benchmark, and
optimization. Swap dependencies behind observe() and act(), not the eval
contract itself.
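One way to do that swap is to inject observe()/act() per environment while the loop config stays fixed. A minimal sketch (the `LoopDeps`, `liveDeps`, and `replayDeps` names are illustrative, not part of the package):

```ts
// The same loop config accepts different observe()/act() implementations
// per environment. Types mirror the quick-start example's state shape.
type State = { build: { exitCode: number }; preview: { httpStatus: number } }
type Action = { type: string; failed?: string[] }

interface LoopDeps {
  observe(): State
  act(action: Action): void
}

// Production: hit the real product adapter.
function liveDeps(product: {
  readState(): State
  runAgentStep(action: Action): void
}): LoopDeps {
  return {
    observe: () => product.readState(),
    act: (action) => product.runAgentStep(action),
  }
}

// Replay: step through recorded states; acting has no side effects.
function replayDeps(recorded: State[]): LoopDeps {
  let step = 0
  return {
    observe: () => recorded[Math.min(step++, recorded.length - 1)],
    act: () => {},
  }
}
```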
The root export remains available, but new code should prefer focused subpaths:

```ts
import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
import { TraceEmitter } from '@tangle-network/agent-eval/traces'
import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
```

| Subpath | Use for |
|---|---|
| `@tangle-network/agent-eval/control` | observe -> validate -> decide -> act, action policy, propose/review loops |
| `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst |
| `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot optimization, prompt evolution |
| `@tangle-network/agent-eval/reporting` | release confidence, paired stats, report/table/chart specs |
| `@tangle-network/agent-eval/wire` | HTTP/RPC judge server and schemas |
| `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
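The "paired stats" in the reporting subpath refer to comparing baseline and candidate on the same tasks rather than pooling scores. A generic illustration of the idea (this is my own helper, not the package's evaluateReleaseConfidence API):

```ts
// Paired comparison: each task is run under both baseline and candidate,
// and wins/losses are counted per task. Pairing removes task-difficulty
// variance that would swamp an unpaired average.
type PairedScore = { task: string; baseline: number; candidate: number }

function pairedSummary(scores: PairedScore[]) {
  let wins = 0, losses = 0, ties = 0
  for (const s of scores) {
    if (s.candidate > s.baseline) wins++
    else if (s.candidate < s.baseline) losses++
    else ties++
  }
  // Win rate over decided pairs only; ties carry no evidence either way.
  return { wins, losses, ties, winRate: wins / Math.max(1, wins + losses) }
}
```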
| Need | Use |
|---|---|
| Keep an agent working until objective state passes | runAgentControlLoop |
| Turn user/reviewer feedback into replay data | FeedbackTrajectory |
| Compare prompt/tool/retrieval policies over full trajectories | runMultiShotOptimization |
| Gate releases with paired evidence and holdouts | evaluateReleaseConfidence, HeldOutGate |
| Explain regressions across trace corpora | TraceAnalyst / analyzeTraces |
| Report a launch decision | renderReleaseReport, summaryTable, paretoChart, gainHistogram |
| Model missing context separately from bad reasoning | KnowledgeRequirement, KnowledgeBundle |
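For the first row, the quick-start example's decide() treats every failed gate the same. Since objectiveEval carries a severity field, a decide() can also distinguish blocking from advisory gates — a sketch with the EvalResult shape inferred from the example above (the real package's types may differ):

```ts
// Only failed *critical* gates block a passing stop decision;
// failed warnings are tolerated.
type EvalResult = { id: string; passed: boolean; severity: 'critical' | 'warning' }

function gateDecision(evals: EvalResult[]) {
  const blocking = evals.filter((e) => !e.passed && e.severity === 'critical')
  if (blocking.length === 0) {
    return { type: 'stop' as const, pass: true, reason: 'no critical gates failed' }
  }
  return {
    type: 'continue' as const,
    action: { type: 'repair', failed: blocking.map((e) => e.id) },
    reason: 'repair failed critical gates',
  }
}
```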
Runnable examples live in `examples/`:

- `examples/multi-shot-optimization`: optimize full trajectories with held-out promotion.
- `examples/same-sandbox-harness`: run setup/build/test and evidence checks in one workspace.
- `examples/benchmarks`: benchmark adapter shape and reference wrappers.
Read in this order:
- Product Eval Adoption
- Control Runtime
- Feedback Trajectories
- Multi-Shot Optimization
- Trace Analysis
- Knowledge Readiness
- Integration Launch Gates
- Wire Protocol
```sh
npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005
```

The Python client lives in `clients/python`:

```sh
cd clients/python
pip install -e .
```

To develop in this repo:

```sh
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi
```

Related packages:

- @tangle-network/agent-runtime: production session/runtime layer.
- @tangle-network/agent-knowledge: source-grounded knowledge bases and readiness.
- @tangle-network/agent-integrations: connection, grant, capability, and integration invocation contracts.
MIT