Evaluation infrastructure for agent products.
Use it to wrap the real workflow your users run, record what happened, verify the result, turn feedback into replay data, compare variants, and ship only when the evidence improves.
```
product task
  -> observe state
  -> validate with deterministic gates first
  -> act through the real product adapter
  -> trace + feedback trajectory
  -> replay / optimize / release gate
```

agent-eval does not own product state, credentials, UI, storage, model
routing, browser drivers, sandbox policy, or deployment. Products own those.
This package owns eval contracts, loop mechanics, traces, statistics,
optimization inputs, and release evidence.
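That ownership split implies a thin adapter that the product supplies and the eval loop consumes. A hypothetical sketch (the `ProductAdapter` interface and `memoryAdapter` helper are illustrative names, not part of the package; methods are sync for brevity):

```ts
// Illustrative boundary: the product owns state and execution; agent-eval
// only sees what the adapter exposes. Interface/helper names are mine.
interface ProductAdapter<S, A> {
  readState(taskId: string): S
  runAgentStep(taskId: string, action: A): void
}

type BuildState = { build: { exitCode: number } }
type RepairAction = { type: 'repair'; failed: string[] }

// A toy in-memory adapter, enough to drive a loop in tests.
function memoryAdapter(initialExitCode: number): ProductAdapter<BuildState, RepairAction> {
  let state: BuildState = { build: { exitCode: initialExitCode } }
  return {
    readState: () => state,
    runAgentStep: (_taskId, action) => {
      // Pretend the repair action fixes the build.
      if (action.type === 'repair') state = { build: { exitCode: 0 } }
    },
  }
}
```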
```sh
pnpm add @tangle-network/agent-eval
```

```ts
import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval/control'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
  observe() {
    return product.readState(task.id)
  },
  validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },
  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    if (failed.length === 0) {
      return { type: 'stop', pass: true, reason: 'all gates passed' }
    }
    return {
      type: 'continue',
      action: { type: 'repair', failed: failed.map((e) => e.id) },
      reason: 'repair failed gates',
    }
  },
  act(action) {
    return product.runAgentStep(task.id, action)
  },
})

await product.storeEvalResult(task.id, result)
```

That loop should be the same shape in production, replay, benchmark, and
optimization. Swap dependencies behind observe() and act(), not the eval
contract itself.
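One way to do that swap is to inject observe()/act() per environment while the loop config stays fixed. A minimal sketch (the `LoopDeps`, `liveDeps`, and `replayDeps` names are illustrative, not part of the package):

```ts
// The same loop config accepts different observe()/act() implementations
// per environment. Types mirror the quick-start example's state shape.
type State = { build: { exitCode: number }; preview: { httpStatus: number } }
type Action = { type: string; failed?: string[] }

interface LoopDeps {
  observe(): State
  act(action: Action): void
}

// Production: hit the real product adapter.
function liveDeps(product: {
  readState(): State
  runAgentStep(action: Action): void
}): LoopDeps {
  return {
    observe: () => product.readState(),
    act: (action) => product.runAgentStep(action),
  }
}

// Replay: step through recorded states; acting has no side effects.
function replayDeps(recorded: State[]): LoopDeps {
  let step = 0
  return {
    observe: () => recorded[Math.min(step++, recorded.length - 1)],
    act: () => {},
  }
}
```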
The root export remains available, but new code should prefer focused subpaths:

```ts
import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
import { TraceEmitter } from '@tangle-network/agent-eval/traces'
import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
```

| Subpath | Use for |
|---|---|
| `@tangle-network/agent-eval/control` | observe -> validate -> decide -> act, action policy, propose/review loops |
| `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst |
| `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot optimization, prompt evolution |
| `@tangle-network/agent-eval/reporting` | release confidence, paired stats, report/table/chart specs |
| `@tangle-network/agent-eval/wire` | HTTP/RPC judge server and schemas |
| `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
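The "paired stats" in the reporting subpath refer to comparing baseline and candidate on the same tasks rather than pooling scores. A generic illustration of the idea (this is my own helper, not the package's evaluateReleaseConfidence API):

```ts
// Paired comparison: each task is run under both baseline and candidate,
// and wins/losses are counted per task. Pairing removes task-difficulty
// variance that would swamp an unpaired average.
type PairedScore = { task: string; baseline: number; candidate: number }

function pairedSummary(scores: PairedScore[]) {
  let wins = 0, losses = 0, ties = 0
  for (const s of scores) {
    if (s.candidate > s.baseline) wins++
    else if (s.candidate < s.baseline) losses++
    else ties++
  }
  // Win rate over decided pairs only; ties carry no evidence either way.
  return { wins, losses, ties, winRate: wins / Math.max(1, wins + losses) }
}
```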
| Need | Use |
|---|---|
| Keep an agent working until objective state passes | runAgentControlLoop |
| Turn user/reviewer feedback into replay data | FeedbackTrajectory |
| Compare prompt/tool/retrieval policies over full trajectories | runMultiShotOptimization |
| Gate releases with paired evidence and holdouts | evaluateReleaseConfidence, HeldOutGate |
| Explain regressions across trace corpora | TraceAnalyst / analyzeTraces |
| Report a launch decision | renderReleaseReport, summaryTable, paretoChart, gainHistogram |
| Model missing context separately from bad reasoning | KnowledgeRequirement, KnowledgeBundle |
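For the first row, the quick-start example's decide() treats every failed gate the same. Since objectiveEval carries a severity field, a decide() can also distinguish blocking from advisory gates — a sketch with the EvalResult shape inferred from the example above (the real package's types may differ):

```ts
// Only failed *critical* gates block a passing stop decision;
// failed warnings are tolerated.
type EvalResult = { id: string; passed: boolean; severity: 'critical' | 'warning' }

function gateDecision(evals: EvalResult[]) {
  const blocking = evals.filter((e) => !e.passed && e.severity === 'critical')
  if (blocking.length === 0) {
    return { type: 'stop' as const, pass: true, reason: 'no critical gates failed' }
  }
  return {
    type: 'continue' as const,
    action: { type: 'repair', failed: blocking.map((e) => e.id) },
    reason: 'repair failed critical gates',
  }
}
```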
Runnable examples live in `examples/`:

- `examples/multi-shot-optimization`: optimize full trajectories with held-out promotion.
- `examples/same-sandbox-harness`: run setup/build/test and evidence checks in one workspace.
- `examples/benchmarks`: benchmark adapter shape and reference wrappers.
Read in this order:
- Product Eval Adoption
- Control Runtime
- Feedback Trajectories
- Multi-Shot Optimization
- Trace Analysis
- Knowledge Readiness
- Integration Launch Gates
- Wire Protocol
```sh
npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005
```

The Python client lives in `clients/python`:

```sh
cd clients/python
pip install -e .
```

To develop in this repo:

```sh
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi
```

Related packages:

- @tangle-network/agent-runtime: production session/runtime layer.
- @tangle-network/agent-knowledge: source-grounded knowledge bases and readiness.
- @tangle-network/agent-integrations: connection, grant, capability, and integration invocation contracts.
MIT