@tangle-network/agent-eval

Evaluation infrastructure for agent products.

Use it to wrap the real workflow your users run, record what happened, verify the result, turn feedback into replay data, compare variants, and ship only when the evidence improves.

product task
  -> observe state
  -> validate with deterministic gates first
  -> act through the real product adapter
  -> trace + feedback trajectory
  -> replay / optimize / release gate

agent-eval does not own product state, credentials, UI, storage, model routing, browser drivers, sandbox policy, or deployment. Products own those. This package owns eval contracts, loop mechanics, traces, statistics, optimization inputs, and release evidence.

Install

pnpm add @tangle-network/agent-eval

Quick Start

import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval/control'

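// `task` and `product` come from your application; agent-eval does not provide them.
const result = await runAgentControlLoop({
  intent: task.prompt,
  // Loop budget: step count, wall-clock milliseconds, and cost in USD.
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },

  // Read the current product state.
  observe() {
    return product.readState(task.id)
  },

  // Deterministic gates computed from the observed state.
  validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },

  // Stop when every gate passes; otherwise request a repair of the failed gates.
  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    if (failed.length === 0) {
      return { type: 'stop', pass: true, reason: 'all gates passed' }
    }
    return {
      type: 'continue',
      action: { type: 'repair', failed: failed.map((e) => e.id) },
      reason: 'repair failed gates',
    }
  },

  // Apply the chosen action through the real product adapter.
  act(action) {
    return product.runAgentStep(task.id, action)
  },
})

// Persist the loop result on the product side.
await product.storeEvalResult(task.id, result)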

The loop should keep the same shape in production, replay, benchmark, and optimization runs. Swap the dependencies behind observe() and act(), not the eval contract itself.
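As a rough illustration of swapping dependencies behind observe() and act(), the sketch below runs the same gate against a replay adapter that serves recorded states instead of a live product. Everything in it is hypothetical and local to the example: ProductAdapter, ReplayAdapter, validate, and runLoop are not exports of agent-eval, and runLoop is a simplified, synchronous stand-in for the real control loop.

```typescript
// Hypothetical shapes for illustration; none of these come from agent-eval.
interface ProductState { build: { exitCode: number } }
interface RepairAction { type: 'repair'; failed: string[] }

// The seam: everything environment-specific sits behind these two methods.
interface ProductAdapter {
  readState(taskId: string): ProductState
  runAgentStep(taskId: string, action: RepairAction): void
}

// Replay adapter: serves recorded states in order instead of touching a live product.
class ReplayAdapter implements ProductAdapter {
  private cursor = 0
  private readonly recorded: ProductState[]
  readonly actions: RepairAction[] = []

  constructor(recorded: ProductState[]) {
    if (recorded.length === 0) throw new Error('need at least one recorded state')
    this.recorded = recorded
  }

  readState(_taskId: string): ProductState {
    // Clamp to the last recorded state so reads past the end stay defined.
    return this.recorded[Math.min(this.cursor, this.recorded.length - 1)]
  }

  runAgentStep(_taskId: string, action: RepairAction): void {
    this.actions.push(action)
    this.cursor += 1 // each action advances to the next recorded state
  }
}

// The eval contract: identical no matter which adapter is plugged in.
function validate(state: ProductState): { id: string; passed: boolean }[] {
  return [{ id: 'build-passes', passed: state.build.exitCode === 0 }]
}

// Stand-in loop (synchronous for brevity): observe -> validate -> decide -> act.
function runLoop(
  adapter: ProductAdapter,
  taskId: string,
  maxSteps: number,
): { pass: boolean; steps: number } {
  for (let step = 0; step < maxSteps; step++) {
    const state = adapter.readState(taskId)
    const failed = validate(state).filter((e) => !e.passed)
    if (failed.length === 0) return { pass: true, steps: step }
    adapter.runAgentStep(taskId, { type: 'repair', failed: failed.map((e) => e.id) })
  }
  return { pass: false, steps: maxSteps } // budget exhausted
}

// Recorded run: the build fails once, then passes after one repair.
const replay = new ReplayAdapter([{ build: { exitCode: 1 } }, { build: { exitCode: 0 } }])
const outcome = runLoop(replay, 'task-1', 8)
console.log(outcome)
```

A production adapter would implement the same two methods against the live product, leaving validate and the loop untouched.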

Import Paths

The root export remains available, but new code should prefer focused subpaths:

import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
import { TraceEmitter } from '@tangle-network/agent-eval/traces'
import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'

Subpath                                   Use for
@tangle-network/agent-eval/control        observe -> validate -> decide -> act, action policy, propose/review loops
@tangle-network/agent-eval/traces         trace stores, emitters, TraceAnalyst
@tangle-network/agent-eval/optimization   feedback trajectories, multi-shot optimization, prompt evolution
@tangle-network/agent-eval/reporting      release confidence, paired stats, report/table/chart specs
@tangle-network/agent-eval/wire           HTTP/RPC judge server and schemas
@tangle-network/agent-eval/benchmarks     benchmark adapter contracts and reference wrappers

Core Pieces

Need                                                            Use
Keep an agent working until objective state passes              runAgentControlLoop
Turn user/reviewer feedback into replay data                    FeedbackTrajectory
Compare prompt/tool/retrieval policies over full trajectories   runMultiShotOptimization
Gate releases with paired evidence and holdouts                 evaluateReleaseConfidence, HeldOutGate
Explain regressions across trace corpora                        TraceAnalyst / analyzeTraces
Report a launch decision                                        renderReleaseReport, summaryTable, paretoChart, gainHistogram
Model missing context separately from bad reasoning             KnowledgeRequirement, KnowledgeBundle
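
To make "paired evidence" concrete, here is a minimal sketch of a paired comparison: each task is scored under both the baseline and the candidate, and the per-task differences are tested with an exact sign-flip permutation test. This is an illustrative stand-in, not the implementation behind evaluateReleaseConfidence; pairedSignFlipTest, the sample scores, and the 0.05 threshold are all assumptions made for the example.

```typescript
// Illustrative only: not the implementation behind evaluateReleaseConfidence.
// baseline[i] and candidate[i] must be scores for the same task.

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Exact one-sided sign-flip permutation test on paired differences.
// Enumerates all 2^n sign assignments, so it is limited to small n.
function pairedSignFlipTest(
  baseline: number[],
  candidate: number[],
): { meanDelta: number; pValue: number } {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('baseline and candidate must be non-empty and the same length')
  }
  if (baseline.length > 20) throw new Error('exact enumeration is limited to 20 pairs')

  const deltas = candidate.map((c, i) => c - baseline[i])
  const observed = mean(deltas)
  const n = deltas.length
  const total = 2 ** n
  let atLeastAsExtreme = 0

  for (let mask = 0; mask < total; mask++) {
    let sum = 0
    for (let i = 0; i < n; i++) {
      // Bit i of mask decides whether pair i keeps or flips its sign.
      sum += (mask >> i) & 1 ? -deltas[i] : deltas[i]
    }
    // Small tolerance guards against floating-point noise in the comparison.
    if (sum / n >= observed - 1e-12) atLeastAsExtreme++
  }
  return { meanDelta: observed, pValue: atLeastAsExtreme / total }
}

// Made-up per-task scores for five tasks.
const baseline = [0.5, 0.6, 0.4, 0.7, 0.5]
const candidate = [0.7, 0.8, 0.5, 0.9, 0.6]
const evidence = pairedSignFlipTest(baseline, candidate)

// Hypothetical gate: the 0.05 threshold is an illustrative choice, not a package default.
const ship = evidence.meanDelta > 0 && evidence.pValue < 0.05
console.log({ ...evidence, ship })
```

Pairing matters because per-task difficulty cancels out in the differences, so a small but consistent gain can be detected with far fewer tasks than an unpaired comparison would need.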

Examples

Runnable examples live in examples/.

Docs

Read in this order:

  1. Product Eval Adoption
  2. Control Runtime
  3. Feedback Trajectories
  4. Multi-Shot Optimization
  5. Trace Analysis
  6. Knowledge Readiness
  7. Integration Launch Gates
  8. Wire Protocol

CLI / Wire Protocol

npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005

The Python client lives in clients/python:

cd clients/python
pip install -e .

Development

pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi

Related Packages

  • @tangle-network/agent-runtime: production session/runtime layer.
  • @tangle-network/agent-knowledge: source-grounded knowledge bases and readiness.
  • @tangle-network/agent-integrations: connection, grant, capability, and integration invocation contracts.

License

MIT
