Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Updated May 7, 2026 - Python
Lightweight proof-of-concept for oversight-centered metrology in coding agents: workflow-aware evaluation, interrupt channels, and claim-margin reporting beyond raw success scores.
Preprint paper package: Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Zenodo DOI 10.5281/zenodo.20034550)
Reproducible PoC comparing certified reusable workflows against flat direct control on bounded repository-maintenance tasks, with deterministic audit, replay, drift, and maintenance.