TDD for context files — measure whether your AGENTS.md, CLAUDE.md, and .cursorrules actually help on your repo.
Runs the same coding tasks under different context conditions (no context vs. full AGENTS.md vs. custom variants) and compares pass rate, token usage, wall time, and tool calls.
Same task, same agent; only the context differs. Holding everything else fixed isolates context as the variable, so you can see which context files change outcomes and which make no measurable difference.
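As a rough illustration of the comparison, the sketch below aggregates pass rate and mean token usage per condition. The record fields (`condition`, `passed`, `tokens`) and the sample values are hypothetical and chosen for illustration; they are not CRT's actual result schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records; field names and values are illustrative only,
# not CRT's real output format.
runs = [
    {"condition": "no-context", "task": "t1", "passed": False, "tokens": 1200},
    {"condition": "no-context", "task": "t2", "passed": True, "tokens": 900},
    {"condition": "full-agents-md", "task": "t1", "passed": True, "tokens": 1500},
    {"condition": "full-agents-md", "task": "t2", "passed": True, "tokens": 1100},
]


def summarize(records):
    """Group runs by condition and compute pass rate and mean token usage."""
    by_condition = defaultdict(list)
    for record in records:
        by_condition[record["condition"]].append(record)
    return {
        name: {
            "pass_rate": sum(r["passed"] for r in group) / len(group),
            "mean_tokens": mean(r["tokens"] for r in group),
        }
        for name, group in by_condition.items()
    }


summary = summarize(runs)
for name, stats in summary.items():
    print(f"{name}: pass_rate={stats['pass_rate']:.2f} mean_tokens={stats['mean_tokens']:.0f}")
```

Because the task set is identical across conditions, any difference between the per-condition rows can be attributed to the context files rather than to task mix.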
References: Gloaguen et al., Zhang et al., Lulla et al., Sharma.
flowchart TD
subgraph you [You provide]
tasks([Tasks / Eval Set])
context([Context Files])
agent([Your Agent CLI])
config([Run Config])
end
subgraph crt [CRT handles]
worktree{{Git Worktree Isolation}}
cond_a[\No context/]
cond_b[\Full AGENTS.md/]
cond_c[\Custom variant/]
exec{{Agent Execution}}
accept{{Acceptance Check}}
end
subgraph results [You get back]
metrics[Metrics per condition]
comparison>Comparison Report]
end
config --> worktree
tasks --> worktree
context --> worktree
worktree --> cond_a & cond_b & cond_c
cond_a & cond_b & cond_c --> exec
agent --> exec
exec --> accept
accept --> metrics
metrics --> comparison
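The worktree step in the diagram can be approximated as follows: one detached `git worktree` per condition, so each agent run starts from the same commit in its own directory. This is a sketch of the general technique, not CRT's implementation; the helper names, directory layout, and condition names are assumptions.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, dest: Path, ref: str = "HEAD") -> list[str]:
    """Build the `git worktree add` command for one isolated checkout."""
    # --detach avoids creating a branch for each throwaway checkout.
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), ref]


def create_worktrees(
    repo: Path, base: Path, conditions: list[str], ref: str = "HEAD"
) -> dict[str, Path]:
    """Create one worktree per condition and return their paths."""
    paths = {}
    for name in conditions:
        dest = base / name
        subprocess.run(worktree_add_command(repo, dest, ref), check=True)
        paths[name] = dest
    return paths


# Show the commands without touching any repository.
for condition in ["no-context", "full-agents-md", "custom-variant"]:
    dest = Path("crt-worktrees") / condition  # hypothetical output directory
    print(" ".join(worktree_add_command(Path("."), dest)))
```

Separate worktrees keep one condition's file edits from leaking into another, which is what makes the per-condition results comparable.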
uv sync --extra dev
# Scaffold configs from your repo (detects AGENTS.md, .cursorrules, etc.)
uv run crt init . --test-cmd "pytest -x"
# Dry run (show work scope, no agents invoked)
uv run crt run --config crt-config.yaml --tasks crt-tasks.yaml --dry-run
# Real run (agent output streams to terminal)
uv run crt run --config crt-config.yaml --tasks crt-tasks.yaml
# Compare two runs for regressions
uv run crt compare --baseline out/baseline.json --current out/results.json

Agents run with broad permissions; container isolation enforces least privilege over filesystem, user, and network access. See examples/sandbox/ for a complete setup with network allowlisting via a filtering proxy.
See USAGE.md for details.
| Command | Description |
|---|---|
| `crt init` | Scaffold starter configs by detecting context files in a repo |
| `crt run` | Run tasks under different context conditions and measure results |
| `crt compare` | Compare two result files and report regressions |
See USAGE.md for agent setup, configuration reference, and detailed workflows.
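
To make the idea of a regression concrete, here is a minimal sketch that flags tasks which passed in a baseline run but fail in the current one. The result shape (a mapping from task ID to a pass flag) and the task IDs are assumptions made for illustration; the real result file format read by `crt compare` is not shown here.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return task IDs that passed in the baseline but do not pass now."""
    return sorted(
        task
        for task, passed in baseline.items()
        # A task missing from the current run is treated as not passing.
        if passed and not current.get(task, False)
    )


# Hypothetical results keyed by task ID.
baseline = {"add-endpoint": True, "fix-flaky-test": True, "rename-module": False}
current = {"add-endpoint": True, "fix-flaky-test": False, "rename-module": True}

print(find_regressions(baseline, current))
```

Tasks that newly pass (here `rename-module`) are improvements rather than regressions, so this sketch deliberately ignores them.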