Skip to content

smith-xyz/context-reliability-testing

Repository files navigation

context-reliability-testing

TDD for context files — measure whether your AGENTS.md, CLAUDE.md, and .cursorrules actually help on your repo.

What it does

Runs the same coding tasks under different context conditions (no context vs. full AGENTS.md vs. custom variants) and compares pass rate, token usage, wall time, and tool calls.

How it helps

Same task, same agent, only context differs. That isolates context as the variable so you can see what actually moves the needle vs. what's noise.

References: Gloaguen et al., Zhang et al., Lulla et al., Sharma.

How it works

flowchart TD
    subgraph you [You provide]
        tasks([Tasks / Eval Set])
        context([Context Files])
        agent([Your Agent CLI])
        config([Run Config])
    end

    subgraph crt [CRT handles]
        worktree{{Git Worktree Isolation}}
        cond_a[\No context/]
        cond_b[\Full AGENTS.md/]
        cond_c[\Custom variant/]
        exec{{Agent Execution}}
        accept{{Acceptance Check}}
    end

    subgraph results [You get back]
        metrics[Metrics per condition]
        comparison>Comparison Report]
    end

    config --> worktree
    tasks --> worktree
    context --> worktree
    worktree --> cond_a & cond_b & cond_c
    cond_a & cond_b & cond_c --> exec
    agent --> exec
    exec --> accept
    accept --> metrics
    metrics --> comparison
Loading

Quick start

uv sync --extra dev

# Scaffold configs from your repo (detects AGENTS.md, .cursorrules, etc.)
uv run crt init . --test-cmd "pytest -x"

# Dry run (show work scope, no agents invoked)
uv run crt run --config crt-config.yaml --tasks crt-tasks.yaml --dry-run

# Real run (agent output streams to terminal)
uv run crt run --config crt-config.yaml --tasks crt-tasks.yaml

# Compare two runs for regressions
uv run crt compare --baseline out/baseline.json --current out/results.json

Container usage

Agents run with broad permissions — container isolation enforces least privilege over filesystem, user, and network access. See examples/sandbox/ for a complete setup with network allowlisting via a filtering proxy.

See USAGE.md for details.

Commands

Command Description
crt init Scaffold starter configs by detecting context files in a repo
crt run Run tasks under different context conditions and measure results
crt compare Compare two result files and report regressions

See USAGE.md for agent setup, configuration reference, and detailed workflows.

About

TDD for agent context files

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors