[Experiment] Caveman skill + AGENTS.md response-style block by t-prda · Pull Request #642 · microsoft/BC-Bench

t-prda · 2026-05-12T14:49:24Z

Experiment Description

A/B test the impact of "caveman-speak" terse response style on agent token consumption (and resolution rate) without sacrificing technical accuracy.

Two reinforcement mechanisms wired in this branch (treatment arm):

Custom instructions (always-on): appends a "Response Style" block to shared/instructions/<repo>/AGENTS.md in both dataset trees (microsoft-BCApps, microsoftInternal-NAV). The harness already renames AGENTS.md to CLAUDE.md / copilot-instructions.md per agent at runtime (instruction_operations.py), so this block is unconditionally loaded into every session.
Skill (reinforcement): new shared/instructions/<repo>/skills/caveman/SKILL.md in both trees, copied into the testbed's .claude/skills/ (or .github/skills/ for Copilot) by skills_operations.py when skills.enabled is true.
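The two mechanisms above can be sketched as a single staging step. This is a minimal illustration only, not the harness code: `stage_instructions`, `INSTRUCTION_TARGETS`, and `SKILL_TARGETS` are hypothetical names, the `.github/` location for `copilot-instructions.md` is an assumption, and the real logic lives in `instruction_operations.py` and `skills_operations.py`.

```python
import shutil
from pathlib import Path

# Hypothetical per-agent targets (assumption, not copied from the harness).
INSTRUCTION_TARGETS = {
    "claude": Path("CLAUDE.md"),
    "copilot": Path(".github") / "copilot-instructions.md",
}
SKILL_TARGETS = {
    "claude": Path(".claude") / "skills",
    "copilot": Path(".github") / "skills",
}


def stage_instructions(source_dir: Path, testbed: Path, agent: str, skills_enabled: bool) -> None:
    """Copy AGENTS.md under the agent-specific name and, optionally, the skills tree."""
    # Always-on mechanism: the instruction file is staged for every session.
    target = testbed / INSTRUCTION_TARGETS[agent]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source_dir / "AGENTS.md", target)

    # Reinforcement mechanism: skills are staged only when enabled.
    if skills_enabled:
        skills_target = testbed / SKILL_TARGETS[agent]
        if skills_target.exists():
            # Replace semantics: drop anything already in the skills directory.
            shutil.rmtree(skills_target)
        shutil.copytree(source_dir / "skills", skills_target)
```

With `skills_enabled=False` only the always-on instruction file is staged, which is the case the review comment on `skills.enabled` below is concerned with.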

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).

Companion upstream skills (caveman-commit, caveman-help, caveman-review, compress) target commits / PR reviews / memory files and are not exercised by the BC-Bench runtime, so they are intentionally omitted. Hooks, statusline, and /caveman mode switching are also omitted — we want a fixed "full" intensity for reproducibility, not interactive mode switching.

Configuration Changes

Custom instructions (instructions.enabled: true)
Skills (skills.enabled: true)
Custom agents (agents.enabled: true, name: ___)
MCP servers (list below)
Other (describe)

Agent & Model

Agent: GitHub Copilot CLI and Claude Code
Model: claude-haiku-4.5 (default for first runs); follow-up sweep across claude-sonnet-4.6, claude-opus-4.7
Category: bug-fix (primary); test-generation as a secondary check

Hypothesis / Expected Outcome

Output tokens drop measurably (target: ≥20% reduction) on conversational/reasoning turns while resolution rate stays within noise of the baseline. Code written into files is explicitly excluded from the caveman transform, so patch quality should be unaffected. Risk: over-aggressive fragmenting causes the agent to skip steps or misread its own prior reasoning, hurting pass rate — the test-run gate should catch this before a full sweep.
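The token target can be checked with a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the counts are whatever output-token totals the harness reports for each arm, and the numbers in the usage note are made up for illustration rather than measured results.

```python
def output_token_reduction(baseline_tokens: int, treatment_tokens: int) -> float:
    """Fractional drop in output tokens from the baseline arm to the treatment arm."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - treatment_tokens) / baseline_tokens


def meets_target(baseline_tokens: int, treatment_tokens: int, target: float = 0.20) -> bool:
    """True when the reduction reaches the target (default: the 20% from the hypothesis)."""
    return output_token_reduction(baseline_tokens, treatment_tokens) >= target
```

For example, `meets_target(1000, 780)` is true (a 22% drop) and `meets_target(1000, 850)` is false (a 15% drop). Resolution rate still has to be compared separately against the baseline arm.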

Notes

Branch base: merged with latest main (post v0.5.3); pre-existing notebook ty-check failure in notebooks/dataset.ipynb from #637 is unrelated to this experiment.
Baseline arm (instructions enabled, no caveman block) lives on exp/caveman-baseline for direct A/B.
Draft PR — not intended to merge; serves as the entry point describing what is being evaluated, per EXPERIMENT.md.

Treatment branch for the caveman token-cost A/B.

Two mechanisms wired:

1. Custom instructions (always-on): appends a terse "Response Style" section to shared/instructions/<repo>/AGENTS.md in both dataset trees. Harness already renames AGENTS.md to CLAUDE.md for Claude Code at run time (instruction_operations.py), so this block is unconditionally loaded into every session.
2. Skill (reinforcement): new shared/instructions/<repo>/skills/caveman/SKILL.md in both trees, copied into testbed .claude/skills/ by skills_operations.py when skills.enabled is true.

Config toggles in shared/config.yaml: instructions.enabled and skills.enabled flipped from false to true.

Only skills/caveman/SKILL.md is included. Upstream's companion skills (caveman-commit, caveman-help, caveman-review, compress) target commits / PR reviews / memory files, which are not exercised by BC-Bench runtime. Hooks, statusline, and /caveman mode switching are intentionally omitted: we want "full" intensity fixed for reproducibility, not interactive mode switching.

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).

Trim everything in the shared instruction tree to the bare minimum needed for the caveman experiment, and shrink the dataset to first-party apps so the local A/B loop runs in minutes rather than hours.

Instruction tree (microsoft-BCApps + microsoftInternal-NAV):
- AGENTS.md: drop the BC/AL overview prose; keep only the "Response Style" caveman block as the canonical instruction file. This is what the harness renames to copilot-instructions.md / CLAUDE.md and loads into the system prompt every turn (always-on enforcement).
- skills/: drop al-test-generation; keep only skills/caveman/.
- agents/: drop ALTest.agent.md (and the now-empty agents/ dir).
- instructions/ (NAV only): drop codeunits/pages/tables.instructions.md.

Local A/B (microsoft__BCApps-4699, claude-haiku-4.5, n=1) showed the extra instruction surface area was a net loss: input-side overhead from loading the skill files outweighed any output-side savings on a short patch task. caveman-compress was a particularly bad fit since it never fires during a bug-fix run; dropping it.

Dataset:
- Filter dataset/bcbench.jsonl to entries whose project_paths do NOT start with App\Layers\W1\BaseApp. 101 -> 16 entries, all under App\Apps\W1\<app> (Shopify, Sustainability, ExcelReports, SubscriptionBilling, etc.). BaseApp tasks pull in the entire base application and dominate eval wall time.

Tests:
- test_agent_skills.py: switch fixture from "al-test-generation" to "caveman" since that is now the only checked-in skill.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
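The dataset filter described in this commit could look roughly like the following. It is a sketch, not the script that was actually run: it assumes `project_paths` is a list of strings in each JSONL record and that an entry is dropped if any of its paths starts with the BaseApp prefix. The names `is_first_party` and `filter_dataset` are hypothetical.

```python
import json
from pathlib import Path

# Prefix named in the commit message; entries under it are dropped.
BASEAPP_PREFIX = "App\\Layers\\W1\\BaseApp"


def is_first_party(entry: dict) -> bool:
    """Keep an entry only if none of its project paths live under BaseApp."""
    paths = entry.get("project_paths", [])  # assumed: list of path strings
    return not any(path.startswith(BASEAPP_PREFIX) for path in paths)


def filter_dataset(src: Path, dst: Path) -> tuple[int, int]:
    """Write the kept records to dst and return (total, kept) counts."""
    total = kept = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            total += 1
            if is_first_party(json.loads(line)):
                # Write the original line back so record formatting is untouched.
                fout.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return total, kept
```

Returning both counts makes it easy to confirm the shrink reported above against the real `dataset/bcbench.jsonl`.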

haoranpb · 2026-05-13T13:51:57Z

 #    - Claude: copies to repo/.claude/skills/
 skills:
-  enabled: false
+  enabled: true


this is unnecessary when instruction is set to true

haoranpb · 2026-05-13T13:52:28Z

    - unrelated files should be removed (replace semantics)
    """
    skills_source = _get_source_instructions_path("microsoftInternal/NAV") / "skills"
-    source_skill_dir = skills_source / "al-test-generation"


we should address this in main branch, a real gap in test

…/caveman-skill

t-prda and others added 3 commits April 20, 2026 17:54

Merge branch 'main' into exp/caveman-skill

cf81373

haoranpb reviewed May 13, 2026


haoranpb added 2 commits May 21, 2026 14:41

Merge branch 'main' of https://github.com/microsoft/BC-Bench into exp…

9c5829d

…/caveman-skill

run on the full dataset

1a552b7

t-prda wants to merge 5 commits into main from exp/caveman-skill
