[Experiment] Caveman skill + AGENTS.md response-style block #642
Draft
t-prda wants to merge 5 commits into
Treatment branch for the caveman token-cost A/B. Two mechanisms wired:

1. Custom instructions (always-on): appends a terse "Response Style" section to shared/instructions/<repo>/AGENTS.md in both dataset trees. The harness already renames AGENTS.md to CLAUDE.md for Claude Code at run time (instruction_operations.py), so this block is unconditionally loaded into every session.
2. Skill (reinforcement): new shared/instructions/<repo>/skills/caveman/SKILL.md in both trees, copied into the testbed's .claude/skills/ by skills_operations.py when skills.enabled is true.

Config toggles in shared/config.yaml: instructions.enabled and skills.enabled flipped from false to true.

Only skills/caveman/SKILL.md is included. Upstream's companion skills (caveman-commit, caveman-help, caveman-review, compress) target commits, PR reviews, and memory files, which the BC-Bench runtime does not exercise. Hooks, statusline, and /caveman mode switching are intentionally omitted: we want "full" intensity fixed for reproducibility, not interactive mode switching.

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).
Trim everything in the shared instruction tree to the bare minimum needed for the caveman experiment, and shrink the dataset to first-party apps so the local A/B loop runs in minutes rather than hours.

Instruction tree (microsoft-BCApps + microsoftInternal-NAV):
- AGENTS.md: drop the BC/AL overview prose; keep only the "Response Style" caveman block as the canonical instruction file. This is what the harness renames to copilot-instructions.md / CLAUDE.md and loads into the system prompt every turn (always-on enforcement).
- skills/: drop al-test-generation; keep only skills/caveman/.
- agents/: drop ALTest.agent.md (and the now-empty agents/ dir).
- instructions/ (NAV only): drop codeunits/pages/tables.instructions.md.

Local A/B (microsoft__BCApps-4699, claude-haiku-4.5, n=1) showed the extra instruction surface area was a net loss: input-side overhead from loading the skill files outweighed any output-side savings on a short patch task. caveman-compress was a particularly bad fit since it never fires during a bug-fix run; dropping it.

Dataset:
- Filter dataset/bcbench.jsonl to entries whose project_paths do NOT start with App\Layers\W1\BaseApp. 101 -> 16 entries, all under App\Apps\W1\<app> (Shopify, Sustainability, ExcelReports, SubscriptionBilling, etc.). BaseApp tasks pull in the entire base application and dominate eval wall time.

Tests:
- test_agent_skills.py: switch fixture from "al-test-generation" to "caveman" since that is now the only checked-in skill.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
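The dataset filter described in this commit can be sketched roughly as follows. This is an illustration only, not the script used in the PR. It assumes each line of dataset/bcbench.jsonl is one JSON object whose project_paths field is a list of backslash-separated strings, and it reads "do NOT start with" as "no path in the entry starts with the BaseApp prefix". The helper names is_first_party and filter_dataset are hypothetical.

```python
import json
from pathlib import Path

# Prefix named in the commit message; entries under it are dropped.
BASEAPP_PREFIX = "App\\Layers\\W1\\BaseApp"


def is_first_party(entry: dict) -> bool:
    """Return True when none of the entry's project paths sit under BaseApp.

    Assumption: project_paths is a list of strings.
    """
    paths = entry.get("project_paths", [])
    return not any(path.startswith(BASEAPP_PREFIX) for path in paths)


def filter_dataset(src: Path, dst: Path) -> tuple[int, int]:
    """Write the non-BaseApp entries of src to dst; return (before, after) counts."""
    with src.open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    kept = [entry for entry in entries if is_first_party(entry)]
    with dst.open("w", encoding="utf-8") as fh:
        for entry in kept:
            fh.write(json.dumps(entry) + "\n")
    return len(entries), len(kept)
```

Returning both counts makes it easy to confirm the reduction reported above (101 -> 16) after running the filter against the real file.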
haoranpb
reviewed
May 13, 2026
# - Claude: copies to repo/.claude/skills/
skills:
-  enabled: false
+  enabled: true
Collaborator
This is unnecessary when instructions.enabled is set to true.
- unrelated files should be removed (replace semantics)
"""
skills_source = _get_source_instructions_path("microsoftInternal/NAV") / "skills"
source_skill_dir = skills_source / "al-test-generation"
Collaborator
We should address this in the main branch; it is a real gap in the tests.
Experiment Description
A/B test the impact of "caveman-speak" terse response style on agent token consumption (and resolution rate) without sacrificing technical accuracy.
Two reinforcement mechanisms wired in this branch (treatment arm):
1. Custom instructions (always-on): a terse "Response Style" section appended to shared/instructions/<repo>/AGENTS.md in both dataset trees (microsoft-BCApps, microsoftInternal-NAV). The harness already renames AGENTS.md to CLAUDE.md / copilot-instructions.md per agent at runtime (instruction_operations.py), so this block is unconditionally loaded into every session.
2. Skill (reinforcement): shared/instructions/<repo>/skills/caveman/SKILL.md in both trees, copied into the testbed's .claude/skills/ (or .github/skills/ for Copilot) by skills_operations.py when skills.enabled is true.

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).
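The skill-copy step in the second mechanism could look roughly like the sketch below. It is not the code in skills_operations.py: the function name copy_skills, its parameters, and the agent keys are hypothetical. The target directories (.claude/skills/ and .github/skills/) and the replace semantics (stale skills are removed rather than merged) are taken from the description and the test docstring quoted in the review.

```python
import shutil
from pathlib import Path

# Assumption: one skills parent directory per agent, as named in the description.
AGENT_SKILL_PARENT = {"claude": ".claude", "copilot": ".github"}


def copy_skills(source_skills: Path, testbed_root: Path, agent: str = "claude") -> Path:
    """Mirror source_skills into the agent's skills directory inside the testbed.

    Replace semantics: anything already in the target directory is deleted
    first, so skills that are no longer checked in do not linger.
    """
    target = testbed_root / AGENT_SKILL_PARENT[agent] / "skills"
    if target.exists():
        shutil.rmtree(target)
    # copytree creates the target and any missing parent directories.
    shutil.copytree(source_skills, target)
    return target
```

Deleting the target before copying is the simplest way to get replace semantics; a merge-style copy would leave a previously installed al-test-generation skill in place.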
Companion upstream skills (caveman-commit, caveman-help, caveman-review, compress) target commits / PR reviews / memory files and are not exercised by the BC-Bench runtime, so they are intentionally omitted. Hooks, statusline, and /caveman mode switching are also omitted: we want a fixed "full" intensity for reproducibility, not interactive mode switching.

Configuration Changes
- instructions.enabled: true
- skills.enabled: true
- agents.enabled: true, name: ___

Agent & Model
- claude-haiku-4.5 (default for first runs); follow-up sweep across claude-sonnet-4.6, claude-opus-4.7
- bug-fix (primary); test-generation as a secondary check

Hypothesis / Expected Outcome
Output tokens drop measurably (target: ≥20% reduction) on conversational/reasoning turns while resolution rate stays within noise of the baseline. Code written into files is explicitly excluded from the caveman transform, so patch quality should be unaffected. Risk: over-aggressive fragmenting causes the agent to skip steps or misread its own prior reasoning, hurting pass rate — the test-run gate should catch this before a full sweep.
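The ≥20% target can be checked with a small helper like the one below. It is a sketch under stated assumptions: the function names are hypothetical, the inputs are total output-token counts for the baseline and treatment arms on the same task set, and the figures in the usage lines are made-up illustrations rather than measured results.

```python
def output_token_reduction(baseline_tokens: int, treatment_tokens: int) -> float:
    """Fractional drop in output tokens relative to the baseline arm."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - treatment_tokens) / baseline_tokens


def meets_target(baseline_tokens: int, treatment_tokens: int, target: float = 0.20) -> bool:
    """True when the reduction reaches the hypothesis threshold (default 20%)."""
    return output_token_reduction(baseline_tokens, treatment_tokens) >= target


# Illustrative numbers only, not results from this experiment.
print(output_token_reduction(1000, 750))  # 0.25
print(meets_target(1000, 900))            # False
```

A negative value means the treatment arm used more output tokens than the baseline. The same calculation applies to input tokens, which matters here because the local run found input-side overhead outweighing output-side savings.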
Notes
- main (post v0.5.3); pre-existing notebook ty-check failure in notebooks/dataset.ipynb from #637 is unrelated to this experiment.
- exp/caveman-baseline for direct A/B.
- EXPERIMENT.md.