[Experiment] Caveman skill + AGENTS.md response-style block#642

Draft
t-prda wants to merge 5 commits into main from exp/caveman-skill

t-prda (Collaborator) commented May 12, 2026

Experiment Description

A/B test whether a terse "caveman-speak" response style reduces agent token consumption while keeping resolution rate and technical accuracy intact.

Two reinforcement mechanisms wired in this branch (treatment arm):

  1. Custom instructions (always-on): appends a "Response Style" block to shared/instructions/<repo>/AGENTS.md in both dataset trees (microsoft-BCApps, microsoftInternal-NAV). The harness already renames AGENTS.md to CLAUDE.md / copilot-instructions.md per agent at runtime (instruction_operations.py), so this block is unconditionally loaded into every session.
  2. Skill (reinforcement): new shared/instructions/<repo>/skills/caveman/SKILL.md in both trees, copied into the testbed's .claude/skills/ (or .github/skills/ for Copilot) by skills_operations.py when skills.enabled is true.
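
The two mechanisms above amount to a copy-and-rename step per agent. Here is a minimal sketch of that staging, not the harness's actual code: `stage_agent_files`, `AGENT_LAYOUT`, and the exact destination paths (repo-root `CLAUDE.md`, `.github/copilot-instructions.md`) are assumptions for illustration. The real logic lives in instruction_operations.py and skills_operations.py.

```python
import shutil
from pathlib import Path

# Hypothetical layout table. The PR only states that AGENTS.md is renamed to
# CLAUDE.md / copilot-instructions.md and that skills are copied into
# .claude/skills/ or .github/skills/; the exact parent folders are assumed.
AGENT_LAYOUT = {
    "claude": {"instructions": "CLAUDE.md", "skills_dir": ".claude/skills"},
    "copilot": {
        "instructions": ".github/copilot-instructions.md",
        "skills_dir": ".github/skills",
    },
}


def stage_agent_files(
    source: Path, testbed: Path, agent: str, skills_enabled: bool
) -> list[Path]:
    """Copy AGENTS.md (renamed per agent) and, optionally, the skills tree."""
    layout = AGENT_LAYOUT[agent]
    written: list[Path] = []

    # Always-on mechanism: the instruction file is staged unconditionally.
    target = testbed / layout["instructions"]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source / "AGENTS.md", target)
    written.append(target)

    # Reinforcement mechanism: skills are staged only when enabled.
    if skills_enabled:
        skills_target = testbed / layout["skills_dir"]
        if skills_target.exists():
            # Assumed replace semantics: drop anything staged earlier.
            shutil.rmtree(skills_target)
        shutil.copytree(source / "skills", skills_target)
        written.append(skills_target)

    return written
```

Under these assumptions, calling `stage_agent_files(src, testbed, "claude", skills_enabled=True)` would leave `CLAUDE.md` and `.claude/skills/caveman/SKILL.md` in the testbed, which is the treatment-arm state this PR describes.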

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).

Companion upstream skills (caveman-commit, caveman-help, caveman-review, compress) target commits / PR reviews / memory files and are not exercised by the BC-Bench runtime, so they are intentionally omitted. Hooks, statusline, and /caveman mode switching are also omitted — we want a fixed "full" intensity for reproducibility, not interactive mode switching.

Configuration Changes

  • Custom instructions (instructions.enabled: true)
  • Skills (skills.enabled: true)
  • Custom agents (agents.enabled: true, name: ___)
  • MCP servers (list below)
  • Other (describe)

Agent & Model

  • Agent: GitHub Copilot CLI and Claude Code
  • Model: claude-haiku-4.5 (default for first runs); follow-up sweep across claude-sonnet-4.6, claude-opus-4.7
  • Category: bug-fix (primary); test-generation as a secondary check

Hypothesis / Expected Outcome

Output tokens drop measurably (target: ≥20% reduction) on conversational/reasoning turns while resolution rate stays within noise of the baseline. Code written into files is explicitly excluded from the caveman transform, so patch quality should be unaffected. Risk: over-aggressive fragmenting causes the agent to skip steps or misread its own prior reasoning, hurting pass rate — the test-run gate should catch this before a full sweep.
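
One way to make the hypothesis checkable is to compute the relative output-token reduction and compare resolution rates between arms. The sketch below is illustrative only: `ArmResult`, `meets_hypothesis`, and especially the `noise_band` default are assumptions, since this PR does not define what "within noise" means numerically.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    """Aggregate numbers for one arm of the A/B (hypothetical shape)."""

    output_tokens: int
    resolved: int
    total: int

    @property
    def resolution_rate(self) -> float:
        return self.resolved / self.total


def output_token_reduction(baseline: ArmResult, treatment: ArmResult) -> float:
    """Fraction of baseline output tokens saved by the treatment arm."""
    return 1.0 - treatment.output_tokens / baseline.output_tokens


def meets_hypothesis(
    baseline: ArmResult,
    treatment: ArmResult,
    min_reduction: float = 0.20,  # the >=20% target stated in the PR
    noise_band: float = 0.05,  # assumed tolerance; not specified in the PR
) -> bool:
    """True if tokens drop enough and resolution rate stays close to baseline."""
    reduction = output_token_reduction(baseline, treatment)
    rate_delta = abs(treatment.resolution_rate - baseline.resolution_rate)
    return reduction >= min_reduction and rate_delta <= noise_band
```

With made-up numbers, a baseline of `ArmResult(100_000, 8, 16)` against a treatment of `ArmResult(75_000, 8, 16)` gives a 25% reduction at an unchanged resolution rate, so the check passes. A treatment that saves tokens but resolves noticeably fewer tasks would fail it.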

Notes

  • Branch base: merged with latest main (post v0.5.3); pre-existing notebook ty-check failure in notebooks/dataset.ipynb from #637 is unrelated to this experiment.
  • Baseline arm (instructions enabled, no caveman block) lives on exp/caveman-baseline for direct A/B.
  • Draft PR — not intended to merge; serves as the entry point describing what is being evaluated, per EXPERIMENT.md.

t-prda and others added 3 commits April 20, 2026 17:54
Treatment branch for the caveman token-cost A/B. Two mechanisms wired:

1. Custom instructions (always-on): appends a terse "Response Style" section
   to shared/instructions/<repo>/AGENTS.md in both dataset trees. Harness
   already renames AGENTS.md to CLAUDE.md for Claude Code at run time
   (instruction_operations.py), so this block is unconditionally loaded into
   every session.

2. Skill (reinforcement): new shared/instructions/<repo>/skills/caveman/
   SKILL.md in both trees, copied into testbed .claude/skills/ by
   skills_operations.py when skills.enabled is true.

Config toggles in shared/config.yaml: instructions.enabled and skills.enabled
flipped from false to true.

Only skills/caveman/SKILL.md is included. Upstream's companion skills
(caveman-commit, caveman-help, caveman-review, compress) target commits / PR
reviews / memory files — not exercised by BC-Bench runtime. Hooks,
statusline, and /caveman mode switching are intentionally omitted — we want
"full" intensity fixed for reproducibility, not interactive mode switching.

Upstream: https://github.com/JuliusBrussee/caveman (MIT, LICENSE included).

Trim everything in the shared instruction tree to the bare minimum needed
for the caveman experiment, and shrink the dataset to first-party apps so
the local A/B loop runs in minutes rather than hours.

Instruction tree (microsoft-BCApps + microsoftInternal-NAV):
- AGENTS.md: drop the BC/AL overview prose; keep only the "Response Style"
  caveman block as the canonical instruction file. This is what the
  harness renames to copilot-instructions.md / CLAUDE.md and loads into
  the system prompt every turn (always-on enforcement).
- skills/: drop al-test-generation; keep only skills/caveman/.
- agents/: drop ALTest.agent.md (and the now-empty agents/ dir).
- instructions/ (NAV only): drop codeunits/pages/tables.instructions.md.

Local A/B (microsoft__BCApps-4699, claude-haiku-4.5, n=1) showed the
extra instruction surface area was a net loss — input-side overhead from
loading the skill files outweighed any output-side savings on a short
patch task. caveman-compress was a particularly bad fit since it never
fires during a bug-fix run; dropping it.

Dataset:
- Filter dataset/bcbench.jsonl to entries whose project_paths do NOT
  start with App\Layers\W1\BaseApp. 101 -> 16 entries, all under
  App\Apps\W1\<app> (Shopify, Sustainability, ExcelReports,
  SubscriptionBilling, etc.). BaseApp tasks pull in the entire base
  application and dominate eval wall time.

Tests:
- test_agent_skills.py: switch fixture from "al-test-generation" to
  "caveman" since that is now the only checked-in skill.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
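
The dataset filter described in the commit message above can be sketched as follows. This is an assumption-laden illustration, not the script used for the commit: it assumes `project_paths` is a list of backslash-separated strings and that an entry is dropped if any of its paths starts with the BaseApp prefix. `is_first_party` and `filter_dataset` are hypothetical names.

```python
import json
from pathlib import Path

# Prefix named in the commit message; entries under it are excluded.
BASEAPP_PREFIX = "App\\Layers\\W1\\BaseApp"


def is_first_party(entry: dict) -> bool:
    """Keep an entry only if none of its project paths sit under BaseApp."""
    paths = entry.get("project_paths", [])
    return not any(path.startswith(BASEAPP_PREFIX) for path in paths)


def filter_dataset(src: Path, dst: Path) -> tuple[int, int]:
    """Write the first-party entries of src to dst; return (total, kept).

    src and dst must be different files, since dst is truncated on open.
    """
    total = kept = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            total += 1
            entry = json.loads(line)
            if is_first_party(entry):
                fout.write(json.dumps(entry) + "\n")
                kept += 1
    return total, kept
```

Returning both counts makes it easy to confirm the reduction reported in the commit (101 -> 16 entries) when run against dataset/bcbench.jsonl.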
  # - Claude: copies to repo/.claude/skills/
  skills:
-   enabled: false
+   enabled: true

Review comment (Collaborator):

this is unnecessary when instruction is set to true

- unrelated files should be removed (replace semantics)
"""
skills_source = _get_source_instructions_path("microsoftInternal/NAV") / "skills"
source_skill_dir = skills_source / "al-test-generation"

Review comment (Collaborator):

we should address this in main branch, a real gap in test
