Draft
48 commits
5169543
Add counterfactual dataset (255 entries + problem statements)
Apr 14, 2026
c1b37e5
Add counterfactual experiment notebooks (aligned with main, deps stub…
Apr 14, 2026
7e581c8
Add counterfactual-evaluation category
Apr 14, 2026
6bc0ea2
Add analysis package: family models, aggregator, metrics, annotation
Apr 14, 2026
67a9e97
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
Apr 16, 2026
ea1584a
Split counterfactual-evaluation into per-variant categories (cf-1..cf-4)
Apr 16, 2026
9f036f2
Add cf-1..cf-4 to workflow category dropdowns
Apr 16, 2026
48785c1
Auto-detect counterfactual dataset path from InstanceId
Apr 16, 2026
c28de64
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
Apr 18, 2026
fc6f4e3
Fix load base intances problem
Apr 18, 2026
7059b6f
fix matrix issue
Apr 18, 2026
d005880
fix container problems
Apr 18, 2026
fdf88ea
remove low quality dataset
Apr 18, 2026
51cba47
change validation action for dataset
Apr 18, 2026
48e5ed5
fix patch failed issue
Apr 18, 2026
37a48cf
update cf3 and cf4 datasets
Apr 20, 2026
9118019
remove images
Apr 20, 2026
459f0ad
Merge branch 'main' into thesis/counterfactual-dataset
Jiawen-CS Apr 20, 2026
a704657
fix cf1 and cf2
Apr 20, 2026
a526f91
Merge branch 'thesis/counterfactual-dataset' of https://github.com/mi…
Apr 20, 2026
d96eb2b
impovement for cf3
Apr 20, 2026
61a6569
update dataset
Apr 20, 2026
0e3bb7c
./
Apr 20, 2026
28279c1
Update cf2
Apr 21, 2026
cb62056
./cf2
Apr 21, 2026
05ad4b3
update dataset
Apr 21, 2026
2e26c36
fix: resolve CF COMPILE failures from run 24733495613
Apr 24, 2026
503998f
Fix 3 failing CF-3 counterfactual entries
Apr 24, 2026
0b4048f
fix(dataset): redesign 4 TEST_3b CF entries with real semantic mutations
Apr 24, 2026
c7c1244
update cf2 issue
Apr 24, 2026
79a7eff
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
Apr 24, 2026
ff320f5
fix(dataset): auto-fix 19 CF PATCH hunk headers (run 24890066865)
Apr 24, 2026
fe3d767
Fix 14 CF test failures (11 TEST_3a + 3 TEST_3b)
Apr 24, 2026
ab8716f
Fix 7 failing CF-3 counterfactual entries
Apr 24, 2026
da8f504
fix(dataset): fix 22 CF-1 entries with COMPILE/TEST/PATCH failures
Apr 24, 2026
14c0f40
fix(dataset): combined cf-1 fixes + auto-fix 6 cf-2 PATCH hunk headers
Apr 25, 2026
b8c9b8c
fix(cf-2): fix 14 COMPILE failures in counterfactual dataset
Apr 25, 2026
c1d868e
fix(cf-2): fix 13 TEST_3a failures in counterfactual dataset
Apr 25, 2026
e4305cd
fix(dataset): convert 6 cf-2 entries with list-typed patch/test_patch…
Apr 25, 2026
496394b
fix(dataset): fix 10 CF entries with corrupt/mismatched patches
Apr 26, 2026
86bf0bc
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
Apr 27, 2026
59e7e00
fix(dataset): remove 82 failed CF entries and renumber variants
Apr 28, 2026
40cbd60
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
Apr 28, 2026
7788e43
Merge branch 'main' of https://github.com/microsoft/BC-Bench into the…
May 1, 2026
2dab94d
remove problem_statement_override
May 1, 2026
974bdc9
combine cfs as one
May 3, 2026
b7235e3
remove dataset
May 3, 2026
d83f67a
change readme.md
May 3, 2026
2 changes: 1 addition & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -37,7 +37,7 @@ runs:
# Mask the password in GitHub Actions logs
Write-Output "::add-mask::$password"

"BC_CONTAINER_NAME=bcbench-$("${{ inputs.instance-id }}".Split('-')[1])" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_NAME=bcbench-$("${{ inputs.instance-id }}".Split('-')[1].Split('_')[0])" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_USERNAME=admin" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_PASSWORD=$password" | Out-File -FilePath $env:GITHUB_ENV -Append
shell: pwsh
2 changes: 1 addition & 1 deletion .github/copilot-instructions.md
@@ -16,7 +16,7 @@ This is a benchmark for evaluating coding agents on real-world Business Central
- Uses `pre-commit` for code quality checks (ruff linting/formatting, trailing whitespace, etc.)

## Categories
BC-Bench is category-based and designed to grow over time. It currently has two categories, `bug-fix` and `test-generation`. They share the same dataset tasks and execution-based setup, but use different prompts, expected outputs, and evaluation pipelines. Future categories such as `code-review` can be added within the same overall benchmark structure, though they may require different inputs, setup, or evaluation methods.
BC-Bench is category-based and designed to grow over time. It currently has two execution-based categories that update the public leaderboard, `bug-fix` and `test-generation`, plus a single counterfactual category `cf` (entries from `counterfactual.jsonl`) that reuses the bug-fix pipeline but only saves raw results — its metrics are computed offline in notebooks. Future categories such as `code-review` can be added within the same overall benchmark structure, though they may require different inputs, setup, or evaluation methods.

## Coding Patterns and Guidelines

2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
@@ -41,7 +41,7 @@ jobs:
id: random
shell: pwsh
run: |
$categories = @("bug-fix", "test-generation")
$categories = @("bug-fix", "test-generation", "cf")
$selected = $categories | Get-Random
echo "category=$selected" >> $env:GITHUB_OUTPUT

1 change: 1 addition & 0 deletions .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "cf"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
28 changes: 25 additions & 3 deletions .github/workflows/copilot-evaluation.yml
@@ -29,6 +29,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "cf"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -39,6 +40,11 @@ on:
required: false
default: false
type: boolean
entries-override:
description: "Optional JSON array of entry IDs to run (overrides get-entries). Example: [\"microsoftInternal__NAV-213524__cf-4\"]"
required: false
default: ""
type: string
repeat:
description: "Number of times to run sequentially (ignored for test runs)"
required: false
@@ -60,17 +66,33 @@ env:

jobs:
get-entries:
if: ${{ inputs.entries-override == '' }}
uses: ./.github/workflows/get-entries.yml
with:
test-run: ${{ inputs.test-run }}
category: ${{ inputs.category }}

resolve-entries:
runs-on: ubuntu-latest
needs: get-entries
if: always()
outputs:
entries: ${{ steps.pick.outputs.entries }}
steps:
- id: pick
run: |
if [[ -n "${{ inputs.entries-override }}" ]]; then
echo 'entries=${{ inputs.entries-override }}' >> "$GITHUB_OUTPUT"
else
echo 'entries=${{ needs.get-entries.outputs.entries }}' >> "$GITHUB_OUTPUT"
fi

evaluate-with-copilot-cli:
runs-on: [ GitHub-BCBench ]
needs: get-entries
needs: resolve-entries
outputs:
results-dir: ${{ env.EVALUATION_RESULTS_DIR }}
if: needs.get-entries.outputs.entries != '[]'
if: needs.resolve-entries.outputs.entries != '[]' && needs.resolve-entries.outputs.entries != ''
environment:
name: ado-read
deployment: false
@@ -82,7 +104,7 @@ jobs:
fail-fast: false
max-parallel: 32
matrix:
entry: ${{ fromJson(needs.get-entries.outputs.entries) }}
entry: ${{ fromJson(needs.resolve-entries.outputs.entries) }}
steps:
- name: Checkout repository
uses: actions/checkout@v5
37 changes: 33 additions & 4 deletions .github/workflows/dataset-validation.yml
@@ -10,6 +10,19 @@ on:
required: false
default: true
type: boolean
category:
description: "Dataset category to validate"
required: false
default: "bug-fix"
type: choice
options:
- "bug-fix"
- "cf"
entries-override:
description: "Optional JSON array of entry IDs to validate (overrides get-entries)"
required: false
default: ""
type: string
modified-only:
description: "Only verify modified entries"
required: false
@@ -20,16 +33,32 @@ on:

jobs:
get-entries:
if: ${{ inputs.entries-override == '' || inputs.entries-override == null }}
uses: ./.github/workflows/get-entries.yml
with:
modified-only: ${{ inputs.modified-only || false }}
test-run: ${{ inputs.test-run || false }}
category: "bug-fix"
category: ${{ inputs.category || 'bug-fix' }}

resolve-entries:
runs-on: ubuntu-latest
needs: get-entries
if: always()
outputs:
entries: ${{ steps.pick.outputs.entries }}
steps:
- id: pick
run: |
if [[ -n "${{ inputs.entries-override }}" ]]; then
echo 'entries=${{ inputs.entries-override }}' >> "$GITHUB_OUTPUT"
else
echo 'entries=${{ needs.get-entries.outputs.entries }}' >> "$GITHUB_OUTPUT"
fi

verify-build-and-tests:
runs-on: [GitHub-BCBench]
needs: get-entries
if: needs.get-entries.outputs.entries != '[]'
needs: resolve-entries
if: needs.resolve-entries.outputs.entries != '[]' && needs.resolve-entries.outputs.entries != ''
environment:
name: ado-read
deployment: false
@@ -40,7 +69,7 @@ jobs:
strategy:
fail-fast: false
matrix:
entry: ${{ fromJson(needs.get-entries.outputs.entries) }}
entry: ${{ fromJson(needs.resolve-entries.outputs.entries) }}
steps:
- name: Checkout repository
uses: actions/checkout@v5
7 changes: 6 additions & 1 deletion .github/workflows/summarize-results.yml
@@ -64,13 +64,15 @@ jobs:
retention-days: ${{ inputs.mock && 1 || 30 }}

- name: Azure Login with OIDC for bceval package
if: ${{ inputs.category != 'cf' }}
uses: azure/login@v3
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
allow-no-subscriptions: true

- name: Upload result with bceval to Braintrust
if: ${{ inputs.category != 'cf' }}
env:
BRAINTRUST_PROJECT_ID: ${{ secrets.BRAINTRUST_PROJECT_ID }}
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
@@ -95,7 +97,10 @@ jobs:
--tags "${{ inputs.agent }},${MODEL_TAG},${{ inputs.category }}" ${{ !inputs.mock && '--upload-results' || '' }}

- name: Update leaderboard in a new branch
if: ${{ !inputs.mock }}
# CF runs only save raw results (per-instance result files + summary artifact);
# they do not contribute to the public leaderboard. Metrics for CF are
# computed offline from the raw results in notebooks.
if: ${{ !inputs.mock && inputs.category != 'cf' }}
run: |
git fetch origin main

120 changes: 120 additions & 0 deletions COUNTERFACTUAL.md
@@ -0,0 +1,120 @@
# Counterfactual Evaluation

This document describes the **counterfactual (CF)** category in BC-Bench.

## What Are Counterfactual Entries?

A counterfactual (CF) entry is a **variant** of an existing base bug-fix entry. It reuses the same repository state (`repo`, `base_commit`, `project_paths`) but provides a **different fix and test pair** — testing whether an agent can solve a related-but-different version of the same bug.

Each CF entry lives in [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) and references a base entry from [`dataset/bcbench.jsonl`](dataset/bcbench.jsonl).

### Validation Contract

CF entries follow the **exact same** validation contract as bug-fix:

- **Before** applying the patch → `FAIL_TO_PASS` tests **FAIL**
- **After** applying the patch → `FAIL_TO_PASS` tests **PASS**

This means the same `BugFixPipeline` is reused for evaluation.
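
As a rough illustration of the contract, the check can be written as a small predicate over test outcomes. This is a minimal sketch, not the actual `BugFixPipeline` logic: the function name `satisfies_contract` and the `before`/`after` outcome maps are hypothetical, and the `PASS_TO_PASS` handling follows the field description in the schema (tests that must pass both before and after).

```python
from typing import Iterable, Mapping


def satisfies_contract(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    before: Mapping[str, bool],
    after: Mapping[str, bool],
) -> bool:
    """Check the FAIL -> PASS contract for one entry.

    `before` and `after` map a test name to True when that test passed
    without and with the patch applied. Both maps are hypothetical inputs;
    a test missing from a map is treated as failing (an assumption).
    """
    # FAIL_TO_PASS: must fail before the patch and pass after it.
    f2p_ok = all(
        not before.get(test, False) and after.get(test, False)
        for test in fail_to_pass
    )
    # PASS_TO_PASS: must pass both before and after the patch.
    p2p_ok = all(
        before.get(test, False) and after.get(test, False)
        for test in pass_to_pass
    )
    return f2p_ok and p2p_ok
```

Treating a missing test as a failure is a deliberate choice here: a test introduced by `test_patch` may not run at all before the fix, which should still count as "fails before".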

### Naming Convention

```
microsoftInternal__NAV-210528 ← base entry (in bcbench.jsonl)
microsoftInternal__NAV-210528__cf-1 ← first counterfactual variant
microsoftInternal__NAV-210528__cf-2 ← second variant
```
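
The convention above can be parsed mechanically. The following is a minimal sketch assuming the suffix is always `__cf-<N>` with a decimal `N`; `parse_instance_id` and `ParsedId` are hypothetical names and are not part of the `bcbench` package.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the naming examples above (an assumption).
_CF_SUFFIX = re.compile(r"^(?P<base>.+)__cf-(?P<n>\d+)$")


class ParsedId(NamedTuple):
    base_id: str
    variant: Optional[int]  # None when the ID refers to a base entry


def parse_instance_id(instance_id: str) -> ParsedId:
    """Split an instance ID into its base ID and CF variant number."""
    match = _CF_SUFFIX.match(instance_id)
    if match is None:
        # No __cf-<N> suffix: treat the ID as a base entry.
        return ParsedId(instance_id, None)
    return ParsedId(match.group("base"), int(match.group("n")))


if __name__ == "__main__":
    print(parse_instance_id("microsoftInternal__NAV-210528__cf-2"))
    print(parse_instance_id("microsoftInternal__NAV-210528"))
```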

## The `cf` Category

All CF entries are exposed under a single category named `cf`. The base bug-fix
evaluation pipeline is reused as-is. CF runs in CI **only save raw results**
(per-instance result JSONL files + an aggregated `evaluation_summary.json`
artifact); they do not push to the public leaderboard or upload to Braintrust.
Downstream metrics (resolution rate, family fragility, severity, stability,
etc.) are computed offline from the raw results in the analysis notebooks.

Variant numbering still lives in the instance ID (`<base_id>__cf-<N>`), so any
per-variant analysis remains possible by grouping on the `__cf-N` suffix.
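
Grouping on the suffix might look like the sketch below. It assumes each raw result row is a dict with an `instance_id` and a boolean `resolved` field; both field names and the helper names are illustrative assumptions, not the actual `BugFixResult` schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def variant_of(instance_id: str) -> str:
    """Return the 'cf-N' suffix of an instance ID, or 'base' if there is none."""
    _, sep, suffix = instance_id.rpartition("__cf-")
    return f"cf-{suffix}" if sep and suffix.isdigit() else "base"


def resolution_rate_by_variant(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Compute the fraction of resolved rows per variant suffix."""
    counts = defaultdict(lambda: [0, 0])  # variant -> [resolved, total]
    for row in rows:
        bucket = counts[variant_of(row["instance_id"])]
        bucket[0] += 1 if row["resolved"] else 0
        bucket[1] += 1
    return {variant: done / total for variant, (done, total) in counts.items()}
```

The same grouping key can feed any other per-variant metric computed in the notebooks.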

All CF entries share:
- **Dataset file**: `counterfactual.jsonl`
- **Entry class**: `CounterfactualEntry`
- **Pipeline**: `BugFixPipeline` (reused)
- **Result class**: `BugFixResult` (reused)
- **Prompt template**: `counterfactual-template`

## CF Entry Schema

Each line in `counterfactual.jsonl` contains:

| Field | Description |
| ---------------------------- | ------------------------------------------------- |
| `instance_id` | `<base_id>__cf-<N>` — unique identifier |
| `base_instance_id` | ID of the base entry this variant is derived from |
| `variant_description` | Human-readable description of the variant |
| `failure_layer` | Optional failure layer classification |
| `problem_statement_override` | Path to the CF-specific problem statement |
| `patch` | The counterfactual fix patch |
| `test_patch` | The counterfactual test patch |
| `FAIL_TO_PASS` | Tests that must fail before fix, pass after |
| `PASS_TO_PASS` | Tests that must pass both before and after |

**Note:** Fields like `repo`, `base_commit`, `project_paths`, `environment_setup_version`, and `created_at` are **not stored** in CF entries. They are resolved at load time from the base entry in `bcbench.jsonl`.
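
The load-time resolution described in the note can be pictured with plain dicts. This is only a sketch under stated assumptions: the real logic lives in `CounterfactualEntry`, which is not shown here, and `resolve_cf_entry` and `INHERITED_FIELDS` are hypothetical names.

```python
from typing import Dict, Mapping

# Fields the note above says are taken from the base entry.
INHERITED_FIELDS = (
    "repo",
    "base_commit",
    "project_paths",
    "environment_setup_version",
    "created_at",
)


def resolve_cf_entry(cf_entry: Mapping, base_entries: Mapping[str, Mapping]) -> Dict:
    """Merge a CF entry with the fields inherited from its base entry.

    `base_entries` maps a base instance ID to its parsed bcbench.jsonl row.
    """
    base = base_entries[cf_entry["base_instance_id"]]
    resolved = {field: base[field] for field in INHERITED_FIELDS}
    # CF-specific fields (patch, test_patch, FAIL_TO_PASS, ...) are kept as-is.
    resolved.update(cf_entry)
    return resolved
```

A missing base ID raises `KeyError` in this sketch, so a CF entry that points at a nonexistent base entry fails at load time.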

## Architecture

The counterfactual category integrates into BC-Bench's polymorphic category system:

| Extension Point | Value |
|---|---|
| `EvaluationCategory` | `CF = "cf"` |
| `is_counterfactual` | `True` only for `CF` |
| `prompt_template_key` | `"counterfactual"` |
| `dataset_path` | `dataset/counterfactual.jsonl` |
| `entry_class` | `CounterfactualEntry` (resolves base fields at load time) |
| `pipeline` | `BugFixPipeline` (reused — same FAIL→PASS contract) |
| `result_class` | `BugFixResult` (reused) |
| `summary_class` | `ExecutionBasedEvaluationResultSummary` (reused) |

### Key File Reference

| File | Purpose |
|---|---|
| [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) | All CF entries (one JSON per line) |
| [`dataset/problemstatement/<id>/`](dataset/problemstatement/) | Problem statement for each CF entry |
| [`src/bcbench/dataset/counterfactual_entry.py`](src/bcbench/dataset/counterfactual_entry.py) | CF entry model with base resolution |
| [`src/bcbench/types.py`](src/bcbench/types.py) | Category registration (`CF`) |

## CLI Usage

```bash
# List all CF entries
uv run bcbench dataset list --category cf

# List a few random CF entries (test run)
uv run bcbench dataset list --category cf --test-run

# View a specific CF entry
uv run bcbench dataset view microsoftInternal__NAV-210528__cf-1 --category cf

# Run agent on a CF entry
uv run bcbench run copilot microsoftInternal__NAV-210528__cf-1 \
--category cf \
--repo-path /path/to/NAV

# Full evaluation (build + test)
uv run bcbench evaluate copilot microsoftInternal__NAV-210528__cf-1 \
--category cf \
--repo-path /path/to/NAV
```

## Analysis Notebooks

Experiment notebooks for analyzing CF results are in [`notebooks/counterfactual-evaluation/`](notebooks/counterfactual-evaluation/):

| Notebook | Purpose |
|---|---|
| `experiment1-base-performance.ipynb` | Instance-level compile/pass rates per model |
| `experiment2-cf-sensitivity.ipynb` | Family fragility rate, severity, pattern analysis |
| `experiment3-layered-failure.ipynb` | L1-L5 failure distribution, layer-conditioned fragility |