45 commits
db388bd
few more updates for new categories
haoranpb Apr 8, 2026
57c004e
Refactor evaluation and dataset operations for improved workspace setup
haoranpb Apr 9, 2026
8e2f216
enable skipping container setup in action
haoranpb Apr 9, 2026
69a8db8
fix missing implementation for MockEvaluationPipeline
haoranpb Apr 9, 2026
7549d92
Refactor evaluation result classes to be more generic
haoranpb Apr 11, 2026
f32dd00
Merge branch 'main' into fix/more-ready-for-categories
haoranpb Apr 12, 2026
a4089b9
Improve readability of GitHub Action summary
haoranpb Apr 12, 2026
99af6b2
fix failing tests
haoranpb Apr 12, 2026
e1b0b93
Code Review POC
haoranpb Apr 12, 2026
1a68d78
Merge branch 'main' into category/code-review
haoranpb Apr 13, 2026
3ec10a0
fix merge conflict resolution mistake
haoranpb Apr 13, 2026
4e52832
Merge branch 'main' into category/code-review
haoranpb Apr 13, 2026
a9f59d9
Make container parameters optional in evaluate and run commands
haoranpb Apr 13, 2026
065e1aa
Merge branch 'category/code-review' of https://github.com/microsoft/B…
haoranpb Apr 13, 2026
4ad4bd9
Enhance code review functionality by adding expected review comments …
haoranpb Apr 13, 2026
92951c4
better handling of containers for categories that do not require them
haoranpb Apr 13, 2026
7902610
Merge branch 'main' into category/code-review
haoranpb Apr 20, 2026
dad9289
Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…
haoranpb May 5, 2026
f1c4894
Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…
haoranpb May 11, 2026
aa48a29
prefer copilot.exe executable
haoranpb May 12, 2026
a244503
Normalize code-review dataset and preserve eval outputs
WaelAbuSeada May 16, 2026
9f6c353
Fix code-review branch setup and workflow wiring
WaelAbuSeada May 20, 2026
1a58e44
Require review.json and add log-based recovery fallback
WaelAbuSeada May 20, 2026
d0e8076
Harden code-review prompt for Windows copilot.cmd parsing
WaelAbuSeada May 20, 2026
0b764ef
Experiment: use al-code-review skill template
WaelAbuSeada May 20, 2026
83a4b28
Add skip-container-setup option to evaluation workflows
WaelAbuSeada May 20, 2026
394d005
Fix codereview lint issues in pipeline helpers
WaelAbuSeada May 20, 2026
ae0f1d2
Revert "Fix codereview lint issues in pipeline helpers"
WaelAbuSeada May 20, 2026
143075b
Expand code-review detailed table metrics
WaelAbuSeada May 21, 2026
7411246
Expand code-review detailed table metrics
WaelAbuSeada May 21, 2026
0d6e7ad
Update config and container setup action
WaelAbuSeada May 21, 2026
c7131a4
Update config and container setup action
WaelAbuSeada May 21, 2026
213ce7f
Remove unused apply_patch import from code-review evaluate
WaelAbuSeada May 21, 2026
2e1ced0
Refactor code-review metrics into pipeline and split comment display …
WaelAbuSeada May 21, 2026
b9babe6
Merge category/code-review into experiment/code-review-al-skill
WaelAbuSeada May 21, 2026
558d8ad
Normalize code-review test-run instance IDs to valid pattern
WaelAbuSeada May 21, 2026
be4ccd9
Use plain code-review IDs (security_001 style) and relax ID pattern
WaelAbuSeada May 21, 2026
9b4e5b1
Revert instance_id regex to original strict pattern
WaelAbuSeada May 21, 2026
32e499b
Rename code-review test IDs to strict non-vsoadmin format
WaelAbuSeada May 21, 2026
54c618f
fix: add dataset-path input to setup-bc-container action
WaelAbuSeada May 21, 2026
05673bc
feat: add precision and recall to detailed results table
WaelAbuSeada May 21, 2026
db9c805
fix: apply pre-commit lint and typing fixes
WaelAbuSeada May 21, 2026
7b6f871
chore: remove UI instruction file
WaelAbuSeada May 21, 2026
b05a635
fix: review code changes from applied entry patch
WaelAbuSeada May 21, 2026
0e11f9d
fix: support simplified code-review patch materialization
WaelAbuSeada May 21, 2026
6 changes: 5 additions & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -14,6 +14,10 @@ inputs:
github-token:
description: GitHub token for accessing public repositories
required: true
dataset-path:
description: Path to the dataset file
required: false
default: "dataset/bcbench.jsonl"
skip-container:
description: Skip BC container setup (only clone repository)
required: false
@@ -66,5 +70,5 @@ runs:
$env:ADO_TOKEN = az account get-access-token --resource "499b84ac-1321-427f-aa17-267ca6975798" --query accessToken -o tsv
Write-Output "::add-mask::$env:ADO_TOKEN"

.\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
.\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" -DatasetPath "${{ inputs.dataset-path }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
shell: pwsh
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
@@ -41,7 +41,7 @@ jobs:
id: random
shell: pwsh
run: |
$categories = @("bug-fix", "test-generation")
$categories = @("bug-fix", "test-generation", "code-review")
$selected = $categories | Get-Random
echo "category=$selected" >> $env:GITHUB_OUTPUT
10 changes: 9 additions & 1 deletion .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "code-review"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -33,6 +34,11 @@ on:
required: false
default: false
type: boolean
skip-container-setup:
description: "Skip BC container setup (repository setup still runs)"
required: false
default: false
type: boolean
repeat:
description: "Number of times to run sequentially (ignored for test runs)"
required: false
@@ -90,6 +96,8 @@ jobs:
azure-client-id: ${{ secrets.AZURE_CLIENT_ID }}
azure-tenant-id: ${{ secrets.AZURE_TENANT_ID }}
github-token: ${{ secrets.GITHUB_TOKEN }}
dataset-path: ${{ inputs.category == 'code-review' && 'dataset/codereview.jsonl' || 'dataset/bcbench.jsonl' }}
skip-container: ${{ inputs.category == 'code-review' || inputs.skip-container-setup }}

- name: Setup Python with UV
uses: ./.github/actions/setup-python-uv
@@ -154,4 +162,4 @@ jobs:
workflow-file: claude-evaluation.yml
repeat: ${{ inputs.repeat }}
workflow-inputs: |
{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}"}
{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}", "skip-container-setup": "${{ inputs.skip-container-setup }}"}
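The two expressions passed to setup-bc-container above use the GitHub Actions `cond && a || b` idiom as a ternary and a plain `||` as a logical OR. A minimal Python sketch of the values they appear to resolve to is shown here; the function name `resolve_setup_inputs` is hypothetical and exists only for illustration, it is not part of the repository.

```python
def resolve_setup_inputs(category: str, skip_container_setup: bool = False) -> dict:
    """Mirror the dataset-path and skip-container expressions from the workflow.

    Hypothetical helper for illustration; the workflow evaluates these
    expressions inline rather than calling any function.
    """
    is_code_review = category == "code-review"
    return {
        # inputs.category == 'code-review' && 'dataset/codereview.jsonl' || 'dataset/bcbench.jsonl'
        "dataset-path": "dataset/codereview.jsonl" if is_code_review else "dataset/bcbench.jsonl",
        # inputs.category == 'code-review' || inputs.skip-container-setup
        "skip-container": is_code_review or skip_container_setup,
    }


if __name__ == "__main__":
    for category in ("bug-fix", "test-generation", "code-review"):
        print(category, resolve_setup_inputs(category))
```

Under this reading, code-review runs always skip container setup, while the other categories skip it only when skip-container-setup is set.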
10 changes: 9 additions & 1 deletion .github/workflows/copilot-evaluation.yml
@@ -30,6 +30,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "code-review"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -40,6 +41,11 @@ on:
required: false
default: false
type: boolean
skip-container-setup:
description: "Skip BC container setup (repository setup still runs)"
required: false
default: false
type: boolean
repeat:
description: "Number of times to run sequentially (ignored for test runs)"
required: false
@@ -97,6 +103,8 @@ jobs:
azure-client-id: ${{ secrets.AZURE_CLIENT_ID }}
azure-tenant-id: ${{ secrets.AZURE_TENANT_ID }}
github-token: ${{ secrets.GITHUB_TOKEN }}
dataset-path: ${{ inputs.category == 'code-review' && 'dataset/codereview.jsonl' || 'dataset/bcbench.jsonl' }}
skip-container: ${{ inputs.category == 'code-review' || inputs.skip-container-setup }}

- name: Setup Python with UV
uses: ./.github/actions/setup-python-uv
@@ -170,4 +178,4 @@ jobs:
workflow-file: copilot-evaluation.yml
repeat: ${{ inputs.repeat }}
workflow-inputs: |
{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}"}
{"model": "${{ inputs.model }}", "category": "${{ inputs.category }}", "test-run": "${{ inputs.test-run }}", "al-mcp": "${{ inputs.al-mcp }}", "skip-container-setup": "${{ inputs.skip-container-setup }}"}
10 changes: 9 additions & 1 deletion .github/workflows/summarize-results.yml
@@ -75,6 +75,14 @@ jobs:
BRAINTRUST_PROJECT_ID: ${{ secrets.BRAINTRUST_PROJECT_ID }}
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
run: |
if [ "${{ inputs.category }}" = "code-review" ]; then
EVALUATORS="precision_score,recall_score,f1_score,valid_review_output"
elif [ "${{ inputs.category }}" = "test-generation" ]; then
EVALUATORS="resolution_rate,build_rate,pre_patch_failed_rate,post_patch_passed_rate"
else
EVALUATORS="resolution_rate,build_rate"
fi

# Get Azure DevOps access token from Azure CLI (uses the OIDC token from azure/login)
ADO_TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
echo "::add-mask::$ADO_TOKEN"
@@ -88,7 +96,7 @@ jobs:
bceval metrics calculate \
--input-file "${{ inputs.results-dir }}/${{ github.run_id }}/${{ env.BCEVAL_RESULT_FILE }}" \
--evaluator-definitions "${{ github.workspace }}/evaluator/scores.py" \
--evaluators "resolution_rate,build_rate${{ inputs.category == 'test-generation' && ',pre_patch_failed_rate,post_patch_passed_rate' || '' }}" \
--evaluators "${EVALUATORS}" \
--metric-definitions "${{ github.workspace }}/evaluator/metrics.py" \
--metrics "bc_bench_metrics" \
--eval-run-name "${{ inputs.agent }} (${{ inputs.model }}) - #${{ github.run_id }}" \
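The shell branch added to summarize-results.yml picks the evaluator list by category, with bug-fix (and any other value) falling through to the default. The same mapping can be sketched as a lookup table; the names `EVALUATORS_BY_CATEGORY`, `DEFAULT_EVALUATORS`, and `select_evaluators` are assumptions made for this sketch, while the evaluator names themselves are copied from the diff.

```python
# Evaluator names are taken from the workflow; the table and function are illustrative.
DEFAULT_EVALUATORS = ["resolution_rate", "build_rate"]

EVALUATORS_BY_CATEGORY = {
    "code-review": ["precision_score", "recall_score", "f1_score", "valid_review_output"],
    "test-generation": DEFAULT_EVALUATORS + ["pre_patch_failed_rate", "post_patch_passed_rate"],
}


def select_evaluators(category: str) -> str:
    """Return the comma-separated value that would be passed to --evaluators."""
    return ",".join(EVALUATORS_BY_CATEGORY.get(category, DEFAULT_EVALUATORS))


if __name__ == "__main__":
    print(select_evaluators("code-review"))
```

A table keeps each category's list in one place, which may be easier to extend than the inline expression that the workflow used before this change.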
82 changes: 82 additions & 0 deletions dataset/codereview.jsonl

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/_data/code-review.json
@@ -0,0 +1,4 @@
{
"runs": [],
"aggregate": []
}
20 changes: 20 additions & 0 deletions evaluator/scores.py
@@ -19,3 +19,23 @@ def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
class PostPatchPassedRate:
def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
return metadata.get("post_patch_passed", False)


class PrecisionScore:
def __call__(self, *, metadata: dict, **kwargs: object) -> float:
return float(metadata.get("precision", 0.0))


class RecallScore:
def __call__(self, *, metadata: dict, **kwargs: object) -> float:
return float(metadata.get("recall", 0.0))


class F1Score:
def __call__(self, *, metadata: dict, **kwargs: object) -> float:
return float(metadata.get("f1", 0.0))


class ValidReviewOutput:
def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
return bool(metadata.get("valid_review_output", False))
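The new scorers only read `precision`, `recall`, and `f1` from the result metadata; this diff does not show how those values are produced. The sketch below is one plausible way to derive them from counts of matched, predicted, and expected review comments. The function `review_metrics`, its parameters, and the choice to return 0.0 when a denominator is zero are all assumptions, not the pipeline's actual matching logic.

```python
def review_metrics(matched: int, predicted: int, expected: int) -> dict:
    """Compute precision, recall and F1 from comment counts (illustrative only).

    matched   -- predicted comments that correspond to an expected comment
    predicted -- total comments the agent produced
    expected  -- total comments in the reference review
    """
    precision = matched / predicted if predicted else 0.0
    recall = matched / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    metadata = review_metrics(matched=2, predicted=4, expected=2)
    # The scorers in evaluator/scores.py would then read these keys, e.g.:
    print(float(metadata.get("precision", 0.0)))
```

Because the scorers default to 0.0 when a key is missing, an entry without a valid review output would score zero on all three metrics.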
4 changes: 3 additions & 1 deletion src/bcbench/agent/copilot/agent.py
@@ -49,7 +49,9 @@ def run_copilot_agent(
logger.info(f"Executing Copilot CLI in directory: {repo_path}")
logger.debug(f"Using prompt:\n{prompt}")

copilot_cmd = shutil.which("copilot.cmd") or shutil.which("copilot")
# Prefer copilot.exe over copilot.bat/copilot.cmd shims on Windows: the .bat shim invokes PowerShell,
# which re-parses arguments and corrupts prompts containing double quotes (e.g. JSON examples).
copilot_cmd = shutil.which("copilot.exe") or shutil.which("copilot.cmd") or shutil.which("copilot")
if not copilot_cmd:
raise AgentError("Copilot CLI not found in PATH. Please ensure it is installed and available.")

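The change above resolves the Copilot CLI by trying copilot.exe first, then copilot.cmd, then copilot. The same ordered fallback can be written as a small loop, sketched here with an injectable `which` so the ordering can be checked without a real installation. The helper name `resolve_cli` and its parameters are hypothetical; only `shutil.which` and the candidate names come from the diff.

```python
import shutil
from typing import Callable, Optional, Sequence


def resolve_cli(
    candidates: Sequence[str] = ("copilot.exe", "copilot.cmd", "copilot"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    # Simulate a machine where both the shim and the native executable exist.
    fake_path = {"copilot.cmd": "C:/tools/copilot.cmd", "copilot.exe": "C:/tools/copilot.exe"}
    print(resolve_cli(which=fake_path.get))
```

With both entries present, the native executable is chosen, which matches the preference the code comment describes.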
11 changes: 9 additions & 2 deletions src/bcbench/agent/shared/config.yaml
@@ -54,6 +54,13 @@ prompt:
{{task}}
{% endif %}

code-review-template: |
@al-code-review

Review the current branch changes and return findings using the al-code-review skill schema.
Save the findings JSON to a file named review.json in the repository root.
If there are no findings, write an empty findings list.

# controls:
# 1. whether to copy custom instructions from `src/bcbench/agent/shared/instructions/<sanitized-repo>/`
# - Copilot: copies to repo/.github/ and renames AGENTS.md to copilot-instructions.md
@@ -63,14 +70,14 @@ prompt:
# NOTE: the canonical source file is AGENTS.md; it is automatically renamed
# to the agent-specific filename (AgentType.instruction_filename) during setup
instructions:
enabled: false
enabled: true

# controls:
# 1. whether to copy skills from `src/bcbench/agent/shared/instructions/<sanitized-repo>/skills/`
# - Copilot: copies to repo/.github/skills/
# - Claude: copies to repo/.claude/skills/
skills:
enabled: false
enabled: true

# controls:
# 1. whether to copy custom agents from `src/bcbench/agent/shared/instructions/<sanitized-repo>/agents/`
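The code-review-template in this file asks the agent to save its findings to review.json in the repository root, with an empty findings list when nothing is found. A consumer of that file might validate it roughly as follows. This is a sketch under a stated assumption: the top-level key `findings` is a guess, since the al-code-review skill schema is not shown in this diff, and `load_findings` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_findings(repo_root: Path) -> tuple:
    """Return (is_valid, findings) for review.json under repo_root.

    Assumes a JSON object with a top-level "findings" list; the real schema
    is defined by the al-code-review skill and may differ.
    """
    review_file = repo_root / "review.json"
    try:
        data = json.loads(review_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or malformed file: treat as invalid output.
        return False, []
    findings = data.get("findings") if isinstance(data, dict) else None
    if not isinstance(findings, list):
        return False, []
    return True, findings


if __name__ == "__main__":
    print(load_findings(Path(".")))
```

Separating "valid but empty" from "missing or malformed" lines up with the valid_review_output scorer, which is reported alongside precision, recall, and F1.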