39 commits
db388bd
few more updates for new categories
haoranpb Apr 8, 2026
57c004e
Refactor evaluation and dataset operations for improved workspace setup
haoranpb Apr 9, 2026
8e2f216
enable skipping container setup in action
haoranpb Apr 9, 2026
69a8db8
fix missing implementation for MockEvaluationPipeline
haoranpb Apr 9, 2026
7549d92
Refactor evaluation result classes to be more generic
haoranpb Apr 11, 2026
f32dd00
Merge branch 'main' into fix/more-ready-for-categories
haoranpb Apr 12, 2026
a4089b9
Improve readability of GitHub Action summary
haoranpb Apr 12, 2026
99af6b2
fix failing tests
haoranpb Apr 12, 2026
e1b0b93
Code Review POC
haoranpb Apr 12, 2026
1a68d78
Merge branch 'main' into category/code-review
haoranpb Apr 13, 2026
3ec10a0
fix merge conflict resolution mistake
haoranpb Apr 13, 2026
4e52832
Merge branch 'main' into category/code-review
haoranpb Apr 13, 2026
a9f59d9
Make container parameters optional in evaluate and run commands
haoranpb Apr 13, 2026
065e1aa
Merge branch 'category/code-review' of https://github.com/microsoft/B…
haoranpb Apr 13, 2026
4ad4bd9
Enhance code review functionality by adding expected review comments …
haoranpb Apr 13, 2026
92951c4
better handling of containers for categories that do not require them
haoranpb Apr 13, 2026
7902610
Merge branch 'main' into category/code-review
haoranpb Apr 20, 2026
dad9289
Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…
haoranpb May 5, 2026
f1c4894
Merge branch 'main' of https://github.com/microsoft/BC-Bench into cat…
haoranpb May 11, 2026
aa48a29
prefer copilot.exe executable
haoranpb May 12, 2026
a244503
Normalize code-review dataset and preserve eval outputs
WaelAbuSeada May 16, 2026
9f6c353
Fix code-review branch setup and workflow wiring
WaelAbuSeada May 20, 2026
1a58e44
Require review.json and add log-based recovery fallback
WaelAbuSeada May 20, 2026
d0e8076
Harden code-review prompt for Windows copilot.cmd parsing
WaelAbuSeada May 20, 2026
7411246
Expand code-review detailed table metrics
WaelAbuSeada May 21, 2026
c7131a4
Update config and container setup action
WaelAbuSeada May 21, 2026
213ce7f
Remove unused apply_patch import from code-review evaluate
WaelAbuSeada May 21, 2026
2e1ced0
Refactor code-review metrics into pipeline and split comment display …
WaelAbuSeada May 21, 2026
3ff6876
Normalize code-review test-run instance IDs to valid pattern
WaelAbuSeada May 21, 2026
6e68751
Use plain code-review IDs (security_001 style) and relax ID pattern
WaelAbuSeada May 21, 2026
d691d26
Revert instance_id regex to original strict pattern
WaelAbuSeada May 21, 2026
4c7e03c
Rename code-review test IDs to strict non-vsoadmin format
WaelAbuSeada May 21, 2026
85503e8
fix: add dataset-path input to setup-bc-container action
WaelAbuSeada May 21, 2026
06ee0b9
feat: add precision and recall to detailed results table
WaelAbuSeada May 21, 2026
f5ffe80
fix: apply pre-commit lint and typing fixes
WaelAbuSeada May 21, 2026
8835a18
chore: remove UI instruction file
WaelAbuSeada May 21, 2026
aabee80
fix: review code changes from applied entry patch
WaelAbuSeada May 21, 2026
1f7a5bd
fix: support simplified code-review patch materialization
WaelAbuSeada May 21, 2026
6777610
fix: tighten code-review diff and parsing behavior
WaelAbuSeada May 21, 2026
6 changes: 5 additions & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -14,6 +14,10 @@ inputs:
github-token:
description: GitHub token for accessing public repositories
required: true
dataset-path:
description: Path to the dataset file
required: false
default: "dataset/bcbench.jsonl"
skip-container:
description: Skip BC container setup (only clone repository)
required: false
@@ -66,5 +70,5 @@ runs:
$env:ADO_TOKEN = az account get-access-token --resource "499b84ac-1321-427f-aa17-267ca6975798" --query accessToken -o tsv
Write-Output "::add-mask::$env:ADO_TOKEN"

-.\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
+.\scripts\Setup-ContainerAndRepository.ps1 -InstanceId "${{ inputs.instance-id }}" -DatasetPath "${{ inputs.dataset-path }}" ${{ inputs.skip-container == 'true' && '-SkipContainer' || '' }}
shell: pwsh
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
@@ -41,7 +41,7 @@ jobs:
id: random
shell: pwsh
run: |
-$categories = @("bug-fix", "test-generation")
+$categories = @("bug-fix", "test-generation", "code-review")
$selected = $categories | Get-Random
echo "category=$selected" >> $env:GITHUB_OUTPUT
3 changes: 3 additions & 0 deletions .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "code-review"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -90,6 +91,8 @@ jobs:
azure-client-id: ${{ secrets.AZURE_CLIENT_ID }}
azure-tenant-id: ${{ secrets.AZURE_TENANT_ID }}
github-token: ${{ secrets.GITHUB_TOKEN }}
dataset-path: ${{ inputs.category == 'code-review' && 'dataset/codereview.jsonl' || 'dataset/bcbench.jsonl' }}
skip-container: ${{ inputs.category == 'code-review' }}

- name: Setup Python with UV
uses: ./.github/actions/setup-python-uv
3 changes: 3 additions & 0 deletions .github/workflows/copilot-evaluation.yml
@@ -30,6 +30,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "code-review"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
@@ -97,6 +98,8 @@ jobs:
azure-client-id: ${{ secrets.AZURE_CLIENT_ID }}
azure-tenant-id: ${{ secrets.AZURE_TENANT_ID }}
github-token: ${{ secrets.GITHUB_TOKEN }}
dataset-path: ${{ inputs.category == 'code-review' && 'dataset/codereview.jsonl' || 'dataset/bcbench.jsonl' }}
skip-container: ${{ inputs.category == 'code-review' }}

- name: Setup Python with UV
uses: ./.github/actions/setup-python-uv
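The `dataset-path` and `skip-container` expressions added to both evaluation workflows use the GitHub Actions `cond && a || b` idiom, which acts like a ternary as long as `a` is truthy. A minimal Python sketch of the same selection logic follows. The `SetupInputs` type and the `resolve_setup_inputs` name are hypothetical and exist only for illustration; the category name and the two dataset paths are taken from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupInputs:
    """Hypothetical container for the two values the workflows pass to the setup action."""

    dataset_path: str
    skip_container: bool


def resolve_setup_inputs(category: str) -> SetupInputs:
    """Mirror the workflow expressions: code-review reads its own dataset and skips the BC container."""
    is_code_review = category == "code-review"
    dataset_path = "dataset/codereview.jsonl" if is_code_review else "dataset/bcbench.jsonl"
    return SetupInputs(dataset_path=dataset_path, skip_container=is_code_review)


if __name__ == "__main__":
    for name in ("bug-fix", "test-generation", "code-review"):
        print(name, resolve_setup_inputs(name))
```

Every category other than `code-review` falls through to the default dataset and keeps the container, which matches the `required: false` default declared in `action.yml`.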
10 changes: 9 additions & 1 deletion .github/workflows/summarize-results.yml
@@ -75,6 +75,14 @@ jobs:
BRAINTRUST_PROJECT_ID: ${{ secrets.BRAINTRUST_PROJECT_ID }}
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
run: |
if [ "${{ inputs.category }}" = "code-review" ]; then
EVALUATORS="precision_score,recall_score,f1_score,valid_review_output"
elif [ "${{ inputs.category }}" = "test-generation" ]; then
EVALUATORS="resolution_rate,build_rate,pre_patch_failed_rate,post_patch_passed_rate"
else
EVALUATORS="resolution_rate,build_rate"
fi

# Get Azure DevOps access token from Azure CLI (uses the OIDC token from azure/login)
ADO_TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
echo "::add-mask::$ADO_TOKEN"
@@ -88,7 +96,7 @@ jobs:
bceval metrics calculate \
--input-file "${{ inputs.results-dir }}/${{ github.run_id }}/${{ env.BCEVAL_RESULT_FILE }}" \
--evaluator-definitions "${{ github.workspace }}/evaluator/scores.py" \
---evaluators "resolution_rate,build_rate${{ inputs.category == 'test-generation' && ',pre_patch_failed_rate,post_patch_passed_rate' || '' }}" \
+--evaluators "${EVALUATORS}" \
--metric-definitions "${{ github.workspace }}/evaluator/metrics.py" \
--metrics "bc_bench_metrics" \
--eval-run-name "${{ inputs.agent }} (${{ inputs.model }}) - #${{ github.run_id }}" \
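The shell `if`/`elif`/`else` above replaces the inline expression with a per-category evaluator list. As a rough sketch of the same mapping in Python, the lookup below returns the comma-separated string that ends up in `--evaluators`. The `evaluators_for` helper and the dictionary name are hypothetical; the evaluator names and the fallback for unknown categories are copied from the workflow change.

```python
# Evaluator names per category, as listed in summarize-results.yml.
EVALUATORS_BY_CATEGORY = {
    "code-review": ["precision_score", "recall_score", "f1_score", "valid_review_output"],
    "test-generation": [
        "resolution_rate",
        "build_rate",
        "pre_patch_failed_rate",
        "post_patch_passed_rate",
    ],
}

# The workflow's else branch: any other category (for example bug-fix) gets these two.
DEFAULT_EVALUATORS = ["resolution_rate", "build_rate"]


def evaluators_for(category: str) -> str:
    """Return the comma-separated evaluator list for a category (hypothetical helper)."""
    return ",".join(EVALUATORS_BY_CATEGORY.get(category, DEFAULT_EVALUATORS))


if __name__ == "__main__":
    print(evaluators_for("code-review"))
```

A table like this keeps the category-to-evaluator mapping in one place, which may be easier to extend than nested workflow expressions when further categories are added.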
82 changes: 82 additions & 0 deletions dataset/codereview.jsonl

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/_data/code-review.json
@@ -0,0 +1,4 @@
{
"runs": [],
"aggregate": []
}
20 changes: 20 additions & 0 deletions evaluator/scores.py
@@ -19,3 +19,23 @@ def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
class PostPatchPassedRate:
    def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
        return metadata.get("post_patch_passed", False)


class PrecisionScore:
    def __call__(self, *, metadata: dict, **kwargs: object) -> float:
        return float(metadata.get("precision", 0.0))


class RecallScore:
    def __call__(self, *, metadata: dict, **kwargs: object) -> float:
        return float(metadata.get("recall", 0.0))


class F1Score:
    def __call__(self, *, metadata: dict, **kwargs: object) -> float:
        return float(metadata.get("f1", 0.0))


class ValidReviewOutput:
    def __call__(self, *, metadata: dict, **kwargs: object) -> bool:
        return bool(metadata.get("valid_review_output", False))
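The new evaluators only read `precision`, `recall`, and `f1` from `metadata`; this diff does not show how the pipeline fills those keys. Assuming the standard definitions over matched review comments, the values could be produced as sketched below. The `review_scores` helper, its parameters, and the choice to score empty denominators as 0.0 are all assumptions for illustration, not the repository's implementation.

```python
def review_scores(matched: int, predicted: int, expected: int) -> dict[str, float]:
    """Compute precision, recall and F1 from comment counts (hypothetical helper).

    matched:   agent findings that correspond to an expected review comment
    predicted: total findings the agent produced
    expected:  total expected review comments for the entry
    """
    # Assumption: an empty denominator scores 0.0 rather than raising.
    precision = matched / predicted if predicted else 0.0
    recall = matched / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


class PrecisionScore:
    """Same shape as the evaluator added in evaluator/scores.py."""

    def __call__(self, *, metadata: dict, **kwargs: object) -> float:
        return float(metadata.get("precision", 0.0))


if __name__ == "__main__":
    metadata = review_scores(matched=2, predicted=4, expected=5)
    print(PrecisionScore()(metadata=metadata))
```

Because each evaluator falls back to `0.0` when its key is absent, an entry with no parsed review scores the same as one whose findings all missed.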
4 changes: 3 additions & 1 deletion src/bcbench/agent/copilot/agent.py
@@ -49,7 +49,9 @@ def run_copilot_agent(
     logger.info(f"Executing Copilot CLI in directory: {repo_path}")
     logger.debug(f"Using prompt:\n{prompt}")

-    copilot_cmd = shutil.which("copilot.cmd") or shutil.which("copilot")
+    # Prefer copilot.exe over copilot.bat/copilot.cmd shims on Windows: the .bat shim invokes PowerShell,
+    # which re-parses arguments and corrupts prompts containing double quotes (e.g. JSON examples).
+    copilot_cmd = shutil.which("copilot.exe") or shutil.which("copilot.cmd") or shutil.which("copilot")
     if not copilot_cmd:
         raise AgentError("Copilot CLI not found in PATH. Please ensure it is installed and available.")
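The chained `shutil.which(...) or ...` lookup encodes a preference order: the native executable first, then the shims. If the list grows, the same behaviour could be expressed as a small loop, as in the sketch below. The `find_first_executable` helper is hypothetical and not part of the PR; `shutil.which` is the standard-library call the diff already uses.

```python
import shutil
from typing import Optional, Sequence


def find_first_executable(candidates: Sequence[str]) -> Optional[str]:
    """Return the resolved path of the first candidate found on PATH, or None (hypothetical helper)."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    # Same preference order as the diff: native executable before the .cmd shim.
    copilot_cmd = find_first_executable(["copilot.exe", "copilot.cmd", "copilot"])
    print(copilot_cmd or "Copilot CLI not found in PATH")
```

The ordering matters only on machines where several variants are installed; elsewhere the first match is the only match.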

18 changes: 16 additions & 2 deletions src/bcbench/agent/shared/config.yaml
@@ -54,6 +54,20 @@ prompt:
{{task}}
{% endif %}

code-review-template: |
/review

Review the current branch changes and save findings to a file named "review.json" in the repository root.

Review ONLY the current working-tree AL file changes for this evaluation entry.
Do NOT compare commits (for example, do NOT use HEAD~1..HEAD or origin/main comparisons).
Use working tree diff only (git diff HEAD), and focus on changed *.al files.

The file must contain valid JSON with a top-level object named findings.
Each finding must include: filePath, lineNumber, severity, issue, recommendation.
Allowed severity values are: critical, high, medium, low.
If there are no findings, write an empty findings list.

# controls:
# 1. whether to copy custom instructions from `src/bcbench/agent/shared/instructions/<sanitized-repo>/`
# - Copilot: copies to repo/.github/ and renames AGENTS.md to copilot-instructions.md
@@ -63,14 +77,14 @@ prompt:
# NOTE: the canonical source file is AGENTS.md; it is automatically renamed
# to the agent-specific filename (AgentType.instruction_filename) during setup
instructions:
-enabled: false
+enabled: true

# controls:
# 1. whether to copy skills from `src/bcbench/agent/shared/instructions/<sanitized-repo>/skills/`
# - Copilot: copies to repo/.github/skills/
# - Claude: copies to repo/.claude/skills/
skills:
-enabled: false
+enabled: true

# controls:
# 1. whether to copy custom agents from `src/bcbench/agent/shared/instructions/<sanitized-repo>/agents/`
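The `code-review-template` added earlier in this file specifies what `review.json` must contain: a top-level `findings` entry, five required fields per finding, and four allowed severity values. A minimal validator for that contract might look like the sketch below. The `validate_review` function is hypothetical and not the pipeline's parser, and it assumes `findings` is a JSON array, which the template's "empty findings list" wording suggests but does not state outright.

```python
import json

# Field names and severity values are taken from the prompt template.
REQUIRED_FIELDS = ("filePath", "lineNumber", "severity", "issue", "recommendation")
ALLOWED_SEVERITIES = ("critical", "high", "medium", "low")


def validate_review(text: str) -> list[str]:
    """Return a list of problems; an empty list means the document satisfies the template."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    # Assumption: findings is a list of objects under a top-level JSON object.
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        return ["top-level object must contain a 'findings' list"]

    problems: list[str] = []
    for index, finding in enumerate(data["findings"]):
        if not isinstance(finding, dict):
            problems.append(f"finding {index}: not an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in finding]
        if missing:
            problems.append(f"finding {index}: missing {', '.join(missing)}")
        if "severity" in finding and finding["severity"] not in ALLOWED_SEVERITIES:
            problems.append(f"finding {index}: severity is not an allowed value")
    return problems


if __name__ == "__main__":
    print(validate_review('{"findings": []}'))
```

A check of this kind is one way a boolean such as `valid_review_output` could be derived, though the diff does not show how that flag is actually set.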