Uptake latest bc-eval package for storage options and LMchecklist by haoranpb · Pull Request #647 · microsoft/BC-Bench

haoranpb · 2026-05-21T10:48:30Z

Uptake the latest version of bc-eval package for:

- Storage options (also upload results to Kusto)
- LMchecklist

Also includes some refactoring to make it easier to add future categories.

Stop uploading results for test runs before merging the code.

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

This reverts commit b6a946a.

Copilot

Pull request overview

This PR updates BC-Bench to align with newer bc-eval capabilities (storage options + LMChecklist-style expected outputs) and adds small CLI helpers so workflows can derive bc-eval configuration from the selected evaluation category.

Changes:

- Extend dataset entries to support `ExpectedOutput` as either a patch string or an LMChecklist-style assertions payload.
- Add `EvaluationCategory.evaluators` and `EvaluationCategory.core_score`, plus a `bcbench category bceval-config` CLI command for GitHub Actions step outputs.
- Update CI/workflows to use the new category CLI and upgrade the bc-eval tool installation and invocation parameters.
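The first two changes can be pictured with a minimal Python sketch. The names `Checklist`, `ExpectedOutput`, `EvaluationCategory`, `evaluators`, and `core_score` come from the review summary; everything else (the enum members, the evaluator and metric strings, the checklist being a plain dict, and the `describe_expected` helper) is a hypothetical stand-in and not the repository's actual definitions.

```python
from enum import Enum
from typing import Any, Dict, List, Union

# Assumption: a checklist payload is a JSON-like dict; its real shape is not shown here.
Checklist = Dict[str, Any]
# Expected output is either a patch string or a checklist-style payload.
ExpectedOutput = Union[str, Checklist]


class EvaluationCategory(str, Enum):
    # Hypothetical members, for illustration only.
    BUG_FIX = "bug-fix"
    CODE_REVIEW = "code-review"

    @property
    def evaluators(self) -> List[str]:
        # Looked up at call time, so the table can be defined below the class.
        return _EVALUATORS[self]

    @property
    def core_score(self) -> str:
        return _CORE_SCORE[self]


# Hypothetical evaluator and metric names; one entry per category.
_EVALUATORS = {
    EvaluationCategory.BUG_FIX: ["patch_match"],
    EvaluationCategory.CODE_REVIEW: ["lm_checklist"],
}
_CORE_SCORE = {
    EvaluationCategory.BUG_FIX: "resolved",
    EvaluationCategory.CODE_REVIEW: "checklist_pass_rate",
}

# Exhaustiveness guard: adding a category without its entries fails at import time.
assert set(_EVALUATORS) == set(EvaluationCategory)
assert set(_CORE_SCORE) == set(EvaluationCategory)


def describe_expected(expected: ExpectedOutput) -> str:
    """Return which of the two expected-output forms was given."""
    return "patch" if isinstance(expected, str) else "checklist"
```

Keeping the per-category tables next to the enum means a future category only needs a new member plus one entry in each table, which is one plausible reading of the "easier to add future categories" goal.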

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/test_type_exhaustiveness.py` | Updates expected-output assertions and adds exhaustiveness tests for new `evaluators`/`core_score` properties. |
| `tests/test_result_writer.py` | Adds coverage ensuring dict-based `expected` payloads are preserved in exported JSONL. |
| `tests/test_category_command.py` | New tests for `bcbench category list` and `bcbench category bceval-config`. |
| `src/bcbench/types.py` | Adds `EvaluationCategory.evaluators` and `EvaluationCategory.core_score`. |
| `src/bcbench/results/bceval_export.py` | Types `expected` as `ExpectedOutput` to allow dict payloads in JSONL export. |
| `src/bcbench/dataset/dataset_entry.py` | Introduces `Checklist`/`ExpectedOutput` type aliases and widens `get_expected_output()` return type. |
| `src/bcbench/dataset/__init__.py` | Re-exports `ExpectedOutput`. |
| `src/bcbench/commands/category.py` | New CLI commands for category listing and bc-eval config emission to `$GITHUB_OUTPUT`. |
| `src/bcbench/commands/__init__.py` | Exposes `category_app`. |
| `src/bcbench/cli.py` | Registers the new `category` command group. |
| `.github/workflows/summarize-results.yml` | Uses `bcbench category bceval-config`, upgrades bc-eval install, and switches to `--storage`-based publishing. |
| `.github/workflows/CI.yml` | Selects random category via `bcbench category list` (adds checkout + uv setup for that job). |
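For readers unfamiliar with how a CLI command can hand values to later workflow steps, here is a rough sketch of emitting step outputs. GitHub Actions exposes a file path in the `GITHUB_OUTPUT` environment variable, and each `name=value` line appended to that file becomes a step output. The function name `write_step_outputs` and the output keys used in the usage note are assumptions for illustration; the actual `bceval-config` command may be structured differently.

```python
import os
from typing import Dict, Optional


def write_step_outputs(outputs: Dict[str, str], output_path: Optional[str] = None) -> None:
    """Append name=value pairs to the GitHub Actions step-output file.

    Falls back to printing the pairs when no output file is available,
    for example when run locally outside a workflow.
    """
    path = output_path or os.environ.get("GITHUB_OUTPUT")
    lines = []
    for name, value in outputs.items():
        # The single-line name=value form cannot carry newlines.
        if "\n" in value:
            raise ValueError(f"output {name!r} must be a single line")
        lines.append(f"{name}={value}")

    if not path:
        print("\n".join(lines))
        return

    # Append rather than overwrite: earlier writes in the same step must survive.
    with open(path, "a", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

A later step could then read a value such as `${{ steps.<step-id>.outputs.core_score }}`, assuming the command wrote a `core_score` key (a hypothetical key name here).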

haoranpb added 6 commits May 21, 2026 12:40

add category command to allow extensible bc-eval upload

386a897

fix core score must match metrics class

38f174f

fix deprecated upload result

c25a2a5

fix test every category must have core score

903ebd5

fix more tests related to core score

5170186

enable LMchecklist in general

572c748

github-code-quality Bot found potential problems May 21, 2026

Comment thread src/bcbench/dataset/dataset_entry.py Dismissed

haoranpb and others added 2 commits May 21, 2026 14:14

consistent style for types

b6a946a

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Revert "consistent style for types"

5b4c414

This reverts commit b6a946a.

haoranpb requested a review from Copilot May 21, 2026 12:17

Copilot started reviewing on behalf of haoranpb May 21, 2026 12:17

Copilot AI reviewed May 21, 2026

Comment thread .github/workflows/summarize-results.yml Outdated

Comment thread .github/workflows/summarize-results.yml

Comment thread src/bcbench/results/bceval_export.py

haoranpb added 2 commits May 21, 2026 14:23

rename input to task_input to avoid naming conflict

c2370e6

enforce category in result summary

c28ae6f

haoranpb marked this pull request as ready for review May 21, 2026 12:51

kmhansen previously approved these changes May 21, 2026

haoranpb added 2 commits May 21, 2026 15:08

remove storage name from description

f46360f

conditionally include storage options in leaderboard update step

84ad99c

haoranpb dismissed kmhansen’s stale review via 84ad99c May 21, 2026 13:55

kmhansen approved these changes May 21, 2026

haoranpb wants to merge 12 commits into main from feature/uptake-latest-bceval

3 participants
