Uptake latest bc-eval package for storage options and LMchecklist #647

Open. haoranpb wants to merge 12 commits into main from feature/uptake-latest-bceval

haoranpb (Collaborator) commented May 21, 2026

Uptake the latest version of the bc-eval package for:

  1. Storage options (also upload results to Kusto)
  2. LMchecklist

Also includes some refactoring to make it easier to add future categories.

  • Stop uploading results for test runs before merging the code

Comment thread src/bcbench/dataset/dataset_entry.py Dismissed
haoranpb and others added 2 commits May 21, 2026 14:14
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates BC-Bench to align with newer bc-eval capabilities (storage options + LMChecklist-style expected outputs) and adds small CLI helpers so workflows can derive bc-eval configuration from the selected evaluation category.

Changes:

  • Extend dataset entries to support ExpectedOutput as either a patch string or an LMChecklist-style assertions payload.
  • Add EvaluationCategory.evaluators and EvaluationCategory.core_score plus a bcbench category bceval-config CLI command for GitHub Actions step outputs.
  • Update CI/workflows to use the new category CLI and upgrade the bc-eval tool installation + invocation parameters.
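
The first change above (an expected output that is either a patch string or a checklist payload) can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the alias definitions, the `to_jsonl_line` helper, and the `id` and `assertions` field names are assumptions made for the example.

```python
import json
from typing import Any, Union

# Hypothetical aliases; the actual definitions live in
# src/bcbench/dataset/dataset_entry.py and may differ.
Checklist = dict[str, Any]
ExpectedOutput = Union[str, Checklist]


def to_jsonl_line(instance_id: str, expected: ExpectedOutput) -> str:
    """Serialize one record as a single JSONL line.

    A dict payload is kept as a nested JSON object rather than being
    converted to a string, so it can be read back unchanged.
    """
    return json.dumps({"id": instance_id, "expected": expected}, sort_keys=True)


# A patch-style expected output (plain string).
patch_line = to_jsonl_line("entry-1", "diff --git a/x.al b/x.al\n")

# A checklist-style expected output (dict payload; shape is illustrative only).
checklist_line = to_jsonl_line("entry-2", {"assertions": ["handles empty input"]})
```

Keeping the dict as a nested object means a downstream reader can distinguish the two shapes with a simple `isinstance` check after `json.loads`.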

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_type_exhaustiveness.py` | Updates expected-output assertions and adds exhaustiveness tests for new evaluators/core_score properties. |
| `tests/test_result_writer.py` | Adds coverage ensuring dict-based expected payloads are preserved in exported JSONL. |
| `tests/test_category_command.py` | New tests for bcbench category list and bcbench category bceval-config. |
| `src/bcbench/types.py` | Adds EvaluationCategory.evaluators and EvaluationCategory.core_score. |
| `src/bcbench/results/bceval_export.py` | Types expected as ExpectedOutput to allow dict payloads in JSONL export. |
| `src/bcbench/dataset/dataset_entry.py` | Introduces Checklist/ExpectedOutput type aliases and widens get_expected_output() return type. |
| `src/bcbench/dataset/__init__.py` | Re-exports ExpectedOutput. |
| `src/bcbench/commands/category.py` | New CLI commands for category listing and bc-eval config emission to $GITHUB_OUTPUT. |
| `src/bcbench/commands/__init__.py` | Exposes category_app. |
| `src/bcbench/cli.py` | Registers the new category command group. |
| `.github/workflows/summarize-results.yml` | Uses bcbench category bceval-config, upgrades bc-eval install, and switches to --storage-based publishing. |
| `.github/workflows/CI.yml` | Selects random category via bcbench category list (adds checkout + uv setup for that job). |
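
For readers unfamiliar with step outputs, the idea behind emitting bc-eval configuration to $GITHUB_OUTPUT can be sketched as below. GitHub Actions reads `name=value` lines appended to the file that `$GITHUB_OUTPUT` points to. Everything else here is an assumption for illustration: the enum members, the evaluator and score names, and the `write_bceval_config` helper are hypothetical and do not reflect the PR's actual values.

```python
import os
import tempfile
from enum import Enum


class EvaluationCategory(str, Enum):
    # Hypothetical members; the real enum lives in src/bcbench/types.py.
    BUG_FIX = "bug-fix"
    CHECKLIST = "checklist"

    @property
    def evaluators(self) -> list[str]:
        # Each category maps to the evaluators bc-eval should run (names invented).
        if self is EvaluationCategory.BUG_FIX:
            return ["patch-match"]
        if self is EvaluationCategory.CHECKLIST:
            return ["lm-checklist"]
        raise NotImplementedError(self)

    @property
    def core_score(self) -> str:
        # The headline metric reported for the category (names invented).
        if self is EvaluationCategory.BUG_FIX:
            return "resolved_rate"
        if self is EvaluationCategory.CHECKLIST:
            return "checklist_pass_rate"
        raise NotImplementedError(self)


def write_bceval_config(category: EvaluationCategory, output_path: str) -> None:
    """Append name=value lines in the format GitHub Actions expects for step outputs."""
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"evaluators={','.join(category.evaluators)}\n")
        fh.write(f"core_score={category.core_score}\n")


if __name__ == "__main__":
    # Outside of a workflow run, fall back to a temporary file.
    path = os.environ.get("GITHUB_OUTPUT") or os.path.join(
        tempfile.mkdtemp(), "github_output.txt"
    )
    write_bceval_config(EvaluationCategory.CHECKLIST, path)
```

Raising on an unhandled member is one way to make a missing mapping fail loudly, which is the kind of gap an exhaustiveness test over all enum members would catch.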

Comment thread .github/workflows/summarize-results.yml Outdated
Comment thread .github/workflows/summarize-results.yml
Comment thread src/bcbench/results/bceval_export.py
@haoranpb haoranpb marked this pull request as ready for review May 21, 2026 12:51
kmhansen previously approved these changes May 21, 2026

3 participants