Uptake latest bc-eval package for storage options and LMchecklist #647
Open
haoranpb wants to merge 12 commits into
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
This reverts commit b6a946a.
Pull request overview
This PR updates BC-Bench to align with newer bc-eval capabilities (storage options + LMChecklist-style expected outputs) and adds small CLI helpers so workflows can derive bc-eval configuration from the selected evaluation category.
Changes:
- Extend dataset entries to support `ExpectedOutput` as either a patch string or an LMChecklist-style assertions payload.
- Add `EvaluationCategory.evaluators` and `EvaluationCategory.core_score`, plus a `bcbench category bceval-config` CLI command for GitHub Actions step outputs.
- Update CI/workflows to use the new category CLI and upgrade the `bc-eval` tool installation + invocation parameters.
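The first two changes can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's actual code: only the names `Checklist`, `ExpectedOutput`, `EvaluationCategory.evaluators`, and `EvaluationCategory.core_score` come from the PR. The category members, evaluator names, score names, and the `describe_expected` helper are hypothetical.

```python
from enum import Enum
from typing import Any, Dict, List, Union

# Assumed shapes: the PR only states that an expected output is either a
# patch string or an LMChecklist-style assertions payload (a dict).
Checklist = Dict[str, Any]
ExpectedOutput = Union[str, Checklist]


class EvaluationCategory(str, Enum):
    # Hypothetical members, for illustration only.
    BUG_FIX = "bug-fix"
    CODE_REVIEW = "code-review"

    @property
    def evaluators(self) -> List[str]:
        # Evaluator names are placeholders, not real bc-eval identifiers.
        if self is EvaluationCategory.BUG_FIX:
            return ["patch_evaluator"]
        if self is EvaluationCategory.CODE_REVIEW:
            return ["checklist_evaluator"]
        raise NotImplementedError(f"No evaluators defined for {self!r}")

    @property
    def core_score(self) -> str:
        # Score names are placeholders as well.
        if self is EvaluationCategory.BUG_FIX:
            return "resolved"
        if self is EvaluationCategory.CODE_REVIEW:
            return "checklist_pass_rate"
        raise NotImplementedError(f"No core score defined for {self!r}")


def describe_expected(expected: ExpectedOutput) -> str:
    """Hypothetical helper: tell the two payload shapes apart."""
    return "patch" if isinstance(expected, str) else "checklist"
```

Raising on an unhandled member means an exhaustiveness test can simply loop over every `EvaluationCategory` member and access both properties, so a newly added category fails loudly until it is configured.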
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_type_exhaustiveness.py | Updates expected-output assertions and adds exhaustiveness tests for new evaluators/core_score properties. |
| tests/test_result_writer.py | Adds coverage ensuring dict-based expected payloads are preserved in exported JSONL. |
| tests/test_category_command.py | New tests for bcbench category list and bcbench category bceval-config. |
| src/bcbench/types.py | Adds EvaluationCategory.evaluators and EvaluationCategory.core_score. |
| src/bcbench/results/bceval_export.py | Types expected as ExpectedOutput to allow dict payloads in JSONL export. |
| src/bcbench/dataset/dataset_entry.py | Introduces Checklist/ExpectedOutput type aliases and widens get_expected_output() return type. |
| src/bcbench/dataset/__init__.py | Re-exports ExpectedOutput. |
| src/bcbench/commands/category.py | New CLI commands for category listing and bc-eval config emission to $GITHUB_OUTPUT. |
| src/bcbench/commands/__init__.py | Exposes category_app. |
| src/bcbench/cli.py | Registers the new category command group. |
| .github/workflows/summarize-results.yml | Uses bcbench category bceval-config, upgrades bc-eval install, and switches to --storage-based publishing. |
| .github/workflows/CI.yml | Selects random category via bcbench category list (adds checkout + uv setup for that job). |
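For context on the `$GITHUB_OUTPUT` emission mentioned for `src/bcbench/commands/category.py`: GitHub Actions reads step outputs as `name=value` lines appended to the file named by the `GITHUB_OUTPUT` environment variable. A minimal sketch of that mechanism follows. The function name `write_step_outputs` and the output keys in the usage example are hypothetical and are not taken from the PR, and the sketch assumes single-line values only.

```python
import os
from typing import Dict, List, Optional


def write_step_outputs(outputs: Dict[str, str], output_file: Optional[str] = None) -> List[str]:
    """Append name=value lines to the GitHub Actions step-output file.

    Falls back to the GITHUB_OUTPUT environment variable when no path is
    given. Returns the lines so callers can also log them locally.
    """
    lines = [f"{name}={value}" for name, value in outputs.items()]
    target = output_file or os.environ.get("GITHUB_OUTPUT")
    if target:
        # Append rather than overwrite: earlier commands in the same step
        # may already have written outputs to this file.
        with open(target, "a", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    return lines
```

A call such as `write_step_outputs({"evaluators": "a,b", "core_score": "x"})` inside a workflow step would make those values available to later steps as `steps.<id>.outputs.evaluators` and `steps.<id>.outputs.core_score`.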
kmhansen previously approved these changes May 21, 2026
kmhansen approved these changes May 21, 2026
Uptake the latest version of the `bc-eval` package for storage options and LMChecklist.
Also includes some refactoring to make it easier to add future categories.