feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge by easel · Pull Request #337 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:31:41Z

Summary

Adds the standalone luce-bench/ Python eval harness package (vendored into the repo), plus a multi-turn agent_recorded replay area graded by a Claude Sonnet LLM judge. Also wires up a card-bundle drift check script and CI workflow updates.

Files

luce-bench/ — new top-level package (pyproject, src layout, tests, fixtures, docs)
- src/lucebench/areas/ — eval areas: smoke, agent, agent_recorded, ds4_eval, forge, gsm8k, hellaswag, humaneval, longctx, truthfulqa_mc1
- src/lucebench/grading/llm_judge.py — Sonnet judge with on-disk cache
- src/lucebench/fixtures/ — vendored eval fixtures (incl. forge_eval scenarios)
- tests/ — ~20 test modules covering graders, runners, snapshot, cards
scripts/check_card_bundle_drift.sh — new drift CI helper
.github/workflows/ci.yml — wires luce-bench tests + card bundle drift check
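
As an illustration of what "Sonnet judge with on-disk cache" could look like, here is a minimal sketch. It is not the code in `llm_judge.py`: the names `cache_key`, `cached_judge`, and `fake_judge`, the JSON-per-verdict file layout, and the SHA-256 key are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(model: str, prompt: str) -> str:
    """Stable key: SHA-256 over canonical JSON of the judge inputs (assumed scheme)."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_judge(cache_dir: Path, model: str, prompt: str,
                 call_judge: Callable[[str, str], dict]) -> dict:
    """Return a cached verdict if one exists, else call the judge and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    verdict = call_judge(model, prompt)
    path.write_text(json.dumps(verdict), encoding="utf-8")
    return verdict


calls = []


def fake_judge(model: str, prompt: str) -> dict:
    """Hypothetical stand-in for the real judge call; records each invocation."""
    calls.append(prompt)
    return {"pass": True, "reason": "stub"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_judge(Path(tmp), "judge-model", "grade this", fake_judge)
    second = cached_judge(Path(tmp), "judge-model", "grade this", fake_judge)
```

The second call is served from disk, so the stub judge is invoked only once. Keying on both the model and the prompt means a change to either one misses the cache instead of returning a stale verdict.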

Dependencies

None. This PR is self-contained: pure-Python harness with no source-level references to the server, lucebox CLI, or docker-stack PRs. Independent of #334, #335, #336, #338, #339, #340, #341.

Test plan

cd luce-bench && pytest passes locally
CI green on the new card-bundle-drift job
Dry-run multi-turn agent_recorded against a live server with judge mocked
Confirm default --areas all does not enable the LLM-judge path (cost containment)
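
The last check can be pictured with a small sketch. This is an assumption about the shape of the gating, not the harness's actual CLI code: `plan_run`, `KNOWN_AREAS` as a constant, and the `llm_judge` opt-in parameter are hypothetical names; only the area names come from the Files list above.

```python
KNOWN_AREAS = [
    "smoke", "agent", "agent_recorded", "ds4_eval", "forge",
    "gsm8k", "hellaswag", "humaneval", "longctx", "truthfulqa_mc1",
]


def plan_run(areas_arg: str, llm_judge: bool = False) -> dict:
    """Expand an --areas value; judge grading stays off unless explicitly requested."""
    if areas_arg == "all":
        names = list(KNOWN_AREAS)
    else:
        names = [a.strip() for a in areas_arg.split(",") if a.strip()]
    unknown = [n for n in names if n not in KNOWN_AREAS]
    if unknown:
        raise ValueError(f"unknown areas: {unknown}")
    # The judge only applies when it was opted into AND agent_recorded is selected.
    return {"areas": names, "llm_judge": llm_judge and "agent_recorded" in names}


default_plan = plan_run("all")
opted_in_plan = plan_run("all", llm_judge=True)
```

Under this assumed design, `--areas all` still selects `agent_recorded`, but the paid judge path runs only when the caller asks for it.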

Judge cost estimate: ~$0.30-$1.50 per full 48-case pass on Anthropic Sonnet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl build the lucebox-hub image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions add .github/workflows/docker.yml (build & publish), update ci.yml, and add release-luce-bench.yml for tagging. Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README) live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR deploys into. Centralizing build/publish here keeps Dockerfile, entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image

… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

cubic-dev-ai

16 issues found across 116 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/check_card_bundle_drift.sh">

<violation number="1" location="scripts/check_card_bundle_drift.sh:18">
P2: Drift guard is one-way and misses extra/stale files in the bundled wheel, so CI can report success even when bundle contents do not exactly match `share/model_cards`.</violation>
</file>
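
One way to make the guard two-way is to compare both trees as path-to-digest maps and report every kind of difference. The sketch below is written in Python for illustration only; the actual script is shell, and `tree_digest` and `drift` are hypothetical helpers, not functions in the repo.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def drift(source: dict[str, str], bundle: dict[str, str]) -> dict[str, list[str]]:
    """Report files missing from the bundle, extra in the bundle, or differing."""
    shared = source.keys() & bundle.keys()
    return {
        "missing": sorted(source.keys() - bundle.keys()),
        "extra": sorted(bundle.keys() - source.keys()),
        "changed": sorted(k for k in shared if source[k] != bundle[k]),
    }


# Toy digests standing in for share/model_cards and the bundled wheel contents.
source = {"a.json": "1", "b.json": "2"}
bundle = {"a.json": "1", "b.json": "9", "stale.json": "3"}
report = drift(source, bundle)
has_drift = any(report.values())
```

A one-way check would catch only `missing` and `changed`; the `extra` bucket is what flags the stale bundled file described in this violation.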

<file name="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md">

<violation number="1" location="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md:325">
P2: Grammar for `Hunk` does not allow multiple `@@` context-scoping lines, contradicting the documented feature and examples. The production `Hunk := "@@" [ header ] NEWLINE { HunkLine }` expects only HunkLines (starting with space, `-`, or `+`) after the `@@` header, but the text explicitly says "use multiple `@@` statements to jump to the right context" and shows consecutive `@@` lines (e.g., `@@ class BaseClass` then `@@ def method():`). This inconsistency will confuse agent implementors or cause strict parsers to reject valid patch syntax.</violation>
</file>
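
One possible resolution is to relax the production so that a hunk may open with one or more `@@` lines before its body. The parser below is a minimal sketch of that relaxed reading, not the reference implementation of the patch format: `parse_hunks` and its dictionary layout are hypothetical, and the rule that a new `@@` after body lines starts a new hunk is an assumption.

```python
def parse_hunks(lines: list[str]) -> list[dict]:
    """Parse hunks, letting consecutive '@@' lines stack as nested context."""
    hunks: list[dict] = []
    current = None
    for line in lines:
        if line.startswith("@@"):
            header = line[2:].strip()
            # Start a new hunk unless the open hunk has only seen '@@' lines so far.
            if current is None or current["lines"]:
                current = {"context": [], "lines": []}
                hunks.append(current)
            if header:
                current["context"].append(header)
        elif line[:1] in (" ", "-", "+"):
            if current is None:
                raise ValueError(f"hunk line before any '@@': {line!r}")
            current["lines"].append(line)
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return hunks


patch = [
    "@@ class BaseClass",
    "@@ def method():",
    " x = 1",
    "-y = 2",
    "+y = 3",
]
parsed = parse_hunks(patch)
```

With this reading, the two consecutive `@@` lines from the documented example narrow the context of a single hunk rather than being rejected as an empty hunk followed by a second one.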

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.



…ge grading

## What

Adds the luce-bench/ Python package as a standalone bench harness:

- Core areas: smoke, gsm8k, hellaswag, humaneval, truthfulqa_mc1, longctx, agent, agent_recorded (multi-turn replay), forge, ds4_eval.
- Multi-turn agent_recorded replay with an LLM-judge grader and per-turn metrics; forge_eval fixture imported under fixtures/forge_eval/_forge/.
- Card sampling, snapshot/submit_baseline, model-card schema, thinking-budget client, normalize+regrade pipeline, hostinfo, and CLI entrypoint.
- Fixtures: agent_cases, agent_recorded (single + multi_turn), ds4_eval_cases, agent_prompts (codex variants).
- Tests: ~25 pytest modules covering every area plus thinking control, normalize/regrade, snapshot, runner, and host-info paths.
- scripts/extract-agentic-fixture.py (loaded by test_extract_agentic_fixture.py via path) and the scripts/check_card_bundle_drift.sh CI gate.

## Why

Splits the bench harness out of lucebox-hub's main tree so it can ship as its own installable package and be consumed by lucebox bench without pulling in the C++ server build context. Multi-turn replay + LLM-judge grading is what unblocks the coding-agent-loop sweep workflow.

## Dependencies

None - this PR is independent.


easel mentioned this pull request Jun 3, 2026

feat(luce-bench): multi-turn agent_recorded redesign + LLM-judge grading #333

Closed

easel changed the title ~~feat(luce-bench): standalone bench harness package + forge eval area~~ feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge Jun 3, 2026

easel force-pushed the feat/lucebench-harness branch from 5ae6880 to 421f852 Compare June 4, 2026 02:50

This was referenced Jun 4, 2026

build(docker): lucebox-hub container image + CI release pipeline #334

Open

feat(lucebox): hub CLI + autotune/sweep/profile + harness adapters + shell wrapper #335

Open

easel force-pushed the feat/lucebench-harness branch from da4e316 to c20a0f2 Compare June 4, 2026 05:03

easel marked this pull request as ready for review June 4, 2026 05:03

cubic-dev-ai Bot reviewed Jun 4, 2026

View reviewed changes

easel mentioned this pull request Jun 4, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

easel force-pushed the feat/lucebench-harness branch from c20a0f2 to 6f27131 Compare June 4, 2026 17:14

easel force-pushed the feat/lucebench-harness branch from 6f27131 to 27dd7f1 Compare June 4, 2026 23:23

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

Merge PR Luce-Org#337: feat(luce-bench): in-tree bench harness + mult…

518b629

…i-turn agent_recorded + LLM judge

easel force-pushed the feat/lucebench-harness branch from 27dd7f1 to ac972b7 Compare June 5, 2026 20:01

easel wants to merge 1 commit into Luce-Org:main from easel:feat/lucebench-harness
