Skip to content

feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge#337

Open
easel wants to merge 1 commit into
Luce-Org:mainfrom
easel:feat/lucebench-harness
Open

feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge#337
easel wants to merge 1 commit into
Luce-Org:mainfrom
easel:feat/lucebench-harness

Conversation

@easel
Copy link
Copy Markdown
Collaborator

@easel easel commented Jun 3, 2026

Summary

Adds the standalone luce-bench/ Python eval harness package (vendored into the repo), plus a multi-turn agent_recorded replay area graded by a Claude Sonnet LLM judge. Also wires up a card-bundle drift check script and CI workflow updates.

Files

  • luce-bench/ — new top-level package (pyproject, src layout, tests, fixtures, docs)
    • src/lucebench/areas/ — eval areas: smoke, agent, agent_recorded, ds4_eval, forge, gsm8k, hellaswag, humaneval, longctx, truthfulqa_mc1
    • src/lucebench/grading/llm_judge.py — Sonnet judge with on-disk cache
    • src/lucebench/fixtures/ — vendored eval fixtures (incl. forge_eval scenarios)
    • tests/ — ~20 test modules covering graders, runners, snapshot, cards
  • scripts/check_card_bundle_drift.sh — new drift CI helper
  • .github/workflows/ci.yml — wires luce-bench tests + card bundle drift check

Dependencies

None. This PR is self-contained: pure-Python harness with no source-level references to the server, lucebox CLI, or docker-stack PRs. Independent of #334, #335, #336, #338, #339, #340, #341.

Test plan

  • cd luce-bench && pytest passes locally
  • CI green on the new card-bundle-drift job
  • Dry-run multi-turn agent_recorded against a live server with judge mocked
  • Confirm default --areas all does not enable the LLM-judge path (cost containment)

Judge cost estimate: ~$0.30-$1.50 per full 48-case pass on Anthropic Sonnet.

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

@easel easel changed the title feat(luce-bench): standalone bench harness package + forge eval area feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge Jun 3, 2026
@easel easel force-pushed the feat/lucebench-harness branch from 5ae6880 to 421f852 Compare June 4, 2026 02:50
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl
build the lucebox-hub image (build-env and runtime stages);
scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh
emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions
add .github/workflows/docker.yml (build & publish), update ci.yml, and
add release-luce-bench.yml for tagging. Workspace-root files
(pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README)
live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR
deploys into. Centralizing build/publish here keeps Dockerfile,
entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
@easel easel force-pushed the feat/lucebench-harness branch from da4e316 to c20a0f2 Compare June 4, 2026 05:03
@easel easel marked this pull request as ready for review June 4, 2026 05:03
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 issues found across 116 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/check_card_bundle_drift.sh">

<violation number="1" location="scripts/check_card_bundle_drift.sh:18">
P2: Drift guard is one-way and misses extra/stale files in the bundled wheel, so CI can report success even when bundle contents do not exactly match `share/model_cards`.</violation>
</file>

<file name="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md">

<violation number="1" location="luce-bench/src/lucebench/fixtures/agent_prompts/codex_apply_patch.md:325">
P2: Grammar for `Hunk` does not allow multiple `@@` context-scoping lines, contradicting the documented feature and examples. The production `Hunk := "@@" [ header ] NEWLINE { HunkLine }` expects only HunkLines (starting with space, `-`, or `+`) after the `@@` header, but the text explicitly says "use multiple `@@` statements to jump to the right context" and shows consecutive `@@` lines (e.g., `@@ class BaseClass` then `@@ def method():`). This inconsistency will confuse agent implementors or cause strict parsers to reject valid patch syntax.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
On a pro plan you can use ultrareview for larger PRs.
Tip: cubic used a learning from your PR history. Let your coding agent read cubic learnings directly with the cubic MCP.

Re-trigger cubic

Comment thread luce-bench/src/lucebench/areas/longctx.py Outdated
Comment thread luce-bench/src/lucebench/areas/_mc.py Outdated
Comment thread luce-bench/src/lucebench/areas/forge.py Outdated
Comment thread luce-bench/src/lucebench/areas/agent_recorded.py
Comment thread luce-bench/src/lucebench/areas/agent_recorded.py Outdated
Comment thread luce-bench/src/lucebench/areas/forge.py Outdated
Comment thread luce-bench/src/lucebench/areas/forge.py Outdated
Comment thread luce-bench/src/lucebench/__init__.py Outdated
Comment thread luce-bench/src/lucebench/areas/_mc.py Outdated
Comment thread luce-bench/README.md Outdated
@easel easel force-pushed the feat/lucebench-harness branch from c20a0f2 to 6f27131 Compare June 4, 2026 17:14
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl
build the lucebox-hub image (build-env and runtime stages);
scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh
emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions
add .github/workflows/docker.yml (build & publish), update ci.yml, and
add release-luce-bench.yml for tagging. Workspace-root files
(pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README)
live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR
deploys into. Centralizing build/publish here keeps Dockerfile,
entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Containerization stack for lucebox-hub. Dockerfile + docker-bake.hcl
build the lucebox-hub image (build-env and runtime stages);
scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh
emits IMAGE_INFO / HOST_INFO sidecars consumed by /props. GitHub Actions
add .github/workflows/docker.yml (build & publish), update ci.yml, and
add release-luce-bench.yml for tagging. Workspace-root files
(pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README)
live here because the Dockerfile uv-syncs the workspace at build time.

## Why

Provides the reproducible image and CI pipeline every other split PR
deploys into. Centralizing build/publish here keeps Dockerfile,
entrypoint, and workspace-root pinning in one reviewable change.

## Dependencies

- Luce-Org#335 (lucebox-cli): Dockerfile COPYs lucebox/ into the image
- Luce-Org#337 (lucebench-harness): Dockerfile COPYs luce-bench/ into the image
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
@easel easel force-pushed the feat/lucebench-harness branch from 6f27131 to 27dd7f1 Compare June 4, 2026 23:23
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
…ge grading

## What

Adds the luce-bench/ Python package as a standalone bench harness:

- Core areas: smoke, gsm8k, hellaswag, humaneval, truthfulqa_mc1,
  longctx, agent, agent_recorded (multi-turn replay), forge, ds4_eval.
- Multi-turn agent_recorded replay with an LLM-judge grader and
  per-turn metrics; forge_eval fixture imported under
  fixtures/forge_eval/_forge/.
- Card sampling, snapshot/submit_baseline, model-card schema,
  thinking-budget client, normalize+regrade pipeline, hostinfo, and
  CLI entrypoint.
- Fixtures: agent_cases, agent_recorded (single + multi_turn),
  ds4_eval_cases, agent_prompts (codex variants).
- Tests: ~25 pytest modules covering every area plus thinking
  control, normalize/regrade, snapshot, runner, and host-info paths.
- scripts/extract-agentic-fixture.py (loaded by
  test_extract_agentic_fixture.py via path) and the
  scripts/check_card_bundle_drift.sh CI gate.

## Why

Splits the bench harness out of lucebox-hub's main tree so it can
ship as its own installable package and be consumed by lucebox bench
without pulling in the C++ server build context. Multi-turn replay +
LLM-judge grading is what unblocks the coding-agent-loop sweep
workflow.

## Dependencies

None - this PR is independent.
@easel easel force-pushed the feat/lucebench-harness branch from 27dd7f1 to ac972b7 Compare June 5, 2026 20:01
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant