fix(uat): capture diagnostics when --quiet hides cell failures by joeharris76 · Pull Request #720 · joeharris76/BenchBox

joeharris76 · 2026-05-31T15:19:48Z

Summary

Fast-platform UAT sweeps produced a wall of identical, opaque failures —
status=failed terminal_state=no_json_nonzero submit_terminal_state=missing_manifest
with an empty failure_tail — across clickhouse-server (57/57 and 22/22) and
scattered duckdb / datafusion / lakesail cells. That single signature was hiding
several unrelated root causes, made indistinguishable by one diagnostic gap.

Root cause of the opacity (fixed here): UAT cells run benchbox run --quiet
so the final stdout line is the result-JSON path the runner parses. But --quiet
also suppresses benchbox's own error output, so a non-zero exit left the
per-cell log empty (its tail read "(no subprocess output captured)") and every
distinct failure collapsed to the same signature.

no_json_nonzero + missing_manifest does not mean "ran but wrote no
manifest". It means benchbox itself exited non-zero: runner.py nulls
result_path whenever exit_code != 0 and forces submit_state = missing_manifest. The classifier is faithful; it just had nothing to report.
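
The rule described above can be sketched as follows. This is an illustrative reconstruction, not the code in runner.py: the function name `resolve_result` and its return shape are assumptions; only the behavior (null the result path and force `missing_manifest` on any non-zero exit) comes from the description.

```python
from __future__ import annotations


def resolve_result(exit_code: int, parsed_path: str | None) -> tuple[str | None, str | None]:
    """Return (result_path, forced_submit_state).

    Hypothetical helper: mirrors the described runner behavior only.
    """
    if exit_code != 0:
        # A non-zero exit discards any parsed path and forces the submit
        # state, regardless of what benchbox printed on stdout.
        return None, "missing_manifest"
    # Zero exit: keep the parsed path and force nothing.
    return parsed_path, None
```

Under this rule a cell that printed a valid path but exited non-zero is indistinguishable from one that printed nothing, which is why the signature alone carries no diagnostic information.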

Change

tests/uat/matrix.py: add a quiet: bool = True parameter to benchbox_run_argv so
callers can build a verbose argv on demand. The default stays True, preserving
result-path parsing; --quiet is now appended conditionally (ordering preserved
relative to --phases).
tests/uat/runner.py: when a cell exits non-zero with no captured output
(and did not time out), re-run it once verbosely (quiet=False, stderr
merged into stdout, capped at DIAGNOSTIC_RERUN_TIMEOUT_S=180s) via
_append_diagnostic_rerun, appending the transcript as plain lines so
_cell_log_tail surfaces it into failure_tail. Best-effort try/except so
diagnostics never mask the original failure.

This makes the next sweep self-describing — real per-cell errors land in
failure_tail / cells.jsonl instead of being swallowed.
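
A minimal sketch of the conditional `--quiet` handling, assuming a simplified builder. `build_argv` and its `base_args` parameter are hypothetical stand-ins for the real `benchbox_run_argv` signature, and placing `--quiet` after `--phases` is an assumption; the PR only states that the relative ordering is preserved.

```python
from __future__ import annotations

from collections.abc import Iterable


def build_argv(
    base_args: Iterable[str],
    *,
    phases: str | None = None,
    quiet: bool = True,
) -> list[str]:
    """Hypothetical, simplified stand-in for benchbox_run_argv."""
    argv = ["benchbox", "run", *base_args]
    if phases is not None:
        argv += ["--phases", phases]
    if quiet:
        # Default path: the runner parses the last stdout line as the result path.
        argv.append("--quiet")
    return argv
```

With `quiet=True` the argv is unchanged from before the PR, so existing callers and result-path parsing are unaffected; `quiet=False` yields the same command minus `--quiet` for the diagnostic re-run.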

Verification

pytest tests/uat/test_runner.py → 24 passed; tests/uat/test_matrix.py → 185 passed
ruff format / ruff check / ty check clean on both files; py_compile clean
make pr-preflight-fast-tests passed at first push (matrix.py); re-run after
the runner.py commit to validate the full change end-to-end.

Root-cause analysis & follow-ups (diagnosis-gated)

This PR fixes the diagnostic keystone. The remaining root causes need a live
docker-backed repro loop (now far easier with verbose tails). Each should be its
own follow-up:

| # | Platform / scope | Likely cause | Next step |
|---|---|---|---|
| RC-1 | clickhouse-server (100%, ~7–8s, `docker up=ok`) | Pre-load failure swallowed by `--quiet`; suspect uv-extra wiring or connection contract (cells pass no host/port, unlike starrocks/cedardb) | Re-run one cell with this fix → read real error → fix wiring in `PLATFORM_UV_EXTRA` / `PLATFORM_EXTRA_OPTS` |
| RC-2 | cedardb | (a) `Bind 0.0.0.0:5435 already allocated` (stale container); (b) sweep hung mid-run on `tpcds_obt 1.0`; per-cell JSONs were clean | Per-cell hard wall-clock kill; `docker compose down` in `finally`; pre-`up` port preflight |
| RC-3 | starrocks | Same sweep-hang pattern; per-bench JSONs exist, no `cells.jsonl`/`matrix_summary` | Same as RC-2 |
| RC-4 | duckdb / datafusion / lakesail (scattered) | Genuine benchmark failures, correctly reported. e.g. duckdb `write_primitives` load: "No files found that match pattern path_…"; `flightdata`/`datavault` sf=1.0 timeouts | Fix `write_primitives` datagen path-glob; raise `timeout_s` (or mark slow) for sf=1.0 outliers |
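
The pre-`up` port preflight proposed for RC-2 could look roughly like this. It is a sketch of one possible approach, not part of this PR: `port_is_free` is a hypothetical helper, and a bind probe only detects ports held on the local host at the moment of the check.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError if another socket already holds the port.
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

A sweep could call `port_is_free(5435)` before `docker compose up` and fail fast with a clear message instead of surfacing `already allocated` from Docker. The check is inherently racy (the port can be taken between the probe and `up`), so it complements rather than replaces the `docker compose down` cleanup.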

Notes

No functional change to the benchmark execution path; harness diagnostics only.
The diagnostic re-run fires only on failure with empty output (rare) and is
time-capped, so it adds no cost to passing or already-diagnosed cells.

🤖 Generated with Claude Code

UAT cells run `benchbox run --quiet` so the final stdout line is the result-JSON path parsed by the runner. But --quiet also suppresses benchbox's own error output, so a non-zero exit left the per-cell log empty and every distinct failure collapsed to the same opaque signature (no_json_nonzero / submit=missing_manifest / failure_tail="").

When a cell exits non-zero with no captured output, re-run it once verbosely (quiet=False, stderr merged into stdout, capped at 180s) and append the transcript as plain lines so `_cell_log_tail` surfaces it into `failure_tail`. Add a `quiet` parameter to `benchbox_run_argv` to build the verbose argv; the default stays True to preserve result-path parsing and existing argv ordering.

This unblocks root-causing the fast-platform sweep failures (clickhouse-server, scattered duckdb/datafusion/lakesail cells) by making the next run self-describing instead of opaque.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 370a19ce1f

chatgpt-codex-connector · 2026-05-31T15:21:10Z

    compression: str | None = None,
    extra_args: Iterable[str] = (),
    local_managed_platform: bool = False,
+    quiet: bool = True,


Rerun failed cells without --quiet

This only adds a quiet switch to the argv builder, but run_cell() still calls benchbox_run_argv(...) without passing it and never performs the documented non-zero retry (checked tests/uat/runner.py:129-151). In the scenario this commit targets—benchbox run --quiet exits non-zero with no stdout—the per-cell log remains just the command header and any stderr, so failure_tail is still empty and diagnostics are not captured. Wire a retry path that calls this with quiet=False when the first quiet run fails without captured output.

Completes the diagnostics fix. The prior commit added the `quiet` parameter to `benchbox_run_argv` but nothing invoked it, leaving it inert.

This wires it up in `run_cell`: when a `--quiet` cell exits non-zero with no captured output (and did not time out), re-run it once verbosely via `_append_diagnostic_rerun` (stderr merged into stdout, capped at DIAGNOSTIC_RERUN_TIMEOUT_S=180s) and append the transcript as plain lines so `_cell_log_tail` surfaces the real error into `failure_tail`. Best-effort try/except so diagnostics never mask the original failure.

Verified: tests/uat/test_runner.py (24) and test_matrix.py (185) pass; ruff format/check and ty clean; py_compile clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
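
The re-run path described in this commit can be approximated as below. This is a hedged sketch, not the merged code: `should_rerun` and `append_diagnostic_rerun` are hypothetical names and signatures (the real helper is `_append_diagnostic_rerun`, whose exact interface is not shown here), and the log header text is invented for illustration.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

DIAGNOSTIC_RERUN_TIMEOUT_S = 180  # cap taken from the PR description


def should_rerun(exit_code: int, captured_output: str, timed_out: bool) -> bool:
    """Re-run only for non-zero exits with no captured output and no timeout."""
    return exit_code != 0 and not captured_output.strip() and not timed_out


def append_diagnostic_rerun(
    verbose_argv: list[str],
    log_path: Path,
    timeout_s: float = DIAGNOSTIC_RERUN_TIMEOUT_S,
) -> None:
    """Run verbose_argv once and append its transcript to the cell log."""
    try:
        proc = subprocess.run(
            verbose_argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
            check=False,
        )
        with log_path.open("a", encoding="utf-8") as log:
            log.write("diagnostic re-run (verbose):\n")  # illustrative header
            for line in (proc.stdout or "").splitlines():
                log.write(line + "\n")
    except Exception:
        # Best-effort: a failed or timed-out re-run must not mask the original failure.
        pass
```

Because the transcript is appended as plain lines to the same per-cell log, whatever already extracts the log tail picks it up without further changes. Swallowing every exception is deliberate here: the re-run exists only to add information, so any problem in it leaves the original classification untouched.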

joeharris76 enabled auto-merge (squash) May 31, 2026 15:19

chatgpt-codex-connector Bot reviewed May 31, 2026

joeharris76 merged commit 1b42667 into develop May 31, 2026
7 checks passed

joeharris76 deleted the fix/uat-fast-platform-failures branch May 31, 2026 15:33

joeharris76 merged 2 commits into develop from fix/uat-fast-platform-failures
