fix(uat): capture diagnostics when --quiet hides cell failures#720
Conversation
UAT cells run `benchbox run --quiet` so the final stdout line is the result-JSON path parsed by the runner. But --quiet also suppresses benchbox's own error output, so a non-zero exit left the per-cell log empty and every distinct failure collapsed to the same opaque signature (no_json_nonzero / submit=missing_manifest / failure_tail=""). When a cell exits non-zero with no captured output, re-run it once verbosely (quiet=False, stderr merged into stdout, capped at 180s) and append the transcript as plain lines so `_cell_log_tail` surfaces it into `failure_tail`. Add a `quiet` parameter to `benchbox_run_argv` to build the verbose argv; the default stays True to preserve result-path parsing and existing argv ordering. This unblocks root-causing the fast-platform sweep failures (clickhouse-server, scattered duckdb/datafusion/lakesail cells) by making the next run self-describing instead of opaque. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 370a19ce1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| compression: str | None = None, | ||
| extra_args: Iterable[str] = (), | ||
| local_managed_platform: bool = False, | ||
| quiet: bool = True, |
There was a problem hiding this comment.
Rerun failed cells without --quiet
This only adds a quiet switch to the argv builder, but run_cell() still calls benchbox_run_argv(...) without passing it and never performs the documented non-zero retry (checked tests/uat/runner.py:129-151). In the scenario this commit targets—benchbox run --quiet exits non-zero with no stdout—the per-cell log remains just the command header and any stderr, so failure_tail is still empty and diagnostics are not captured. Wire a retry path that calls this with quiet=False when the first quiet run fails without captured output.
Useful? React with 👍 / 👎.
Completes the diagnostics fix. The prior commit added the `quiet` parameter to `benchbox_run_argv` but nothing invoked it, leaving it inert. This wires it up in `run_cell`: when a `--quiet` cell exits non-zero with no captured output (and did not time out), re-run it once verbosely via `_append_diagnostic_rerun` (stderr merged into stdout, capped at DIAGNOSTIC_RERUN_TIMEOUT_S=180s) and append the transcript as plain lines so `_cell_log_tail` surfaces the real error into `failure_tail`. Best-effort try/except so diagnostics never mask the original failure. Verified: tests/uat/test_runner.py (24) and test_matrix.py (185) pass; ruff format/check and ty clean; py_compile clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fast-platform UAT sweeps produced a wall of identical, opaque failures —
status=failed terminal_state=no_json_nonzero submit_terminal_state=missing_manifestwith an empty
failure_tail— across clickhouse-server (57/57 and 22/22) andscattered duckdb / datafusion / lakesail cells. That single signature was hiding
several unrelated root causes, made indistinguishable by one diagnostic gap.
Root cause of the opacity (fixed here): UAT cells run
benchbox run --quietso the final stdout line is the result-JSON path the runner parses. But
--quietalso suppresses benchbox's own error output, so a non-zero exit left the
per-cell log empty (
(no subprocess output captured)) — every distinct failurecollapsed to the same signature.
no_json_nonzero+missing_manifestdoes not mean "ran but wrote nomanifest". It means benchbox itself exited non-zero:
runner.pynullsresult_pathwheneverexit_code != 0and forcessubmit_state = missing_manifest. The classifier is faithful; it just had nothing to report.Change
tests/uat/matrix.py: addquiet: bool = Truetobenchbox_run_argvto builda verbose argv on demand. Default stays
True, preserving result-path parsing;--quietnow appended conditionally (ordering preserved relative to--phases).tests/uat/runner.py: when a cell exits non-zero with no captured output(and did not time out), re-run it once verbosely (
quiet=False, stderrmerged into stdout, capped at
DIAGNOSTIC_RERUN_TIMEOUT_S=180s) via_append_diagnostic_rerun, appending the transcript as plain lines so_cell_log_tailsurfaces it intofailure_tail. Best-efforttry/exceptsodiagnostics never mask the original failure.
This makes the next sweep self-describing — real per-cell errors land in
failure_tail/cells.jsonlinstead of being swallowed.Verification
pytest tests/uat/test_runner.py→ 24 passed;tests/uat/test_matrix.py→ 185 passedruff format/ruff check/ty checkclean on both files;py_compilecleanmake pr-preflight-fast-testspassed at first push (matrix.py); re-run afterthe runner.py commit to validate the full change end-to-end.
Root-cause analysis & follow-ups (diagnosis-gated)
This PR fixes the diagnostic keystone. The remaining root causes need a live
docker-backed repro loop (now far easier with verbose tails). Each should be its
own follow-up:
docker up=ok)--quiet; suspect uv-extra wiring or connection contract (cells pass no host/port, unlike starrocks/cedardb)PLATFORM_UV_EXTRA/PLATFORM_EXTRA_OPTSBind 0.0.0.0:5435 already allocated(stale container); (b) sweep hung mid-run ontpcds_obt 1.0— per-cell JSONs were cleandocker compose downinfinally; pre-upport preflightcells.jsonl/matrix_summarywrite_primitivesload: "No files found that match pattern path_…";flightdata/datavaultsf=1.0 timeoutswrite_primitivesdatagen path-glob; raisetimeout_s(or mark slow) for sf=1.0 outliersNotes
time-capped, so it adds no cost to passing or already-diagnosed cells.
🤖 Generated with Claude Code