Skip to content

fix(uat): capture diagnostics when --quiet hides cell failures#720

Merged
joeharris76 merged 2 commits into
developfrom
fix/uat-fast-platform-failures
May 31, 2026
Merged

fix(uat): capture diagnostics when --quiet hides cell failures#720
joeharris76 merged 2 commits into
developfrom
fix/uat-fast-platform-failures

Conversation

@joeharris76
Copy link
Copy Markdown
Owner

@joeharris76 joeharris76 commented May 31, 2026

Summary

Fast-platform UAT sweeps produced a wall of identical, opaque failures —
status=failed terminal_state=no_json_nonzero submit_terminal_state=missing_manifest
with an empty failure_tail — across clickhouse-server (57/57 and 22/22) and
scattered duckdb / datafusion / lakesail cells. That single signature was hiding
several unrelated root causes, made indistinguishable by one diagnostic gap.

Root cause of the opacity (fixed here): UAT cells run benchbox run --quiet
so the final stdout line is the result-JSON path the runner parses. But --quiet
also suppresses benchbox's own error output, so a non-zero exit left the
per-cell log empty ((no subprocess output captured)) — every distinct failure
collapsed to the same signature.

no_json_nonzero + missing_manifest does not mean "ran but wrote no
manifest". It means benchbox itself exited non-zero: runner.py nulls
result_path whenever exit_code != 0 and forces submit_state = missing_manifest. The classifier is faithful; it just had nothing to report.

Change

  • tests/uat/matrix.py: add quiet: bool = True to benchbox_run_argv to build
    a verbose argv on demand. Default stays True, preserving result-path parsing;
    --quiet now appended conditionally (ordering preserved relative to --phases).
  • tests/uat/runner.py: when a cell exits non-zero with no captured output
    (and did not time out), re-run it once verbosely (quiet=False, stderr
    merged into stdout, capped at DIAGNOSTIC_RERUN_TIMEOUT_S=180s) via
    _append_diagnostic_rerun, appending the transcript as plain lines so
    _cell_log_tail surfaces it into failure_tail. Best-effort try/except so
    diagnostics never mask the original failure.

This makes the next sweep self-describing — real per-cell errors land in
failure_tail / cells.jsonl instead of being swallowed.

Verification

  • pytest tests/uat/test_runner.py → 24 passed; tests/uat/test_matrix.py → 185 passed
  • ruff format / ruff check / ty check clean on both files; py_compile clean
  • make pr-preflight-fast-tests passed at first push (matrix.py); re-run after
    the runner.py commit
    to validate the full change end-to-end.

Root-cause analysis & follow-ups (diagnosis-gated)

This PR fixes the diagnostic keystone. The remaining root causes need a live
docker-backed repro loop (now far easier with verbose tails). Each should be its
own follow-up:

# Platform / scope Likely cause Next step
RC-1 clickhouse-server (100%, ~7–8s, docker up=ok) Pre-load failure swallowed by --quiet; suspect uv-extra wiring or connection contract (cells pass no host/port, unlike starrocks/cedardb) Re-run one cell with this fix → read real error → fix wiring in PLATFORM_UV_EXTRA / PLATFORM_EXTRA_OPTS
RC-2 cedardb (a) Bind 0.0.0.0:5435 already allocated (stale container); (b) sweep hung mid-run on tpcds_obt 1.0 — per-cell JSONs were clean Per-cell hard wall-clock kill; docker compose down in finally; pre-up port preflight
RC-3 starrocks Same sweep-hang pattern; per-bench JSONs exist, no cells.jsonl/matrix_summary Same as RC-2
RC-4 duckdb / datafusion / lakesail (scattered) Genuine benchmark failures, correctly reported. e.g. duckdb write_primitives load: "No files found that match pattern path_…"; flightdata/datavault sf=1.0 timeouts Fix write_primitives datagen path-glob; raise timeout_s (or mark slow) for sf=1.0 outliers

Notes

  • No functional change to the benchmark execution path; harness diagnostics only.
  • The diagnostic re-run fires only on failure with empty output (rare) and is
    time-capped, so it adds no cost to passing or already-diagnosed cells.

🤖 Generated with Claude Code

UAT cells run `benchbox run --quiet` so the final stdout line is the
result-JSON path parsed by the runner. But --quiet also suppresses
benchbox's own error output, so a non-zero exit left the per-cell log
empty and every distinct failure collapsed to the same opaque signature
(no_json_nonzero / submit=missing_manifest / failure_tail="").

When a cell exits non-zero with no captured output, re-run it once
verbosely (quiet=False, stderr merged into stdout, capped at 180s) and
append the transcript as plain lines so `_cell_log_tail` surfaces it
into `failure_tail`. Add a `quiet` parameter to `benchbox_run_argv` to
build the verbose argv; the default stays True to preserve result-path
parsing and existing argv ordering.

This unblocks root-causing the fast-platform sweep failures
(clickhouse-server, scattered duckdb/datafusion/lakesail cells) by
making the next run self-describing instead of opaque.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@joeharris76 joeharris76 enabled auto-merge (squash) May 31, 2026 15:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 370a19ce1f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/uat/matrix.py
compression: str | None = None,
extra_args: Iterable[str] = (),
local_managed_platform: bool = False,
quiet: bool = True,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rerun failed cells without --quiet

This only adds a quiet switch to the argv builder, but run_cell() still calls benchbox_run_argv(...) without passing it and never performs the documented non-zero retry (checked tests/uat/runner.py:129-151). In the scenario this commit targets—benchbox run --quiet exits non-zero with no stdout—the per-cell log remains just the command header and any stderr, so failure_tail is still empty and diagnostics are not captured. Wire a retry path that calls this with quiet=False when the first quiet run fails without captured output.

Useful? React with 👍 / 👎.

Completes the diagnostics fix. The prior commit added the `quiet`
parameter to `benchbox_run_argv` but nothing invoked it, leaving it
inert. This wires it up in `run_cell`: when a `--quiet` cell exits
non-zero with no captured output (and did not time out), re-run it once
verbosely via `_append_diagnostic_rerun` (stderr merged into stdout,
capped at DIAGNOSTIC_RERUN_TIMEOUT_S=180s) and append the transcript as
plain lines so `_cell_log_tail` surfaces the real error into
`failure_tail`. Best-effort try/except so diagnostics never mask the
original failure.

Verified: tests/uat/test_runner.py (24) and test_matrix.py (185) pass;
ruff format/check and ty clean; py_compile clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@joeharris76 joeharris76 merged commit 1b42667 into develop May 31, 2026
7 checks passed
@joeharris76 joeharris76 deleted the fix/uat-fast-platform-failures branch May 31, 2026 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant