feat(flows): switch default QA LLM to qwen36-deep + 1-retry safety net for agent buy #496

Merged: bussyjd merged 3 commits into main from fix/agent-buy-retry-wrapper, May 15, 2026

@bussyjd bussyjd commented May 15, 2026

Summary

Two-commit change addressing LLM reliability in flow-13 / flow-14 step 46 (the agent-driven buy prompt):

  1. fix(flows): 1-retry wrapper for agent buy prompt in flow-13/14 (7a7d51b) — factor the buy prompt into agent_buy_with_retry() in flows/lib-dual-stack.sh; if no PurchaseRequest CR appears within 60s, print a loud WARN box and re-send the prompt once before letting step 47 fail.

  2. chore(flows): switch default QA LLM from qwen36-fast (4B) to qwen36-deep (27B) (3cea3fb) — qwen36-deep (27B-class on the same spark1 vLLM endpoint) becomes the default OBOL_LLM_MODEL across release-smoke and the flow scripts. The smaller qwen36-fast was flaking on the long single-shot agent-buy prompt; the bigger model is much more reliable on tool calls. Operators can still pin OBOL_LLM_MODEL=qwen36-fast explicitly when iterating fast on non-agent flows.

Background

Today's release-smoke on main tip 7850332 failed with 11/13 PASS, 2 FAIL — both in the agent-driven buy flows. The agent (qwen36-fast, ~4B) narrated fabricated failure reasons (HTTP 404 path doubling in flow-13, eRPC DNS error in flow-14) and never actually invoked buy.py. A back-to-back re-run with the same code passed cleanly 2/2 — confirmed agent flake, not regression.

This PR addresses both layers: prevention (use the more reliable model by default) and defense in depth (retry once with a loud audit trail when flake recurs).

Net behavior

| | Before | After |
|---|---|---|
| Default model | qwen36-fast (4B) | qwen36-deep (27B) |
| Healthy run | PASS | PASS (retry=0 in pass message) |
| Agent flake (reduced rate) | FAIL | PASS with loud WARN box (retry=1) |
| Real code regression | FAIL | FAIL after 2 attempts (~3min vs 1min; acceptable trade) |

The WARN box names the next escalation steps explicitly (verify OBOL_LLM_MODEL=qwen36-deep is set, escalate to qwen36-35b-heretic, or add a non-agent fallback path) so the next debugger has a clear ladder.

Files changed

| File | Change |
|---|---|
| flows/lib-dual-stack.sh | +agent_buy_with_retry() + private _agent_buy_send_prompt() / _agent_buy_pr_exists() helpers; WARN box content updated for the new default |
| flows/flow-13-dual-stack-obol.sh | step 46 collapsed to agent_buy_with_retry; default + assertion comment updated |
| flows/flow-14-live-obol-base-sepolia.sh | step 46 collapsed to agent_buy_with_retry; default updated |
| flows/flow-{03,04,11}*.sh, flows/buy-external.sh, flows/lib.sh, flows/release-smoke.sh | default OBOL_LLM_MODEL value switch |
| CLAUDE.md, .agents/skills/obol-stack-dev/{SKILL.md,references/*.md} | doc refresh |

Not touched on purpose: internal/{model,hermes}/*_test.go use qwen36-fast as a test fixture for the rank parser — not a default; flipping it would invalidate test expectations without changing test intent. plans/post-490-integration-20260513.md keeps the old default in its historical narrative.

Test plan

  • bash -n clean across all 9 touched flow scripts
  • No remaining qwen36-fast reference outside intentional historical/explanatory context
  • spark1 release-smoke against this branch — both flow-13 and flow-14 should PASS first try with the WARN box absent (deep is reliable + the wrapper is a safety net)
  • (optional) Force a flake by temporarily overriding OBOL_LLM_MODEL=qwen36-fast and confirming the WARN box fires + the retry recovers
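
The `bash -n` check in the first bullet can be scripted as a small loop. This is a minimal sketch, not part of the PR: `syntax_check` is a hypothetical helper name, and the glob in the usage note assumes the flow scripts live under `flows/`.

```shell
#!/usr/bin/env bash
# Sketch: run `bash -n` (parse only, no execution) over each script and
# report every file that fails, instead of stopping at the first one.
# `syntax_check` is a hypothetical name, not a helper from this repo.
syntax_check() {
  local failed=0
  local f
  for f in "$@"; do
    if ! bash -n "$f" 2>/dev/null; then
      echo "syntax error: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Usage would look like `syntax_check flows/*.sh`; a non-zero exit status means at least one script failed to parse.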

bussyjd added 2 commits May 15, 2026 14:25
The agent step at flow-13/14 step 46 sends a long single-shot prompt
to the obol-agent (qwen36-fast, ~4B params) telling it to invoke
buy.py via its terminal tool. qwen36-fast occasionally narrates a
fabricated failure (HTTP 404 path-doubling, eRPC DNS error, etc.)
instead of actually running the bash command. When that happens, no
PurchaseRequest is created and step 47 fails with "PurchaseRequest CR
not ready" — even though buy.py was never invoked.

This commit factors the prompt into agent_buy_with_retry() in
lib-dual-stack.sh and replaces both flow-13 and flow-14 step 46 with
a single call. The wrapper:

  1. Sends the prompt as before.
  2. Polls bob's hermes-obol-agent namespace for the alice-obol PR
     for up to 60s.
  3. If the PR doesn't appear, prints a LOUD warning box flagging
     this as documented agent unreliability and re-sends the prompt
     once.
  4. If still absent, step 47 fails as before.
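
The four steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in lib-dual-stack.sh: the helper names come from this PR, but their bodies here are placeholders, and `_agent_buy_wait_for_pr`, `AGENT_BUY_WAIT_SECS`, `AGENT_BUY_POLL_SECS`, and `AGENT_BUY_RETRIES` are hypothetical names introduced for the sketch.

```shell
#!/usr/bin/env bash
# Sketch only: helper bodies are placeholders, not the real implementation.

AGENT_BUY_WAIT_SECS="${AGENT_BUY_WAIT_SECS:-60}"  # 60s window from the PR text
AGENT_BUY_POLL_SECS="${AGENT_BUY_POLL_SECS:-5}"   # assumed poll interval

# Placeholder: the real helper sends the buy prompt to the obol-agent.
_agent_buy_send_prompt() {
  echo "sending agent buy prompt (attempt $1)"
}

# Placeholder: the real helper checks for the PurchaseRequest CR.
# It must return 0 once the CR exists.
_agent_buy_pr_exists() {
  return 1
}

# Hypothetical helper: poll until the CR appears or the window elapses.
_agent_buy_wait_for_pr() {
  local waited=0
  while [ "$waited" -lt "$AGENT_BUY_WAIT_SECS" ]; do
    if _agent_buy_pr_exists; then return 0; fi
    sleep "$AGENT_BUY_POLL_SECS"
    waited=$(( waited + AGENT_BUY_POLL_SECS ))
  done
  _agent_buy_pr_exists   # one final check at the deadline
}

agent_buy_with_retry() {
  local attempt
  for attempt in 1 2; do
    if [ "$attempt" -eq 2 ]; then
      {
        echo "################################################################"
        echo "# WARN: no PurchaseRequest CR after ${AGENT_BUY_WAIT_SECS}s."
        echo "# Treating this as agent flake and re-sending the prompt once."
        echo "################################################################"
      } >&2
    fi
    _agent_buy_send_prompt "$attempt"
    if _agent_buy_wait_for_pr; then
      AGENT_BUY_RETRIES=$(( attempt - 1 ))  # 0 on a healthy run, 1 after a retry
      return 0
    fi
  done
  AGENT_BUY_RETRIES=1
  return 1
}
```

The sketch returns non-zero after the second miss for convenience; per the description above, the actual failure surfaces at step 47.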

Net effect: probabilistic single-attempt FAILs become reliable PASSes
on real flake while still failing loudly on a real regression. The
WARN box on retry is the audit trail — if it fires regularly, the
smoke needs a more reliable LLM (qwen36-deep / qwen36-35b-heretic)
or a non-agent fallback.

Refers: plans/inference-v1337-followup-20260514.md (the v1337 buy
attempt-5 SIGKILL false-positive was the same flake class)

Saves ~50 lines of duplication between the two flow scripts.
…eep (27B)

The smaller qwen36-fast was the previous default for OBOL_LLM_MODEL across
release-smoke and flow-{03,04,11,13,14} plus buy-external. It's documented as
flaky on the long single-shot agent-buy prompt at flow-13/14 step 46 (see the
retry-wrapper rationale added in the prior commit, plus
plans/inference-v1337-followup-20260514.md).

Switching the default to qwen36-deep (27B-class, also served by the same
spark1 vLLM endpoint) trades a bit of latency for much more reliable
tool-call behaviour. Operators can still pin the smaller model explicitly via
OBOL_LLM_MODEL=qwen36-fast for fast iteration on non-agent flows.
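
A default with an operator override is commonly written with shell parameter expansion. The snippet below is a minimal sketch of that pattern, assuming the scripts resolve the model this way; `resolve_llm_model` is a hypothetical helper, not a function from the flow scripts.

```shell
#!/usr/bin/env bash
# Sketch: resolve the QA model, preferring an explicit operator override.
# `resolve_llm_model` is a hypothetical name used only for illustration.
resolve_llm_model() {
  # ${VAR:-default} falls back when OBOL_LLM_MODEL is unset or empty.
  printf '%s\n' "${OBOL_LLM_MODEL:-qwen36-deep}"
}

OBOL_LLM_MODEL="$(resolve_llm_model)"
export OBOL_LLM_MODEL
echo "using model: $OBOL_LLM_MODEL"
```

With this pattern, running a flow as `OBOL_LLM_MODEL=qwen36-fast bash flows/flow-03*.sh` would keep the smaller model, while a bare invocation picks up qwen36-deep.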

Files changed:
- flows/lib.sh, flows/release-smoke.sh, flows/flow-{03,04,11,13,14}*.sh,
  flows/buy-external.sh — default value switch
- flows/lib-dual-stack.sh — WARN box in agent_buy_with_retry now
  recommends checking the model is qwen36-deep first; mentions
  qwen36-35b-heretic as the next escalation
- CLAUDE.md, .agents/skills/obol-stack-dev/{SKILL.md,references/*.md}
  — documentation refreshed

Not changed (intentional):
- internal/{model,hermes}/*_test.go — qwen36-fast is a test fixture for
  the rank parser, not a default; switching would invalidate test
  expectations without changing test intent
- plans/post-490-integration-20260513.md — historical record
@bussyjd bussyjd changed the title from "fix(flows): 1-retry wrapper for agent buy prompt in flow-13/14" to "feat(flows): switch default QA LLM to qwen36-deep + 1-retry safety net for agent buy" May 15, 2026
…ive)

The authorization-header-value rule fires on `-H "Authorization: Bearer
$BOB_TOKEN"` because the broad `\S+` match treats the shell variable as a
high-entropy literal. The actual token comes from $BOB_TOKEN at runtime
(set elsewhere in the flow), not from the literal source text.

Adds a narrowly scoped allowlist regex matching only the shell variable
expansion form (`$VAR` / `${VAR}`). A genuinely hardcoded Bearer string
like `Bearer abc123def456...` still trips the rule because the allowlist
regex requires a literal `$`.
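
The distinction can be illustrated with a small grep check. This is a sketch under assumptions: `ALLOW_RE` is a hypothetical pattern written in POSIX extended regex syntax, not the regex added to the scanner config in this commit, and `is_allowlisted` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: match "Bearer" followed only by a shell variable expansion,
# either $VAR or ${VAR}. The pattern is an assumption for illustration.
ALLOW_RE='Bearer[[:space:]]+\$(\{[A-Za-z_][A-Za-z0-9_]*\}|[A-Za-z_][A-Za-z0-9_]*)'

# Returns 0 when the line would be allowlisted, 1 otherwise.
is_allowlisted() {
  printf '%s\n' "$1" | grep -Eq -- "$ALLOW_RE"
}
```

Under this pattern, `is_allowlisted '-H "Authorization: Bearer $BOB_TOKEN"'` succeeds, while a line containing `Bearer abc123def456` does not match because no literal `$` follows `Bearer`.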

Triggered on PR #496 (the agent_buy_with_retry helper inherited the
existing flow-13/14 step 46 idiom; the original sites in flow-03/04 and
buy-external are pre-existing on main and so never appeared in a PR diff
scan).
@bussyjd bussyjd merged commit e2a17a4 into main May 15, 2026
7 checks passed