feat(flows): switch default QA LLM to qwen36-deep + 1-retry safety net for agent buy #496
Merged
The agent step at flow-13/14 step 46 sends a long single-shot prompt
to the obol-agent (qwen36-fast, ~4B params) telling it to invoke
buy.py via its terminal tool. qwen36-fast occasionally narrates a
fabricated failure (HTTP 404 path-doubling, eRPC DNS error, etc.)
instead of actually running the bash command. When that happens, no
PurchaseRequest is created and step 47 fails with "PurchaseRequest CR
not ready" — even though buy.py was never invoked.
This commit factors the prompt into agent_buy_with_retry() in
lib-dual-stack.sh and replaces both flow-13 and flow-14 step 46 with
a single call. The wrapper:
1. Sends the prompt as before.
2. Polls bob's hermes-obol-agent namespace for the alice-obol
PurchaseRequest (PR) for up to 60s.
3. If the PR doesn't appear, prints a LOUD warning box flagging
this as documented agent unreliability and re-sends the prompt
once.
4. If still absent, step 47 fails as before.
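The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in lib-dual-stack.sh: the helper names come from the PR description, but their bodies, the kubectl resource name, and the AGENT_BUY_POLL_SECS / AGENT_BUY_POLL_INTERVAL knobs are assumptions made so the sketch is self-contained.

```shell
#!/usr/bin/env bash
# Sketch only. Helper bodies are placeholders, not the real implementation.

# Hypothetical knobs (not in the original) so the loop can be exercised quickly.
AGENT_BUY_POLL_SECS="${AGENT_BUY_POLL_SECS:-60}"
AGENT_BUY_POLL_INTERVAL="${AGENT_BUY_POLL_INTERVAL:-2}"

# Placeholder: the real helper sends the long buy prompt to the obol-agent.
_agent_buy_send_prompt() {
  echo "sending buy prompt (attempt $1)"
}

# Placeholder: assumed kubectl form; the actual resource name, object name,
# and cluster context used by the flows may differ.
_agent_buy_pr_exists() {
  kubectl get purchaserequest alice-obol -n hermes-obol-agent >/dev/null 2>&1
}

# Poll until the PurchaseRequest shows up or the deadline passes.
_agent_buy_wait_for_pr() {
  local deadline=$(( SECONDS + AGENT_BUY_POLL_SECS ))
  while (( SECONDS < deadline )); do
    _agent_buy_pr_exists && return 0
    sleep "$AGENT_BUY_POLL_INTERVAL"
  done
  _agent_buy_pr_exists  # one last check at the deadline
}

agent_buy_with_retry() {
  local attempt
  for attempt in 0 1; do
    if (( attempt == 1 )); then
      {
        echo "########################################################"
        echo "# WARN: no PurchaseRequest after ${AGENT_BUY_POLL_SECS}s; re-sending prompt once"
        echo "########################################################"
      } >&2
    fi
    _agent_buy_send_prompt "$attempt"
    if _agent_buy_wait_for_pr; then
      echo "PurchaseRequest present (retry=$attempt)"
      return 0
    fi
  done
  return 1  # still absent: the following step is left to fail as before
}
```

Bounding the loop at exactly two attempts keeps a real regression visible: it still fails, only about one polling window later.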
Net effect: a single-attempt FAIL caused by agent flake becomes a PASS
on retry, while a real regression still fails loudly after the second
attempt. The WARN box on retry is the audit trail: if it fires
regularly, the smoke needs a more reliable LLM (qwen36-deep /
qwen36-35b-heretic) or a non-agent fallback.
Refers: plans/inference-v1337-followup-20260514.md (the v1337 buy
attempt-5 SIGKILL false-positive was the same flake class)
Saves ~50 lines of duplication between the two flow scripts.
…eep (27B)
The smaller qwen36-fast was the previous default for OBOL_LLM_MODEL across
release-smoke and flow-{03,04,11,13,14} plus buy-external. It's documented as
flaky on the long single-shot agent-buy prompt at flow-13/14 step 46 (see the
retry-wrapper rationale added in the prior commit, plus
plans/inference-v1337-followup-20260514.md).
Switching the default to qwen36-deep (27B-class, also served by the same
spark1 vLLM endpoint) trades a bit of latency for much more reliable
tool-call behaviour. Operators can still pin the smaller model explicitly via
OBOL_LLM_MODEL=qwen36-fast for fast iteration on non-agent flows.
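One common way to express "default unless the operator pins a value" in shell is parameter expansion with a fallback. The snippet below is an illustrative sketch; resolve_llm_model is a hypothetical helper, and the flow scripts may spell the default differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the model the flows would use.
# ${VAR:-default} falls back when OBOL_LLM_MODEL is unset or empty.
resolve_llm_model() {
  printf '%s\n' "${OBOL_LLM_MODEL:-qwen36-deep}"
}

# Usage:
#   resolve_llm_model                              # default applies
#   OBOL_LLM_MODEL=qwen36-fast resolve_llm_model   # explicit pin wins
```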
Files changed:
- flows/lib.sh, flows/release-smoke.sh, flows/flow-{03,04,11,13,14}*.sh,
flows/buy-external.sh — default value switch
- flows/lib-dual-stack.sh — WARN box in agent_buy_with_retry now
recommends checking the model is qwen36-deep first; mentions
qwen36-35b-heretic as the next escalation
- CLAUDE.md, .agents/skills/obol-stack-dev/{SKILL.md,references/*.md}
— documentation refreshed
Not changed (intentional):
- internal/{model,hermes}/*_test.go — qwen36-fast is a test fixture for
the rank parser, not a default; switching would invalidate test
expectations without changing test intent
- plans/post-490-integration-20260513.md — historical record
…ive)
The authorization-header-value rule fires on `-H "Authorization: Bearer
$BOB_TOKEN"` because the broad `\S+` match treats the shell variable as a
high-entropy literal. The actual token comes from $BOB_TOKEN at runtime
(set elsewhere in the flow), not from the literal source text.
Adds a narrowly scoped allowlist regex matching only the shell variable
expansion form (`$VAR` / `${VAR}`). A genuinely hardcoded Bearer string
like `Bearer abc123def456...` still trips the rule because the allowlist
regex requires a literal `$`.
Triggered on PR #496 (the agent_buy_with_retry helper inherited the
existing flow-13/14 step 46 idiom; the original sites in flow-03/04 and
buy-external are pre-existing on main and so never appeared in a PR diff
scan).
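To illustrate the distinction the allowlist draws, here is a small sketch using grep -E. The pattern is an assumed approximation of "Bearer followed by `$VAR` or `${VAR}`", not the regex actually added to the scanner config, and is_allowlisted is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumed pattern: "Bearer", whitespace, a literal "$", an optional "{",
# a shell identifier, and an optional "}". Bracket expressions are used
# for the literal characters to keep the ERE portable.
ALLOW_RE='Bearer[[:space:]]+[$][{]?[A-Za-z_][A-Za-z0-9_]*[}]?'

# Hypothetical helper: succeeds when the line uses a shell variable expansion.
is_allowlisted() {
  printf '%s\n' "$1" | grep -Eq "$ALLOW_RE"
}

# Usage:
#   is_allowlisted '-H "Authorization: Bearer $BOB_TOKEN"'    # succeeds
#   is_allowlisted '-H "Authorization: Bearer abc123def456"'  # fails
```

Because the pattern requires a literal `$` right after the whitespace, a hardcoded token never matches it and would still be reported.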
Summary
Two-commit change addressing LLM reliability in flow-13 / flow-14 step 46 (the agent-driven buy prompt):
- `fix(flows): 1-retry wrapper for agent buy prompt in flow-13/14` (7a7d51b): factor the buy prompt into `agent_buy_with_retry()` in `flows/lib-dual-stack.sh`; if no PurchaseRequest CR appears within 60s, print a loud WARN box and re-send the prompt once before letting step 47 fail.
- `chore(flows): switch default QA LLM from qwen36-fast (4B) to qwen36-deep (27B)` (3cea3fb): `qwen36-deep` (27B-class on the same spark1 vLLM endpoint) becomes the default `OBOL_LLM_MODEL` across release-smoke and the flow scripts. The smaller `qwen36-fast` was flaking on the long single-shot agent-buy prompt; the bigger model is much more reliable on tool calls. Operators can still pin `OBOL_LLM_MODEL=qwen36-fast` explicitly when iterating fast on non-agent flows.

Background
Today's release-smoke on main tip `7850332` failed with 11/13 PASS, 2 FAIL, both in the agent-driven buy flows. The agent (`qwen36-fast`, ~4B) narrated fabricated failure reasons (HTTP 404 path doubling in flow-13, eRPC DNS error in flow-14) and never actually invoked `buy.py`. A back-to-back re-run with the same code passed cleanly 2/2, which confirms agent flake rather than a regression.

This PR addresses both layers: prevention (use the more reliable model by default) and defense in depth (retry once with a loud audit trail when flake recurs).
Net behavior
- … `retry=0` in pass message)
- … `retry=1`)

The WARN box names the next escalation steps explicitly (verify `OBOL_LLM_MODEL=qwen36-deep` is set, escalate to `qwen36-35b-heretic`, or add a non-agent fallback path) so the next debugger has a clear ladder.

Files changed
- `flows/lib-dual-stack.sh`: + `agent_buy_with_retry()` + private `_agent_buy_send_prompt()` / `_agent_buy_pr_exists()` helpers; WARN box content updated for the new default
- `flows/flow-13-dual-stack-obol.sh`: `agent_buy_with_retry`; default + assertion comment updated
- `flows/flow-14-live-obol-base-sepolia.sh`: `agent_buy_with_retry`; default updated
- `flows/flow-{03,04,11}*.sh`, `flows/buy-external.sh`, `flows/lib.sh`, `flows/release-smoke.sh`: `OBOL_LLM_MODEL` value switch
- `CLAUDE.md`, `.agents/skills/obol-stack-dev/{SKILL.md,references/*.md}`: documentation refreshed

Not touched on purpose:
- `internal/{model,hermes}/*_test.go` use `qwen36-fast` as a test fixture for the rank parser, not a default; flipping it would invalidate test expectations without changing test intent.
- `plans/post-490-integration-20260513.md` keeps the old default in its historical narrative.

Test plan
- `bash -n` clean across all 9 touched flow scripts
- … `qwen36-fast` reference outside intentional historical/explanatory context
- … `OBOL_LLM_MODEL=qwen36-fast` and confirming the WARN box fires + the retry recovers
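The `bash -n` item can be reproduced with a small loop. This is a sketch under assumptions: syntax_check_all is a hypothetical helper, and the glob in the usage comment is a guess at how the touched scripts would be selected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse each script with `bash -n` (syntax check only,
# nothing is executed) and report every file that fails to parse.
syntax_check_all() {
  local rc=0 f
  for f in "$@"; do
    if ! bash -n "$f" 2>/dev/null; then
      echo "syntax error: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage (assumed glob):
#   syntax_check_all flows/*.sh
```

Note that `bash -n` only catches parse errors; it does not detect a misspelled function name or an unset variable at runtime.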