feat(flows): switch default QA LLM to qwen36-deep + 1-retry safety net for agent buy by bussyjd · Pull Request #496 · ObolNetwork/obol-stack

bussyjd · 2026-05-15T06:26:25Z

Summary

Two-commit change addressing LLM reliability in flow-13 / flow-14 step 46 (the agent-driven buy prompt):

- fix(flows): 1-retry wrapper for agent buy prompt in flow-13/14 (7a7d51b): factor the buy prompt into agent_buy_with_retry() in flows/lib-dual-stack.sh; if no PurchaseRequest CR appears within 60s, print a loud WARN box and re-send the prompt once before letting step 47 fail.
- chore(flows): switch default QA LLM from qwen36-fast (4B) to qwen36-deep (27B) (3cea3fb): qwen36-deep (27B-class on the same spark1 vLLM endpoint) becomes the default OBOL_LLM_MODEL across release-smoke and the flow scripts. The smaller qwen36-fast was flaking on the long single-shot agent-buy prompt; the bigger model is much more reliable on tool calls. Operators can still pin OBOL_LLM_MODEL=qwen36-fast explicitly when iterating fast on non-agent flows.

Background

Today's release-smoke on main tip 7850332 failed with 11/13 PASS, 2 FAIL — both in the agent-driven buy flows. The agent (qwen36-fast, ~4B) narrated fabricated failure reasons (HTTP 404 path doubling in flow-13, eRPC DNS error in flow-14) and never actually invoked buy.py. A back-to-back re-run with the same code passed cleanly 2/2 — confirmed agent flake, not regression.

This PR addresses both layers: prevention (use the more reliable model by default) and defense in depth (retry once, with a loud audit trail, when the flake recurs).

Net behavior

| | Before | After |
|---|---|---|
| Default model | qwen36-fast (4B) | qwen36-deep (27B) |
| Healthy run | PASS | PASS (`retry=0` in pass message) |
| Agent flake (reduced rate) | FAIL | PASS with loud WARN box (`retry=1`) |
| Real code regression | FAIL | FAIL after 2 attempts (~3min vs 1min, acceptable trade) |

The WARN box names the next escalation steps explicitly (verify OBOL_LLM_MODEL=qwen36-deep is set, escalate to qwen36-35b-heretic, or add a non-agent fallback path) so the next debugger has a clear ladder.
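A minimal sketch of what such a WARN box could look like. The function name `agent_buy_warn_box`, the box layout, and the exact wording are assumptions for illustration; only the three escalation steps come from this PR.

```shell
# Sketch only: name, layout, and wording are illustrative, not copied from
# flows/lib-dual-stack.sh. The three escalation steps are the ones the PR lists.
agent_buy_warn_box() {
  local model="${OBOL_LLM_MODEL:-qwen36-deep}"
  local rule="+------------------------------------------------------------+"
  {
    echo "$rule"
    # printf reuses the format for every argument, so each string
    # becomes one padded row of the box.
    printf '| %-58s |\n' \
      "WARN: no PurchaseRequest after agent buy prompt; retrying" \
      "Current OBOL_LLM_MODEL: ${model}" \
      "If this fires regularly:" \
      "  1. Verify OBOL_LLM_MODEL=qwen36-deep is set" \
      "  2. Escalate to qwen36-35b-heretic" \
      "  3. Add a non-agent fallback path"
    echo "$rule"
  } >&2
}
```

Writing the box to stderr keeps it visible in CI logs without mixing it into any stdout a caller might capture.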

Files changed

| File | Change |
|---|---|
| `flows/lib-dual-stack.sh` | `+agent_buy_with_retry()` + private `_agent_buy_send_prompt()` / `_agent_buy_pr_exists()` helpers; WARN box content updated for the new default |
| `flows/flow-13-dual-stack-obol.sh` | step 46 collapsed to `agent_buy_with_retry`; default + assertion comment updated |
| `flows/flow-14-live-obol-base-sepolia.sh` | step 46 collapsed to `agent_buy_with_retry`; default updated |
| `flows/flow-{03,04,11}*.sh`, `flows/buy-external.sh`, `flows/lib.sh`, `flows/release-smoke.sh` | default `OBOL_LLM_MODEL` value switch |
| `CLAUDE.md`, `.agents/skills/obol-stack-dev/{SKILL.md,references/*.md}` | doc refresh |

Not touched on purpose: internal/{model,hermes}/*_test.go use qwen36-fast as a test fixture for the rank parser — not a default; flipping it would invalidate test expectations without changing test intent. plans/post-490-integration-20260513.md keeps the old default in its historical narrative.

Test plan

- bash -n clean across all 9 touched flow scripts
- No remaining qwen36-fast reference outside intentional historical/explanatory context
- spark1 release-smoke against this branch: both flow-13 and flow-14 should PASS first try with the WARN box absent (deep is reliable + the wrapper is a safety net)
- (optional) Force a flake by temporarily overriding OBOL_LLM_MODEL=qwen36-fast and confirming the WARN box fires + the retry recovers
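The first two checks in the test plan could be scripted roughly as below. Both function names are hypothetical helpers, not part of the repository; `bash -n` and `grep -rl` are the only real tools relied on.

```shell
# Sketch: hypothetical helpers for the static checks in the test plan.

# Parse each script with `bash -n` (syntax check only, nothing is executed)
# and report any file that fails.
check_flow_syntax() {
  local rc=0 f
  for f in "$@"; do
    if ! bash -n "$f" 2>/dev/null; then
      echo "syntax error: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# List files that still mention the old default model. Hits are not
# failures by themselves: some references are left in place on purpose
# (test fixtures, historical plans) and need a manual look.
find_old_default_refs() {
  grep -rl -- 'qwen36-fast' "$@" 2>/dev/null
}
```

Typical use would be `check_flow_syntax flows/*.sh` followed by `find_old_default_refs flows CLAUDE.md .agents`, then reviewing each hit by hand.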

The agent step at flow-13/14 step 46 sends a long single-shot prompt to the obol-agent (qwen36-fast, ~4B params) telling it to invoke buy.py via its terminal tool. qwen36-fast occasionally narrates a fabricated failure (HTTP 404 path-doubling, eRPC DNS error, etc.) instead of actually running the bash command. When that happens, no PurchaseRequest is created and step 47 fails with "PurchaseRequest CR not ready", even though buy.py was never invoked.

This commit factors the prompt into agent_buy_with_retry() in lib-dual-stack.sh and replaces both flow-13 and flow-14 step 46 with a single call. The wrapper:

1. Sends the prompt as before.
2. Polls bob's hermes-obol-agent namespace for the alice-obol PurchaseRequest (PR) for up to 60s.
3. If the PR doesn't appear, prints a LOUD warning box flagging this as documented agent unreliability and re-sends the prompt once.
4. If still absent, step 47 fails as before.

Net effect: probabilistic single-attempt FAILs become reliable PASSes on real flake while still failing loudly on a real regression. The WARN box on retry is the audit trail: if it fires regularly, the smoke needs a more reliable LLM (qwen36-deep / qwen36-35b-heretic) or a non-agent fallback.

Refers: plans/inference-v1337-followup-20260514.md (the v1337 buy attempt-5 SIGKILL false-positive was the same flake class)

Saves ~50 lines of duplication between the two flow scripts.
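The wrapper steps described in this commit message can be sketched as follows. This is not the code from flows/lib-dual-stack.sh: the two helper bodies are stand-ins, and `AGENT_BUY_TIMEOUT`, `AGENT_BUY_POLL`, and `_agent_buy_wait_for_pr` are hypothetical names introduced for illustration. The non-zero return on a second miss is also an assumption; in the PR the failure surfaces at step 47.

```shell
# Sketch only. Hypothetical knobs: the PR states a 60s wait and one retry,
# but not how they are configured.
AGENT_BUY_TIMEOUT="${AGENT_BUY_TIMEOUT:-60}"   # seconds to wait per attempt
AGENT_BUY_POLL="${AGENT_BUY_POLL:-5}"          # seconds between polls

# Stand-in: the real helper sends the buy prompt to the agent.
_agent_buy_send_prompt() { echo "prompt sent" >&2; }

# Stand-in: the real helper checks whether the PurchaseRequest CR exists.
_agent_buy_pr_exists() { return 1; }

# Poll until the PurchaseRequest shows up or the timeout elapses.
_agent_buy_wait_for_pr() {
  local waited=0
  while [ "$waited" -lt "$AGENT_BUY_TIMEOUT" ]; do
    if _agent_buy_pr_exists; then return 0; fi
    sleep "$AGENT_BUY_POLL"
    waited=$(( waited + AGENT_BUY_POLL ))
  done
  _agent_buy_pr_exists   # one last check at the deadline
}

# Attempt 0 is the normal send; attempt 1 is the single retry.
agent_buy_with_retry() {
  local attempt
  for attempt in 0 1; do
    if [ "$attempt" -eq 1 ]; then
      echo "WARN: no PurchaseRequest after ${AGENT_BUY_TIMEOUT}s; re-sending the prompt once" >&2
    fi
    _agent_buy_send_prompt
    if _agent_buy_wait_for_pr; then
      echo "PurchaseRequest present (retry=${attempt})"
      return 0
    fi
  done
  return 1   # still absent after the retry
}
```

Capping the loop at two attempts keeps a real regression failing within a bounded time instead of masking it behind repeated retries.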

…eep (27B)

The smaller qwen36-fast was the previous default for OBOL_LLM_MODEL across release-smoke and flow-{03,04,11,13,14} plus buy-external. It's documented as flaky on the long single-shot agent-buy prompt at flow-13/14 step 46 (see the retry-wrapper rationale added in the prior commit, plus plans/inference-v1337-followup-20260514.md).

Switching the default to qwen36-deep (27B-class, also served by the same spark1 vLLM endpoint) trades a bit of latency for much more reliable tool-call behaviour. Operators can still pin the smaller model explicitly via OBOL_LLM_MODEL=qwen36-fast for fast iteration on non-agent flows.

Files changed:
- flows/lib.sh, flows/release-smoke.sh, flows/flow-{03,04,11,13,14}*.sh, flows/buy-external.sh: default value switch
- flows/lib-dual-stack.sh: WARN box in agent_buy_with_retry now recommends checking the model is qwen36-deep first; mentions qwen36-35b-heretic as the next escalation
- CLAUDE.md, .agents/skills/obol-stack-dev/{SKILL.md,references/*.md}: documentation refreshed

Not changed (intentional):
- internal/{model,hermes}/*_test.go: qwen36-fast is a test fixture for the rank parser, not a default; switching would invalidate test expectations without changing test intent
- plans/post-490-integration-20260513.md: historical record
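For readers unfamiliar with the pattern, a default that operators can still override is commonly written with shell parameter expansion. The sketch below assumes that idiom; the function name `resolve_llm_model` is hypothetical and the flow scripts may express the default differently.

```shell
# Sketch, assuming the default is applied via parameter expansion.
# ${VAR:-default} yields the default when VAR is unset or empty,
# and the caller's value otherwise.
resolve_llm_model() {
  echo "${OBOL_LLM_MODEL:-qwen36-deep}"
}
```

Under that assumption, exporting `OBOL_LLM_MODEL=qwen36-fast` before invoking a script that resolves its model this way would pin the smaller model for that run only.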

…ive)

The authorization-header-value rule fires on `-H "Authorization: Bearer $BOB_TOKEN"` because the broad `\S+` match treats the shell variable as a high-entropy literal. The actual token comes from $BOB_TOKEN at runtime (set elsewhere in the flow), not from the literal source text.

Adds a narrowly scoped allowlist regex matching only the shell variable expansion form (`$VAR` / `${VAR}`). A genuinely hardcoded Bearer string like `Bearer abc123def456...` still trips the rule because the allowlist regex requires a literal `$`.

Triggered on PR #496 (the agent_buy_with_retry helper inherited the existing flow-13/14 step 46 idiom; the original sites in flow-03/04 and buy-external are pre-existing on main and so never appeared in a PR diff scan).
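To illustrate the distinction this commit message draws, here is a small check using `grep -E`. The pattern `ALLOW_RE` and the function `is_allowlisted` are assumptions written for this example; the regex actually added to the scanner config is not shown in the PR text.

```shell
# Sketch: a hypothetical allowlist pattern that accepts only a Bearer value
# written as a shell variable expansion ($VAR or ${VAR}).
# [$], [{] and [}] are bracket expressions matching those literal characters.
ALLOW_RE='Bearer [$]([{][A-Za-z_][A-Za-z0-9_]*[}]|[A-Za-z_][A-Za-z0-9_]*)'

# Succeeds when the given source line contains an allowlisted Bearer form.
is_allowlisted() {
  printf '%s\n' "$1" | grep -Eq -- "$ALLOW_RE"
}
```

With this pattern, a line containing `Bearer $BOB_TOKEN` or `Bearer ${BOB_TOKEN}` is allowlisted, while a literal such as `Bearer abc123def456` is not, because the pattern requires a `$` directly after `Bearer `.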

bussyjd added 2 commits May 15, 2026 14:25

bussyjd changed the title ~~fix(flows): 1-retry wrapper for agent buy prompt in flow-13/14~~ feat(flows): switch default QA LLM to qwen36-deep + 1-retry safety net for agent buy May 15, 2026

bussyjd merged commit e2a17a4 into main May 15, 2026
7 checks passed

bussyjd merged 3 commits into main from fix/agent-buy-retry-wrapper
