release-smoke: green 13/13 (supersedes #476 #477 #478 #479 #483 #484) #490
Merged
run_flow now accepts optional env_name/env_value args, replacing three
hand-coded `if name == flow-XX` branches. env "VAR=val" bash $flow is
semantically identical to the inline `VAR=val bash $flow` form for
child-only export. Validated end-to-end on spark1: flow-11/13/14
each populated their FLOW{11,13,14}_ARTIFACT_DIR receipt subdirs.
Also:
- Defensive ${var:-0} on FAIL/SKIP grep counters so a missing log
can't crash the [ "$var" -eq 0 ] arithmetic under set -e.
- Inlined the one-shot cleanup_default_stack_before_dual helper.
- Brief comment on why flow-10 sits between flow-07 and flow-08
(flow-08 header explicitly requires flow-10's Anvil facilitator).
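A minimal sketch of the two runner changes described above, assuming a `run_flow` signature of `<script> [env_name env_value]`. The helper name `count_marker` and the argument layout are hypothetical; the real runner may differ.

```bash
# Hypothetical sketch, not the project's actual runner code.

# run_flow <script> [env_name env_value]
# Runs the flow in a child bash. The optional variable is set for the
# child only, like the inline `VAR=val bash "$flow"` form.
run_flow() {
  local flow="$1" env_name="${2:-}" env_value="${3:-}"
  if [ -n "$env_name" ]; then
    env "${env_name}=${env_value}" bash "$flow"
  else
    bash "$flow"
  fi
}

# count_marker <marker> <logfile>
# Prints how many lines contain the marker. A missing log yields 0
# instead of an empty string, so a later [ "$n" -eq 0 ] stays valid.
count_marker() {
  local n
  n="$(grep -c -- "$1" "$2" 2>/dev/null || true)"
  printf '%s\n' "${n:-0}"
}
```

For example, `run_flow flows/flow-11-dual-stack.sh FLOW11_ARTIFACT_DIR "$receipts/flow-11"` would export the variable to that one child process, where `$receipts` is a placeholder for the artifact root.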
…terns
lib.sh's run_step_grep and poll_step_grep helpers both use `grep -qE`
(extended regex) — comment in the helper explicitly says callers may
use ERE quantifiers. But several callsites passed BRE alternation
(`\|`), which in ERE is an escaped literal `|` and so never matches.
The smoke run on spark1 surfaced this as cascading "pattern not found"
FAILs in flow-01 / flow-02 / flow-05 / flow-06 / flow-10, even though
the expected substrings were present in the CLI output.
Convert all 8 BRE call sites to ERE `|`. Also widen the patterns where
the CLI output has drifted since they were written:
- `obol network list` now prints `Local Nodes:` / `Remote RPCs:` /
`mainnet … chain=…` instead of network shortnames; the old
`ethereum|aztec` pattern would never have matched the current
output even with correct ERE syntax.
- `obol network status` now prints `eRPC Gateway Status` / `Pod:` /
upstream rows; added `Running` for completeness.
- flow-10 facilitator-pricing check now also accepts the
`configmap/x402-pricing configured` line that `obol sell pricing`
actually emits.
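The BRE/ERE difference can be reproduced in isolation. The sketch below uses a hypothetical `matches_ere` wrapper around `grep -qE` and a two-line sample shaped like the `obol network list` headings quoted above; it is not the lib.sh helper itself.

```bash
# Sample text shaped like the headings quoted in this commit message.
sample='Local Nodes:
Remote RPCs:'

# matches_ere <pattern> <text>
# Prints "match" or "no match" depending on grep -qE (extended regex).
matches_ere() {
  if printf '%s\n' "$2" | grep -qE -- "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

# BRE-style alternation under -E: "\|" is a literal pipe character.
matches_ere 'ethereum\|Remote RPCs' "$sample"

# ERE alternation: either branch may match.
matches_ere 'ethereum|Remote RPCs' "$sample"
```

With GNU grep the first call reports no match and the second reports a match, which is the failure mode the smoke run hit.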
flow-08: Drop the hard "agent wallet == deterministic Bob" assertion. Bob pre-seeding is the flow-11/13/14 dual-stack pattern; single-stack flow-08 uses whatever wallet `obol agent init` generated, and every downstream funding/signing step already uses $AGENT_WALLET directly. Keep BOB_WALLET derivation when REMOTE_SIGNER_PRIVATE_KEY is set and report a match for transparency, but PASS either way.
flow-07 + flow-10 x402-verifier readiness: Replace pod-counting loops with `kubectl rollout status deployment/x402-verifier`. The old loops counted every pod in the x402 namespace, so when stuck old ReplicaSets or the unrelated serviceoffer-controller sat in Pending (real condition observed on spark1 under host load), the loops never converged. rollout status is the authoritative readiness signal and only tracks the latest ReplicaSet. Bumped the timeout to 180s in both spots to absorb image-pull jitter on cold caches.
Two cold-start races surfaced on spark1's smoke run:
1. `obol stack up` returns once the k3s API responds, but the k3d node can take another few seconds to appear in `kubectl get nodes`. Old `run_step_grep "Nodes ready" "Ready" kubectl get nodes` raced that window and FAILed on "No resources found". Switched to `poll_step_grep` with 12×5s = 60s ceiling. Also tightened the pattern to ` Ready ` (surrounding spaces) so the word "NotReady" in the status column does not satisfy the match.
2. eRPC's HTTP listener becomes reachable before its upstream pool has fully resolved every alias. A one-shot GET /rpc moments after pods report Running often returns a partial list missing `base-sepolia`, which then cascades into the chains-OK and JSON-RPC checks. Poll the first eRPC assertion until base-sepolia appears (or 60s elapses); the subsequent assertions then have a stable list to reason about.
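The polling approach can be sketched as below. `poll_until_match` is a hypothetical stand-in for `poll_step_grep` (whose real signature is not shown in this PR), and the attempts/delay arguments are assumptions.

```bash
# Hypothetical stand-in for poll_step_grep; argument order is an assumption.
# poll_until_match <attempts> <delay_seconds> <ere_pattern> <command...>
# Re-runs the command until its output matches the pattern, or gives up.
poll_until_match() {
  local attempts="$1" delay="$2" pattern="$3" i=1
  shift 3
  while [ "$i" -le "$attempts" ]; do
    if "$@" 2>&1 | grep -qE -- "$pattern"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A call such as `poll_until_match 12 5 ' Ready ' kubectl get nodes` gives the 12×5s ceiling mentioned above. The surrounding spaces matter: a bare `Ready` also matches inside `NotReady`, while ` Ready ` only matches the standalone status word.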
PR #470 removed the hardcoded "0.001" default from the --price flag so that --per-mtok / --per-request can take effect when --price is not passed. The flag-defaults table test still asserted the old default "0.001" and broke on merge. Update it to expect empty string, matching the post-#470 behavior.
Switch the frontend image reference from tag-only ("v0.1.23") to
tag+digest ("v0.1.23@sha256:950b887e1cbaca9f928ff7b449b5602ed9777b629b4ee1b9c4c91fac2d74c2f2").
The tag stays for human readability; the digest is authoritative.
Eliminates the mutable-tag attack surface flagged as a non-blocking
follow-up by the supply-chain review of v0.10.0-rc2. Multi-arch index
digest covers linux/amd64 and linux/arm64.
Renders to a valid OCI reference via the obol/obol-app chart
"obol-app.image" helper (verified locally with helm template).
…tion' into integration/release-smoke-hardening-20260512
…n' into integration/release-smoke-hardening-20260512
…ifier-readiness' into integration/release-smoke-hardening-20260512
…art-polling' into integration/release-smoke-hardening-20260512
…-align' into integration/release-smoke-hardening-20260512
…1.23' into integration/release-smoke-hardening-20260512
…tes are on
flow-11, flow-14, and flow-13 each enforce a preflight that rejects the run unless OBOL_LLM_ENDPOINT points at an OpenAI-compatible vLLM / llama.cpp endpoint. Without this guard the runner spends ~30 min on the baseline flows before those three FAIL on their preflights, which makes release-smoke results look worse than they are.
Add a fail-fast guard in the runner: when RELEASE_SMOKE_INCLUDE_OBOL or RELEASE_SMOKE_INCLUDE_OBOL_FORK is true, OBOL_LLM_ENDPOINT must be set. Otherwise exit 2 with a copy-pasteable env block pointing at the QA references.
Surfaced by a spark1 dry-run where the dual-stack flows were guaranteed to FAIL because the runner inherited a host-Ollama context (no QA vLLM); three flows wasted setup work to fail at their preflight.
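A minimal sketch of such a guard, assuming the gates are the literal string `true`. The function name `require_llm_endpoint` and the message text are hypothetical; the actual runner prints an env block pointing at the QA references.

```bash
# Hypothetical sketch of the fail-fast guard; names and messages are assumptions.
# Returns 2 when either OBOL gate is on but OBOL_LLM_ENDPOINT is unset or empty.
require_llm_endpoint() {
  if [ "${RELEASE_SMOKE_INCLUDE_OBOL:-false}" = "true" ] ||
     [ "${RELEASE_SMOKE_INCLUDE_OBOL_FORK:-false}" = "true" ]; then
    if [ -z "${OBOL_LLM_ENDPOINT:-}" ]; then
      echo "release-smoke: OBOL_LLM_ENDPOINT must be set when OBOL gates are on" >&2
      echo "  export OBOL_LLM_ENDPOINT=<OpenAI-compatible vLLM / llama.cpp URL>" >&2
      return 2
    fi
  fi
  return 0
}
```

In a runner this would be called before any flow starts, for example `require_llm_endpoint || exit $?`, so the baseline flows never run when the later preflights are certain to fail.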
Anvil's default bind is 127.0.0.1, which makes it reachable from the
host (so flow-10's own health check passes) but NOT from inside the
k3d cluster via host.k3d.internal:8545 (Connection refused).
That mismatch was the root cause of the "Payment verification failed
503" we saw cascading through flow-08 / flow-11 / flow-14 / flow-13
in every smoke run on every host:
1. flow-10 starts anvil at 127.0.0.1:8545 + reports PASS (host-local
health check works).
2. eRPC's `custom-84532-0` upstream points at host.k3d.internal:8545
and silently times out on every base-sepolia call from cluster pods.
3. `buy.py` inside the obol-agent pod calls eRPC to read the buyer
wallet's balance; the eth_call times out with eRPC's
"network 'evm:84532' for project 'rpc'" error.
4. buy.py exits before creating the PurchaseRequest CR, so the
serviceoffer-controller never hot-adds paid/<model> into LiteLLM.
5. The smoke's paid inference step requests model
`paid/qwen3.5:9b` (or `paid/qwen36-fast`), LiteLLM has no such
model_group, and returns 503 wrapped as
`ServiceUnavailableError: Payment verification failed` — which
looks like a verifier rejection but is really "no model in
LiteLLM".
Validation:
- Anvil bound 0.0.0.0:8545 (confirmed via ss -tlnp).
- eth_chainId from host.k3d.internal:8545 inside an in-cluster pod
returned 0x14a34 (84532) — was Connection refused without the flag.
This single character + flag flip is what was breaking the entire
paid-commerce side of the smoke. Five flow FAILs trace back to it.
…verlay
All 12 references to `ghcr.io/x402-rs/x402-facilitator:1.4.7` updated to `ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.9`. This is the ObolNetwork-published fork of x402-rs 1.4.9 with the Prometheus metrics overlay applied. arm64+amd64 manifest available (verified via `docker manifest inspect`).
Updated in:
- flows/lib.sh (release-smoke shared image const)
- internal/testutil/facilitator_real.go (integration-test image)
- flows/flow-10-anvil-facilitator.sh
- flows/flow-12-obol-payment.sh
- flows/flow-13-dual-stack-obol.sh
- flows/README.md (operator docs)
- internal/openclaw/monetize_integration_test.go (4 doc comments)
Surfaced while root-causing the cluster Payment-verification-503 cascade across flow-08/11/14/13: the locally-built x402-buyer:latest image was 5 days stale (pre-PR #377 nested-format support), and pinning the facilitator to the current fork keeps the verify/settle path consistent with the buyer code as we re-validate the smoke.
…preflight
Two changes that close the loop on the release-smoke 503 investigation:
1. Pin x402-buyer image to a sha256 digest (multi-arch manifest list, amd64+arm64) in base/templates/llm.yaml. Tag-only pinning let the local-build path silently reuse a 5-day-old `:latest` image; that stale buyer pre-dated PR #377's nested-format support and serialized X-PAYMENT with empty authorization fields → facilitator /verify 400 → 503 cascade across flow-08 / flow-11 / flow-14 / flow-13. Same pattern the verifier and serviceoffer-controller already use (they were @sha256-pinned before commit e157b10 dropped them to short-SHA tags).
2. Add a fail-fast preflight in flow-10 that probes anvil from inside the k3d cluster via host.k3d.internal:8545. Host-local health checks passing is necessary but not sufficient: default anvil binds to 127.0.0.1 only and pods get Connection refused, which silently broke every paid-flow run on every host until the spark1 investigation surfaced it. Now the next operator that forgets `--host 0.0.0.0` sees a clear failure naming the root cause instead of chasing a misleading 503 through buy.py / verifier / facilitator.
Upstream ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.9 ships an amd64 binary inside its arm64 manifest variant (cross-build packaging bug in ObolNetwork/x402-rs). Pulling on arm64 hosts produces 'exec format error' on first ENTRYPOINT exec, which manifests as flow-10 FAIL + 503 cascade across flow-08/11/14/13. Add X402_FACILITATOR_SKIP_PULL=true + X402_FACILITATOR_IMAGE override so QA hosts can build the overlay locally and run release-smoke without touching the broken registry image. The upstream Dockerfile fix is in flight; remove these knobs once the registry image is correct for arm64.
Free-tier Base Sepolia RPCs (drpc.org, sepolia.base.org) routinely return HTTP 408 'Request timeout on the free tier' under release-smoke load. Symptoms in recent runs:
- flow-11 step 8: 'Could not read Alice starting USDC balance' (single cast call, no retry, hit transient rate limit)
- flow-13 step 9: 'Facilitator did not become reachable' (anvil fork origin throttled while syncing chain state)
- flow-14 balance reads intermittently empty
The deterministic Bob wallet also gets drained across smoke runs that complete any paid commerce step (each successful purchase spends from Bob), so flow-14 'Bob signer OBOL balance 0 wei is below threshold' required a manual top-up between runs.
Four small changes:
1. base_sepolia_rpc_candidates() honors ALCHEMY_BASE_SEPOLIA_API_KEY and prepends https://base-sepolia.g.alchemy.com/v2/<key> when set. The full URL is only emitted to the RPC call itself; logs use the existing redact_url_for_log helper.
2. release-smoke runner adds warn_unpaid_base_sepolia_rpc() preflight that loudly prints which steps are likely to flake when OBOL/OBOL_FORK gates are on without a paid RPC.
3. fund_bob_from_alice_if_needed() in lib.sh: ERC-20 transfer (required - balance) from Alice (SIGNER_KEY) to Bob when Bob's token balance is below the per-flow required threshold. Token-agnostic; wired for USDC in flow-11 and OBOL in flow-14.
4. flow-11 wraps the Alice starting-USDC cast call in the same 5-retry pattern Bob USDC + Alice ETH already had.
The runner now pipes every flow's stdout+stderr through scrub_secrets before both the terminal and the on-disk artifact log. Caught today when 'obol network add base-sepolia --endpoint $BASE_SEPOLIA_RPC' echoed the full RPC URL with the embedded Alchemy API key into flow-11-dual-stack.log:
==> Adding custom RPC for base-sepolia (chain ID: 84532): https://base-sepolia.g.alchemy.com/v2/<KEY>
Patching individual call sites would have left every other CLI line with a similar bug as a leak; the runner-level filter is single-place and covers anything new.
Patterns redacted:
- alchemy.com/v2/<key>
- infura.io/v3/<key>
- quiknode.pro/<token>
- any URL query string ?dkey=<token> (drpc paid)
- PRIVATE_KEY=0x<64 hex> (anchored on the literal)
The patterns are additive: anything we ever want to keep out of QA logs goes here.
Add lb.drpc.live/<network>/<token> redaction pattern alongside the existing Alchemy/Infura/QuickNode patterns. drpc paid endpoints carry the token in the URL path (not as ?dkey= query) so the existing ?dkey= scrubber would miss them.
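A stream filter covering the patterns listed in these two commits could look roughly like the following. This is a sketch under assumptions: `scrub_secrets_sketch` is a hypothetical name, the character classes for keys and tokens are guesses, and the real `scrub_secrets` in flows/lib.sh may be written differently.

```bash
# Hypothetical sketch of a runner-level redaction filter (stdin -> stdout).
# Token character classes are assumptions, not taken from flows/lib.sh.
scrub_secrets_sketch() {
  sed -E \
    -e 's#(alchemy\.com/v2/)[A-Za-z0-9_-]+#\1[REDACTED]#g' \
    -e 's#(infura\.io/v3/)[A-Za-z0-9_-]+#\1[REDACTED]#g' \
    -e 's#(quiknode\.pro/)[A-Za-z0-9_-]+#\1[REDACTED]#g' \
    -e 's#(lb\.drpc\.live/[A-Za-z0-9_-]+/)[A-Za-z0-9_-]+#\1[REDACTED]#g' \
    -e 's#([?&]dkey=)[^&[:space:]]+#\1[REDACTED]#g' \
    -e 's#(PRIVATE_KEY=)0x[0-9a-fA-F]{64}#\1[REDACTED]#g'
}
```

Piping a flow through it, for example `bash "$flow" 2>&1 | scrub_secrets_sketch | tee "$log"`, keeps the key out of both the terminal and the artifact log. Note that the drpc path form needs its own rule because the token is a path segment rather than a `dkey=` query parameter.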
…acilitator
The public facilitator at x402.gcp.obol.tech ships /supported with two
distinct entries for Base Sepolia:
{"x402Version":1, "network":"base-sepolia"} (legacy name)
{"x402Version":2, "network":"eip155:84532"} (CAIP-2)
Our verifier sends the /verify body with X402Version=2 (line 239 of
forwardauth.go) but was passing offer.Spec.Payment.Network straight
through as RouteRule.Network — i.e. the literal string the operator
typed on the CLI ("base-sepolia"). The mismatch (v2 envelope +
v1 network string) hit the v2 strict-match path in the facilitator
and returned HTTP 400 invalid_format:
x402: facilitator verify error: facilitator verify failed (400): invalid_format
Local facilitators (single-chain anvil fork) were lenient enough to
accept the loose form, which is why flow-08 PASS but flow-11/14 FAIL.
Fix: normalize through the existing public x402.NormalizeNetworkID
helper (chains.go:151) before stamping the RouteRule. That helper
returns CAIP-2 form for known names and passes already-CAIP-2 values
through unchanged, so this is no-op for operators who already use
"eip155:84532" but rescues every operator who used the friendly name.
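The normalization behaviour described above can be illustrated in shell. This is only a sketch: the real helper is the Go function x402.NormalizeNetworkID, `normalize_network_id` is a hypothetical name, only the `base-sepolia` mapping is taken from this PR, and rejecting unknown names is an assumption.

```bash
# Hypothetical shell illustration of the CAIP-2 normalization idea.
# Only the base-sepolia mapping comes from this PR; the real Go helper
# covers more networks and may treat unknown names differently.
normalize_network_id() {
  case "$1" in
    eip155:*)     printf '%s\n' "$1" ;;            # already CAIP-2: pass through
    base-sepolia) printf '%s\n' 'eip155:84532' ;;  # friendly name -> CAIP-2
    *)            echo "unsupported network: $1" >&2; return 1 ;;
  esac
}
```

Both `normalize_network_id base-sepolia` and `normalize_network_id eip155:84532` print `eip155:84532`, so an x402Version 2 envelope always carries the CAIP-2 form regardless of what the operator typed.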
Temporary instrumentation to surface the exact wire payload during the post-pivot 503 investigation. Public x402.gcp.obol.tech 1.4.9 overlay returns 400 invalid_format even after the CAIP-2 network normalization fix (fd95dc5), so we need to see the actual JSON the verifier sends to find what else the facilitator rejects. Also adds TestRoutesFromStore_NetworkCAIP2Normalization which confirms the fd95dc5 fix correctly maps legacy network names to CAIP-2 ('base-sepolia' -> 'eip155:84532') in the published RouteRule. Remove both log.Printf calls once the root cause is identified.
Before bash trap teardown wipes the k3d clusters, dump the Alice/Bob verifier + buyer logs and the ServiceOffer/PurchaseRequest CR state into FLOW11_ARTIFACT_DIR/flow11-step43-debug/. Without this snapshot, kubectl logs returns 'connection refused' immediately after the FAIL line lands and the /verify outbound body + facilitator response are lost.
Surface which field (network/payTo/asset/amount/auths) rejects the payment requirement so we can pinpoint why the buyer falls through to forwarding-without-X-PAYMENT after the CAIP-2 normalization fix. Remove once root cause identified.
The buyer was pinned in llm.yaml as 'x402-buyer:b13254e@sha256:446d...'. The old dev-rewrite regex matched only the ':b13254e' part, leaving the '@sha256:<digest>' suffix intact. Docker honors the digest over the tag on a pull, so the local-build path was silently bypassed: every 'OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=x402-buyer' run was a no-op against the registry image.
Effect: source changes to internal/x402/buyer/* never reached the running sidecar in OBOL_DEVELOPMENT=true mode (root cause of the missing CanSign DENY logs during flow-11 step 43 chase). Verifier+controller weren't affected because llm.yaml was the only file using the combo form.
Fix: prepend a ':<tag>@sha256:<digest>' alternative to the regex so the engine consumes the whole pin before falling through to the shorter ':<tag>' or '@sha256:<digest>' alternatives. Verified against all four pin shapes (combo / digest-only / tag-only / :latest no-op).
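The bug and the fix can be reproduced with sed as a rough analogue. The function names, the fixed `x402-buyer` image name, and the 7-to-40 hex-digit tag length are assumptions for illustration; the project's rewrite lives in Go (internal/defaults), not in sed.

```bash
# Hypothetical sed analogue of the dev-rewrite; not the project's Go regex.

# Old behaviour: only the short-SHA tag is rewritten, the digest survives.
rewrite_pin_tag_only() {
  sed -E 's#(x402-buyer):[0-9a-f]{7,40}#\1:latest#'
}

# Fixed behaviour: the combo form is listed first and consumed whole.
# POSIX sed picks the longest match regardless of order; in a leftmost-first
# engine such as Go's regexp, listing the combo arm first is what matters.
rewrite_pin_to_latest() {
  sed -E 's#(x402-buyer)(:[0-9a-f]{7,40}@sha256:[0-9a-f]{64}|@sha256:[0-9a-f]{64}|:[0-9a-f]{7,40})#\1:latest#'
}
```

Feeding a combo pin through `rewrite_pin_tag_only` yields `x402-buyer:latest@sha256:<digest>`, which still resolves by digest, while `rewrite_pin_to_latest` yields a clean `x402-buyer:latest` and leaves an existing `:latest` reference untouched.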
The CLI was echoing the full --endpoint URL when adding a custom RPC, which leaked Alchemy/Infura/drpc/QuickNode API keys into release-smoke logs (flow-11 captured 'https://lb.drpc.live/base-sepolia/<token>' in plaintext today). The release-smoke runner has a scrub_secrets filter that catches this at log-write time, but standalone flow invocations and direct interactive use don't go through that filter. Move redaction into the CLI itself so the leak surface narrows to: - the underlying eRPC ConfigMap (cluster-internal, RBAC-gated) - the dev's shell history of the original 'obol network add ...' call Patterns covered (mirror flows/lib.sh scrub_secrets so coverage stays in lockstep across the two layers).
To pinpoint why the previous run's verifier log shows only middleware WARNING entries (no /verify outbound body, no /verify response status). Adds a startup-of-handler log so we can tell whether the buyer is forwarding requests without X-PAYMENT (middleware short-circuits to 402 before reaching the facilitator) or whether the verifier never sees the request at all. Remove once root cause is identified.
Locks in the fix from 1efbaab so a future contributor doesn't accidentally regress the rewrite back to the shorter ':<hex>'-only regex. The combo form 'image:tag@sha256:digest' must be rewritten to a clean ':latest' with NO orphan '@sha256:...' suffix, otherwise Docker honors the digest and silently bypasses the local build (root cause of the missing debug-log saga during flow-11 step 43 chase, May 2026).
…EVELOPMENT
EnsureVerifier read the embedded x402.yaml verbatim and 'kubectl apply'-ed it, overwriting whatever helmfile sync had deployed with the embedded ':b13254e' image pins. Under OBOL_DEVELOPMENT=true this silently bypassed the local-build path: the rewrite-to-:latest that the defaults pipeline performs on the on-disk file copy doesn't reach the embedded bytes EnsureVerifier loads.
Effect on flow-11 step 43 chase: Alice's verifier pod ran ':b13254e' (stale registry image) despite the on-disk x402.yaml saying ':latest' and OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=true producing a fresh local binary. The fresh debug logs I added to forwardauth/verifier source were in the local binary but never reached the deployed pod, which is why hours of debug-instrumentation runs produced zero new diagnostic output. The cluster was running 5-day-old code.
Fix: apply the same rewrite-to-:latest patterns to the in-memory manifest before kubectl-apply, gated by OBOL_DEVELOPMENT=true. The patterns mirror internal/defaults.rewriteDevDigestPins (duplicated locally to avoid an import cycle); a TestX402Manifest_DevModeRewritesPins regression test locks the parity.
Production behavior unchanged: without OBOL_DEVELOPMENT=true the embedded immutable pins are kubectl-applied verbatim.
The previous strict match for 'remaining=5' fails when the buyer's auth pool ends up at 10 (controller merge on rerun of the PurchaseRequest, or pre-existing auths from a prior flow-14 attempt). The semantic we actually want to verify is 'buy provisioned at least the requested count', not 'exactly the requested count'.
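The relaxed check amounts to a numeric comparison instead of a string match. A sketch, assuming the status text contains a `remaining=<n>` token as quoted above; the helper name `auths_at_least` is hypothetical.

```bash
# Hypothetical sketch of an "at least N" assertion on the auth pool.
# auths_at_least <requested> <status_text>
# Succeeds when the first remaining=<n> value in the text is >= requested.
auths_at_least() {
  local requested="$1" remaining
  remaining="$(printf '%s\n' "$2" | sed -nE 's/.*remaining=([0-9]+).*/\1/p' | head -n 1)"
  # A missing counter is treated as 0, so the check fails rather than erroring.
  [ "${remaining:-0}" -ge "$requested" ]
}
```

With this shape, `auths_at_least 5 'remaining=10'` passes after a controller merge doubles the pool, whereas the old exact match on `remaining=5` would have failed.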
fd95dc5 normalized RouteRule.Network to CAIP-2 form (eip155:84532) before this resolver was taught to map CAIP-2 ids back to ChainInfo. The pre-resolution at verifier.go:53 then failed silently: ResolveChainInfo('eip155:84532') returned 'unsupported chain', the chain registry never gained an entry for the CAIP-2 key, and matchPaidRouteFull returned (..., false) on every paid request — 404 where the smoke test expected 402. flow-11 step 21 "Alice: 402 gate" went from passing to failing. Add each CAIP-2 id as an additional alias in the case arms so both legacy names and CAIP-2 ids resolve to the same ChainInfo. Also extend the unsupported-chain error message to mention CAIP-2 acceptability.
The previous '--prune-history 1000000' was added under the
misunderstanding that it requested 1M blocks of state retention. anvil's
docs state the opposite:
--prune-history [<PRUNE_HISTORY>]
Don't keep full chain history. If a number argument is specified,
at most this number of [states are retained].
So passing the flag ENABLES pruning. With it set, anvil pruned the
fork-block state and the local x402-rs facilitator's eth_getStorageAt
for the EIP-3009 nonce slot returned HTTP 500
'state at block #N is pruned', failing flow-08 step 12 paid-inference
with a misleading 'Payment verification failed' 503.
Drop the flag entirely. Without it, anvil keeps full history from the
fork block onward, which is what flow-08's facilitator path expects.
Update the existing-anvil reuse check + the historical-state assertion
message to match.
The first POST to /services/<name>/* immediately after the verifier deployment becomes Ready can return an empty body / Bad Gateway from Traefik because the in-cluster HTTPRoute is wired but the verifier's serviceoffer-source watcher hasn't yet loaded the route into its in-memory map. Subsequent requests are fine. Wrap both step assertions in a 12x5s retry loop that breaks the moment the response parses as JSON.
Outcome of the multi-day chase that took the smoke from 'flow-11 step 43 broken across the board' to 13/13 PASS on spark1. Documents the eight distinct root causes, the matching commits in this branch, and the out-of-scope follow-ups (upstream x402-rs multi-arch fix, debug-log cleanup, OBOL_DEVELOPMENT deployment-story doc).
Three CodeQL findings introduced by this branch:
1. internal/x402/verifier.go:165 (go/log-injection, error): Debug log printed r.URL.Path (user-controlled) without sanitization. The log was added during the EnsureVerifier root-cause hunt; the root cause is now identified and fixed in 5a10fb8, so the log can come out cleanly.
2-3. cmd/obol/network.go:585-586 (go/regex/missing-regexp-anchor x2): redactRPCURL used unanchored substring regexes (alchemy.com/v2/..., infura.io/v3/..., quiknode.pro/..., lb.drpc.live/.../..., dkey=...) to find tokens to scrub. CodeQL flagged this as risky because an attacker-controlled URL like https://evil/?u=https://alchemy.com/v2/key would only redact the embedded form, not the host context.
Refactored to:
- Parse the input with net/url
- Match against u.Host (host anchor, not substring)
- Substitute the redaction in-place on the raw string so the original [REDACTED] characters and query parameter order are preserved (avoids u.String() percent-encoding [/])
TestRedactRPCURL still passes against the same eight cases.
Per follow-up review: the previous redaction surfaced the full path prefix (e.g. https://base-sepolia.g.alchemy.com/v2/[REDACTED]) which still leaks the network selector and provider plan tier in logs. Tighten both the CLI redactor and the runner-level scrubber so paid-RPC URLs collapse all the way down to the provider's registered domain:
https://base-sepolia.g.alchemy.com/v2/abc123 -> https://[REDACTED].alchemy.com/[REDACTED]
https://lb.drpc.org/ogrpc?network=base-sepolia&dkey=XYZ -> https://[REDACTED].drpc.org/[REDACTED]?[REDACTED]
Subdomain, path, query, and fragment are all redacted; scheme + registered domain are preserved so the operator can still tell which provider they pointed at without leaking which network, instance, or api key.
Recognized provider domains: alchemy.com, infura.io, quiknode.pro, drpc.live, drpc.org. Hosts that don't match are returned unchanged.
cmd/obol/network.go::redactRPCURL and flows/lib.sh::scrub_secrets stay in lockstep: both updated, with matching test coverage in cmd/obol/redact_rpc_url_test.go.
Cuts net 748 lines (1494 deletions, 746 insertions) by collapsing
overlap and deleting OpenClaw-era deep dives that no longer apply now
that Hermes is the default runtime.
Restructure
- SKILL.md now leads with a 9-row "Hard-Won Lessons" lookup table for
the release-smoke 2026-05-13 root causes (dev-rewrite bypass, CAIP-2
resolver mismatch, anvil flag misuse, combo-form regex, anvil bind
scope, public facilitator pivot, free-tier RPC throttling, first-
request flake, x402-rs arm64 packaging). Each row points at the
symptom, the file, and the commit that fixed it.
References (collapsed 10 → 7)
- dev.md ← merged dev-environment + obol-cli, condensed CLI surface
- llm-routing.md ← merged litellm-routing + qa-model-envs + overlay-generation,
cut OpenClaw-specific overlay walkthroughs
- paid-flows.md ← merged live-obol-qa + paid-commerce, sharpened success
criteria + Bob derivation
- release-smoke-debugging.md (NEW) — symptom → root-cause catalog with the
"first steps when smoke goes red" checklist
- remote-qa.md ← unchanged
- integration-testing.md ← trimmed prose, kept matrix + timing budgets
- troubleshooting.md ← cut OpenClaw stack traces, added CA-bundle and
PurchaseRequest-stuck recipes
Scripts (NEW)
- scripts/derive-bob.sh — Bob deterministic 2nd-derived wallet + balance
- scripts/check-deployed-images.sh — first diagnostic when smoke breaks
- scripts/repopulate-x402-cabundle.sh — manual CA bundle repopulate
Goal: future runs through this skill spend tokens on work, not re-deriving
facts the previous run already paid for.
The three scripts under .agents/skills/obol-stack-dev/scripts/ were parallel implementations of logic that already lives in flows/lib.sh or that the markdown documents inline:
- derive-bob.sh: duplicated the cast-keccak derivation already shown in references/paid-flows.md; flows/lib.sh::fund_bob_from_alice_if_needed is the version any flow actually consumes.
- repopulate-x402-cabundle.sh: the kubectl create-configmap recipe is already inline in references/troubleshooting.md, and obol stack up + obol sell http call x402verifier.PopulateCABundle automatically.
- check-deployed-images.sh: five kubectl get lines already documented inline in references/release-smoke-debugging.md "First Steps".
Net: -121 lines, no information lost, no external snippets to drift out of sync with flows/lib.sh or the actual code paths.
Skill SKILL.md "Editing This Skill" updated to make the rule explicit: inline shell snippets in markdown; do not ship parallel implementations.
Six fixes flagged during the obol-stack-dev skill rebuild:
CLI surface
- model: was "setup, status"; now lists "setup, setup custom, prefer,
sync, list, remove, status" (setup custom is the canonical path
for OBOL_LLM_ENDPOINT smoke flows)
- agent: add "auth --runtime <runtime> obol-agent" as the canonical
Bearer-token command
- hermes: mark legacy "token" subcommand as not-recommended (it can
print CLI usage text and poison the Bearer); point to `obol agent
auth` instead
Build/test
- Differentiate the legacy local internal/openclaw integration matrix
(Ollama-driven) from the release-gate flows 11/13/14 which require
OBOL_LLM_ENDPOINT against vLLM/llama.cpp. Add the canonical
release-smoke invocation.
Pitfalls (release-smoke 2026-05-13 root causes)
- Add "first diagnostic" header: confirm the deployed image with
kubectl get deploy -o jsonpath='{...image}' before reading verifier
code. A 503 from the verifier is almost never a real verifier bug.
- Rename pitfall 2 from openclaw-monetize-binding to the Hermes
default obol-agent-monetize-rbac (legacy name kept as a footnote).
- New pitfall 9: EnsureVerifier overwrites helmfile under
OBOL_DEVELOPMENT=true (the multi-hour bug — 5a10fb8).
- New pitfall 10: CAIP-2 vs legacy chain id mismatch in chains.go
(silent 404 on every paid request).
- New pitfall 11: anvil --prune-history is enable-pruning, not
retention; also --host 0.0.0.0 for cluster reachability.
- New pitfall 12: combo-form image-pin regex (:tag@sha256:digest must
be the longest alternation arm).
- New pitfall 13: free-tier Base Sepolia RPC throttling under
release-smoke load; BASE_SEPOLIA_RPC / ALCHEMY_BASE_SEPOLIA_API_KEY
env support; secret scrubbing collapses paid-RPC URLs to the
provider's registered domain.
- New pitfall 14: first-request flake on freshly-deployed verifier
(12x5s retry in flow-07/08).
- Pointer to the fuller catalog at
.agents/skills/obol-stack-dev/references/release-smoke-debugging.md.
Personal-data hygiene
- Replace hardcoded /Users/bussyjd/... checkout paths in "Related
Codebases" with repository slugs + env override variables. Mention
CLAUDE.local.md for users who want absolute paths checked in
locally.
This was referenced May 13, 2026
bussyjd added a commit that referenced this pull request May 13, 2026
Phase 7 of the post-#490 integration plan. Documents five attempts of the new flows/buy-external.sh harness against the production seller at https://inference.v1337.org/services/aeon as Bob (the deterministic 2nd-derived wallet from .env REMOTE_SIGNER_PRIVATE_KEY).
Outcome:
- Probe + 402 + accepts validation: green
- Permit2 authorization signing: green (1 auth for 0.023 OBOL)
- PurchaseRequest CR creation in hermes-obol-agent: green
- Controller reconciliation: BLOCKED. observedGeneration never advanced past 0; kubectl exec session SIGKILLed (exit 137). Hypothesis: the serviceoffer-controller does not reconcile PRs whose endpoint does not match any in-cluster ServiceOffer.spec.upstream (external-seller mode is unimplemented).
- Settlement: not attempted (gated on PR Ready). Bob OBOL balance unchanged across the test (4.978 OBOL).
Surfaced and fixed four bugs along the way:
- k3d 32-char cluster-name cap (7554b5e)
- CAIP-2 vs legacy chain id mismatch in the harness probe (3d8e231)
- Cloudflare WAF blocking Python-urllib UA in buy.py (c2dddc1)
- Stale .build/obol vs freshly-rebuilt .workspace/bin/obol (operator-level fix, not committed code; follow-up #2 in the report)
Includes follow-ups for the controller-side external-seller gap, the harness binary-path normalization, a CF-WAF UA note for the troubleshooting reference, and adding a KEEP_CLUSTER_ON_FAIL knob so the next external-seller failure has live diagnostic state to inspect.
Summary
What changed: Single integration branch that brings six reviewed PRs onto current `main` and hardens the release-smoke runner + flows against the regressions surfaced by the spark1 dry-runs. Supersedes #476, #477, #478, #479, #483, #484 (and folds in the closed #468 frontend digest-pin attempt).
Why it matters: All six changes had to be validated together against the full release-smoke gate before merging: the runner refactor (#476), the four flow fixes (#477, #478, #479, #483), and the frontend pin (#484) are interdependent for a green smoke. Bundling them gives a single, reviewable supersede commit with one full end-to-end validation, plus eight additional root-cause fixes uncovered while driving the gate green (full retrospective in `plans/release-smoke-hardening-20260513.md`).
Risk level: medium: runner mechanics, flow scripts, one test default, one image digest, plus the verifier in-memory rewrite (`internal/x402/setup.go`) and CAIP-2 chain resolver (`internal/x402/chains.go`) which are runtime-path. Each carries a regression test.
Commit under test: `4082961`
Base branch: `main`
Scope
Validation
CI checks:
Unit tests:
Integration tests:
Flow tests:
obol-stack-qa-20260512-195006-rs-with-pin
release-smoke-20260513-finalgreen-1-artifacts/
flow-01-prerequisites.log
flow-02-stack-init-up.log
flow-03-inference.log
flow-04-agent.log
flow-05-network.log
flow-06-sell-setup.log
flow-07-sell-verify.log
flow-10-anvil-facilitator.log
flow-08-buy.log
flow-09-lifecycle.log
x402.gcp.obol.tech
flow-11-receipts/
x402.gcp.obol.tech
flow-14-receipts/
flow-13-receipts/
Release smoke:
Live Chain Evidence
Network: Base Sepolia (eip155:84532) and Anvil fork of the same
RPC/provider: paid drpc.live load-balancer for Base Sepolia (`lb.drpc.live/base-sepolia/[REDACTED]`, supplied via the `BASE_SEPOLIA_RPC` env). The in-cluster eRPC gateway routes `mainnet`, `base`, `base-sepolia`, `hoodi` per the embedded networks.
Facilitator: `https://x402.gcp.obol.tech` (production overlay 1.4.9, post `obol-infrastructure#2612`); `host.k3d.internal:8404` for the local x402-rs in flow-10/13.
Contracts and tokens:
Wallet roles:
- `REMOTE_SIGNER_PRIVATE_KEY` (`.env`)
- `obol-infrastructure`
Balances and deltas: per-flow `*-receipts/` directories under `release-smoke-20260513-finalgreen-1-artifacts/` capture before/after balances for USDC and OBOL on each paid call. `flow-11-dual-stack.sh` and `flow-13-dual-stack-obol.sh` assert exact `PAID_AMOUNT` deltas in both directions and the run is green, which is the source of truth.
Transaction receipts:
- `flow-11-receipts/registration-receipt.json`
- `flow-11-receipts/metadata-receipt.json`
- `/.well-known/agent-registration.json`
- `flow-11-dual-stack.log` (settle tx hash)
- `flow-14-live-obol-base-sepolia.log`
- `flow-13-receipts/receipt-summary.json`
Runtime Evidence
QA environment:
- `v1.35.1-k3s1`
- `1.5.1-stable`
- `qwen36-fast` via OpenAI-compatible vLLM on `127.0.0.1:8000`
Images:
- `ghcr.io/obolnetwork/obol-stack-front-end`: `v0.1.23@sha256:950b887e1cbaca9f928ff7b449b5602ed9777b629b4ee1b9c4c91fac2d74c2f2` (c3a4bba)
- `ghcr.io/obolnetwork/x402-verifier`: `OBOL_DEVELOPMENT=true` (rewritten from embedded pin via 5a10fb8 + 1efbaab)
- `ghcr.io/obolnetwork/serviceoffer-controller`
- `ghcr.io/obolnetwork/x402-buyer` (`:tag@sha256:digest`)
- `ghcr.io/obolnetwork/litellm-fork`: `sha-c16b156`
- `ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay`: `1.4.9`
Kubernetes / stack:
- `erpc`, `llm`, `x402`, `hermes-obol-agent`, `obol-frontend`, `monitoring`, `traefik` (Alice + Bob mirrored in their own workspaces)
- `kubectl rollout status deploy/x402-verifier -n x402 --timeout=180s` (per #478); flow-02 polls until Nodes Ready and eRPC `/rpc`
Model and routing:
- `paid/qwen36-fast` (host vLLM via `obol model setup custom`)
- `paid/* → openai/* → http://127.0.0.1:8402/v1` plus per-PurchaseRequest hot-add
- `obol agent auth --runtime hermes obol-agent` (skill invariant)
Artifacts and logs:
- `release-smoke-20260513-finalgreen-1-artifacts/RELEASE_REPORT.md` (spark1)
- `release-smoke-20260513-finalgreen-1-artifacts/flow-XX-NAME.log`: `tee` output, paid-RPC tokens scrubbed
- `flow-11-receipts/`
- `flow-14-receipts/`
- `flow-13-receipts/`
- `plans/release-smoke-hardening-20260513.md`
Demo readiness:
- `/.well-known/agent-registration.json` reachable via tunnel
- `buy.py probe` returns 402 pricing; flow-11 step 38
- `obol buy inference` discovers Alice via ERC-8004
- `paid/qwen36-fast` returns 200 (flow-11/14/13 paid call steps)
Review Notes
Superseded PRs (close on merge of this branch):
- 128f544
- 04613d7 (`lib.sh`)
- d5ba18e
- 80b17e6
- 756aeb5
- c3a4bba
Already merged to `main`, listed for completeness only: #466, #469, #470, #471, #472.
Known gaps:
- `debug(...)` log statements in the verifier and buyer (66d72c2, 2045198, 5ec24f5, 3dd7cc9) and the flow-11 cluster-state diag dump are intentionally left in this PR: they were the only thing that surfaced the `EnsureVerifier` root cause and remain useful operational signal. Each commit message marks them "Remove once root cause identified". Suggest stripping in a follow-up after one release cycle.
- `ObolNetwork/x402-rs` arm64 prometheus-overlay packaging bug (1.4.9 arm64 manifest contains an amd64 binary). Local fix prepared at `$CLAUDE_JOB_DIR/x402-rs-fix` on branch `fix/multiarch-overlay-arm64`; awaits explicit user authorization to push. Workaround in this branch: `X402_FACILITATOR_SKIP_PULL=true` + locally-built arm64 image on QA hosts.
- (`ALCHEMY_BASE_SEPOLIA_API_KEY`/`BASE_SEPOLIA_RPC`) and the `warn_unpaid_base_sepolia_rpc` preflight, but the underlying free-tier flakiness remains for users without a paid key.
Follow-ups:
- `.claude/skills/obol-stack-dev/` reference page on the `helmfile vs EnsureVerifier` deployment dual-path so this footgun doesn't bite again.
- `lib.sh` assertion helper docs could explicitly call out "patterns are ERE; do not use BRE `\|`". The integration adds the conversions but the helper signature hasn't changed.
Reviewer focus:
- `internal/x402/setup.go` (5a10fb8): the in-memory `kubectl apply` rewrite is the key correctness change; please confirm the `OBOL_DEVELOPMENT` gate keeps production behaviour identical. Test in `internal/x402/manifest_devmode_test.go`.
- `internal/x402/chains.go` (3cc2e7e): `ResolveChainInfo` now accepts both legacy and CAIP-2; please confirm no other code path expects strict legacy-only.
- `internal/defaults/defaults.go` (1efbaab): combo-form regex; the test in `internal/defaults/defaults_test.go` (5764ad4) exercises all four pin shapes (longest-first alternation matters).
- `flows/flow-10-anvil-facilitator.sh` (86588aa): anvil flag change; the reusing-existing-anvil branch now rejects any anvil that carries `--prune-history`.
- `flows/release-smoke.sh` runner refactor (#476 "refactor(flows): collapse per-flow env injection in release-smoke runner", 128f544): verify `env "$env_name=$env_value" bash $flow` is semantically identical to the prior inline `VAR=val bash $flow` form. Receipt-dir population end-to-end on spark1 proves it.
- `flows/flow-07-sell-verify.sh` + `flows/flow-08-buy.sh` 402-body retry (b46f5d9): confirms first-request flake on freshly-deployed verifier is absorbed by the 12×5s loop.
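For the runner-refactor item above, a reviewer can check the claimed equivalence locally without a cluster. The snippet below is a minimal sketch, not the code in `flows/release-smoke.sh`: `run_flow_sketch`, the temporary stand-in flow script, and the `/tmp/flow-11-receipts` value are all hypothetical names chosen for illustration. It compares the inline `VAR=val bash $flow` form with the `env "$env_name=$env_value" bash $flow` form and confirms the variable never leaks into the parent shell.

```shell
set -eu

# Hypothetical stand-in for a flow script: prints what the child process sees.
flow="$(mktemp)"
printf '%s\n' 'echo "${FLOW11_ARTIFACT_DIR:-unset}"' > "$flow"

# Make sure the parent shell starts without the variable.
unset FLOW11_ARTIFACT_DIR

# Illustrative helper (NOT the real run_flow): optional env_name/env_value args.
run_flow_sketch() {
  script="$1"
  env_name="${2:-}"
  env_value="${3:-}"
  if [ -n "$env_name" ]; then
    # Child-only export via env(1).
    env "$env_name=$env_value" bash "$script"
  else
    bash "$script"
  fi
}

# Prior inline form: assignment prefix scoped to the child process only.
inline_out="$(FLOW11_ARTIFACT_DIR=/tmp/flow-11-receipts bash "$flow")"
# Refactored form, routed through the helper.
env_out="$(run_flow_sketch "$flow" FLOW11_ARTIFACT_DIR /tmp/flow-11-receipts)"
# Flow invoked without any injected variable.
plain_out="$(run_flow_sketch "$flow")"
# The parent shell should still not have the variable set.
parent_after="${FLOW11_ARTIFACT_DIR:-unset}"

echo "inline=$inline_out env=$env_out plain=$plain_out parent=$parent_after"
rm -f "$flow"
```

Both forms should print the same directory from the child while the parent stays `unset`. One difference worth keeping in mind is that the `env` form builds the variable name at run time, so it only matches the inline form when `env_name` is a valid identifier.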