Skip to content

release-smoke: green 13/13 (supersedes #476 #477 #478 #479 #483 #484)#490

Merged
bussyjd merged 39 commits into
mainfrom
integration/release-smoke-hardening-20260512
May 13, 2026
Merged

release-smoke: green 13/13 (supersedes #476 #477 #478 #479 #483 #484)#490
bussyjd merged 39 commits into
mainfrom
integration/release-smoke-hardening-20260512

Conversation

@bussyjd
Copy link
Copy Markdown
Collaborator

@bussyjd bussyjd commented May 13, 2026

Summary

What changed: Single integration branch that brings six reviewed PRs onto current main and hardens the release-smoke runner + flows against the regressions surfaced by the spark1 dry-runs. Supersedes #476, #477, #478, #479, #483, #484 (and folds in the closed #468 frontend digest-pin attempt).

Why it matters: All six changes had to be validated together against the full release-smoke gate before merging — the runner refactor (#476), the four flow fixes (#477, #478, #479, #483), and the frontend pin (#484) are interdependent for a green smoke. Bundling them gives a single, reviewable supersede commit with one full end-to-end validation, plus eight additional root-cause fixes uncovered while driving the gate green (full retrospective in plans/release-smoke-hardening-20260513.md).

Risk level: medium — runner mechanics, flow scripts, one test default, one image digest, plus the verifier in-memory rewrite (internal/x402/setup.go) and CAIP-2 chain resolver (internal/x402/chains.go) which are runtime-path. Each carries a regression test.

Commit under test: 4082961

Base branch: main

Scope

  • Code
  • Charts / manifests
  • Flows / QA scripts
  • Docs / skills
  • Images / dependencies
  • Other: regression tests for the dev-rewrite combo form

Validation

CI checks:

Check Status Link
(filled after PR opens)

Unit tests:

go test ./cmd/obol ./internal/x402 ./internal/inference ./internal/tunnel ./internal/serviceoffercontroller ./internal/defaults -count=1
ok  github.com/ObolNetwork/obol-stack/cmd/obol
ok  github.com/ObolNetwork/obol-stack/internal/x402
ok  github.com/ObolNetwork/obol-stack/internal/inference
ok  github.com/ObolNetwork/obol-stack/internal/tunnel
ok  github.com/ObolNetwork/obol-stack/internal/serviceoffercontroller
ok  github.com/ObolNetwork/obol-stack/internal/defaults

Integration tests:

(not run — flow tests below are the end-to-end signal)

Flow tests:

Flow Network QA machine label Worktree Result Artifacts
flow-01 prerequisites n/a spark1 obol-stack-qa-20260512-195006-rs-with-pin PASS release-smoke-20260513-finalgreen-1-artifacts/flow-01-prerequisites.log
flow-02 stack-init-up local k3d spark1 same PASS flow-02-stack-init-up.log
flow-03 inference local spark1 same PASS flow-03-inference.log
flow-04 agent local spark1 same PASS flow-04-agent.log
flow-05 network base-sepolia (drpc paid) spark1 same PASS flow-05-network.log
flow-06 sell-setup local spark1 same PASS flow-06-sell-setup.log
flow-07 sell-verify local spark1 same PASS flow-07-sell-verify.log
flow-10 anvil-facilitator Anvil fork base-sepolia spark1 same PASS flow-10-anvil-facilitator.log
flow-08 buy local + Anvil fork spark1 same PASS flow-08-buy.log
flow-09 lifecycle local spark1 same PASS flow-09-lifecycle.log
flow-11 dual-stack USDC live Base Sepolia + x402.gcp.obol.tech spark1 same PASS flow-11-receipts/
flow-14 live OBOL live Base Sepolia + x402.gcp.obol.tech spark1 same PASS flow-14-receipts/
flow-13 fork OBOL Anvil fork OBOL + local x402-rs spark1 same PASS flow-13-receipts/

Release smoke:

ssh spark1
cd ~/obol-stack-qa-20260512-195006-rs-with-pin
git rev-parse HEAD                       # 4082961

RELEASE_SMOKE_INCLUDE_OBOL=true \
RELEASE_SMOKE_INCLUDE_OBOL_FORK=true \
OBOL_DEVELOPMENT=true \
OBOL_NONINTERACTIVE=true \
OBOL_LLM_ENDPOINT=http://127.0.0.1:8000/v1 OBOL_LLM_MODEL=qwen36-fast \
X402_FACILITATOR_SKIP_PULL=true \
RELEASE_SMOKE_RUN_ID=20260513-finalgreen-1 \
bash flows/release-smoke.sh

# Final marker:
__SMOKE_DONE_RC__=0
"Release smoke passed"
13/13 flows PASS, 0 FAIL lines

Live Chain Evidence

Network: Base Sepolia (eip155:84532) and Anvil fork of the same

RPC/provider: paid drpc.live load-balancer for Base Sepolia (lb.drpc.live/base-sepolia/[REDACTED], supplied via BASE_SEPOLIA_RPC env). eRPC gateway in-cluster routes mainnet, base, base-sepolia, hoodi per the embedded networks.

Facilitator: https://x402.gcp.obol.tech (production overlay 1.4.9, post obol-infrastructure#2612); host.k3d.internal:8404 for the local x402-rs in flow-10/13.

Contracts and tokens:

Name Address Version / notes
USDC (Base Sepolia) 0x036CbD53842c5426634e7929541eC2318f3dCF7e EIP-3009 transferWithAuthorization
OBOL (Base Sepolia) 0x0a09371a8b011d5110656ceBCc70603e53FD2c78 ERC-2612 + Permit2; flow-14 / flow-13
Permit2 0x000000000022D473030F116dDEE9F6B43aC78BA3 canonical
ERC-8004 IdentityRegistry (Base Sepolia) 0x8004A818BFB912233c491871b3d84c89A494BD9e flow-11 / flow-14 / flow-13 register

Wallet roles:

Role Address Source
Alice / seller / register 0xC0De030F6C37f490594F93fB99e2756703c4297E derived index 0 from REMOTE_SIGNER_PRIVATE_KEY (.env)
Bob / buyer / payer 0x57b0eF875DeB5A37301F1640E469a2129Da9490E derived index 1 from same key — pre-seeded into Bob's signer per flow invariant
Facilitator / receiver x402-rs facilitator key (Obol-operated) helm secret in obol-infrastructure

Balances and deltas: per-flow *-receipts/ directories under release-smoke-20260513-finalgreen-1-artifacts/ capture before/after for USDC and OBOL on each paid call. flow-11-dual-stack.sh and flow-13-dual-stack-obol.sh assert exact PAID_AMOUNT deltas in both directions and the run is green, which is the source of truth.

Transaction receipts:

Purpose Source artifact Notes
ERC-8004 registration flow-11-receipts/registration-receipt.json seller → IdentityRegistry, RegistrationRequested
Metadata / service offer flow-11-receipts/metadata-receipt.json controller-published /.well-known/agent-registration.json
Settlement transfer (USDC) inline in flow-11-dual-stack.log (settle tx hash) x402-buyer → Alice, USDC Transfer
Settlement transfer (OBOL live) inline in flow-14-live-obol-base-sepolia.log x402-buyer → Alice, OBOL Transfer
Settlement transfer (OBOL fork) flow-13-receipts/receipt-summary.json local x402-rs settlement on Anvil fork

Runtime Evidence

QA environment:

Item Value
OS / arch Linux aarch64 (spark1)
Backend k3d 5.8.3 on Docker, k3s v1.35.1-k3s1
Tool versions go 1.25, kubectl 1.35.3, helm 3.20.1, helmfile 1.4.3, foundry anvil 1.5.1-stable
QA agent/model Hermes default agent, model qwen36-fast via OpenAI-compatible vLLM on 127.0.0.1:8000

Images:

Component Image Tag / digest Source
frontend ghcr.io/obolnetwork/obol-stack-front-end v0.1.23@sha256:950b887e1cbaca9f928ff7b449b5602ed9777b629b4ee1b9c4c91fac2d74c2f2 this PR (#484 incorporated as c3a4bba)
x402-verifier ghcr.io/obolnetwork/x402-verifier locally rebuilt under OBOL_DEVELOPMENT=true (rewritten from embedded pin via 5a10fb8 + 1efbaab) this branch
serviceoffer-controller ghcr.io/obolnetwork/serviceoffer-controller locally rebuilt this branch
x402-buyer ghcr.io/obolnetwork/x402-buyer locally rebuilt (combo-form regex now correctly strips :tag@sha256:digest) this branch
LiteLLM ghcr.io/obolnetwork/litellm-fork sha-c16b156 embedded
Public facilitator ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay 1.4.9 obol-infrastructure#2612 (admin-merged)

Kubernetes / stack:

Item Value
Stack IDs three k3d clusters per smoke (default + Alice + Bob), all clean-deleted by the runner trap
Namespaces erpc, llm, x402, hermes-obol-agent, obol-frontend, monitoring, traefik (Alice + Bob mirrored in their own workspaces)
Pod readiness kubectl rollout status deploy/x402-verifier -n x402 --timeout=180s (per #478); flow-02 polls until Nodes Ready and eRPC /rpc
Cleanup result release-smoke EXIT trap reset default + alice + bob workspaces; orphan facilitator container also removed

Model and routing:

Item Value
Agent/model used paid/qwen36-fast (host vLLM via obol model setup custom)
LiteLLM route static paid/* → openai/* → http://127.0.0.1:8402/v1 plus per-PurchaseRequest hot-add
Paid endpoint status flow-11 step 43 + flow-14 step 50 + flow-13 step 50 all PASS with HTTP 200
Auth token source obol agent auth --runtime hermes obol-agent (skill invariant)

Artifacts and logs:

Artifact Location / link Notes
RELEASE_REPORT.md release-smoke-20260513-finalgreen-1-artifacts/RELEASE_REPORT.md (spark1) per-flow PASS/FAIL/SKIP + counts
Per-flow logs release-smoke-20260513-finalgreen-1-artifacts/flow-XX-NAME.log release-smoke tee output, paid-RPC tokens scrubbed
flow-11 receipts flow-11-receipts/ metadata + registration receipts
flow-14 receipts flow-14-receipts/ registration receipt
flow-13 receipts flow-13-receipts/ alice-mint + funding + anvil + facilitator + summary
Retrospective plans/release-smoke-hardening-20260513.md this branch — eight root causes documented

Demo readiness:

Item Status Notes
Seller visible / registered green flow-11 step 22 PASS — Agent ID minted on Base Sepolia, receipt archived; /.well-known/agent-registration.json reachable via tunnel
Buyer discovery works green flow-08 buy.py probe returns 402 pricing; flow-11 step 38 obol buy inference discovers Alice via ERC-8004
Paid route works green LiteLLM paid/qwen36-fast returns 200 (flow-11/14/13 paid call steps)
Settlement visible on-chain green flow-11 step 45/46 PASS — Transfer event observed, balance deltas exact; flow-13 receipt-summary.json shows Bob→Alice Transfer

Review Notes

Superseded PRs (close on merge of this branch):

Already-merged-to-main, listed for completeness only: #466, #469, #470, #471, #472.

Known gaps:

  • Four debug(...) log statements in the verifier and buyer (66d72c2, 2045198, 5ec24f5, 3dd7cc9) and the flow-11 cluster-state diag dump are intentionally left in this PR — they were the only thing that surfaced the EnsureVerifier root cause and remain useful operational signal. Each commit message marks them "Remove once root cause identified". Suggest stripping in a follow-up after one release cycle.
  • Upstream ObolNetwork/x402-rs arm64 prometheus-overlay packaging bug (1.4.9 arm64 manifest contains an amd64 binary). Local fix prepared at $CLAUDE_JOB_DIR/x402-rs-fix on branch fix/multiarch-overlay-arm64; awaits explicit user authorization to push. Workaround in this branch: X402_FACILITATOR_SKIP_PULL=true + locally-built arm64 image on QA hosts.
  • flow-14 Base Sepolia free-tier RPC throttling on Bob Permit2 approval is now mitigated by the paid-RPC support (ALCHEMY_BASE_SEPOLIA_API_KEY / BASE_SEPOLIA_RPC) and the warn_unpaid_base_sepolia_rpc preflight, but the underlying free-tier flakiness remains for users without a paid key.

Follow-ups:

  • Strip the four debug log statements once the new path is observed in production for one release cycle.
  • Add a .claude/skills/obol-stack-dev/ reference page on the helmfile vs EnsureVerifier deployment dual-path so this footgun doesn't bite again.
  • lib.sh assertion helper docs could explicitly call out "patterns are ERE; do not use BRE \|". The integration adds the conversions but the helper signature hasn't changed.

Reviewer focus:

  • internal/x402/setup.go (5a10fb8) — the in-memory kubectl apply rewrite is the key correctness change; please confirm the OBOL_DEVELOPMENT gate keeps production behaviour identical. Test in internal/x402/manifest_devmode_test.go.
  • internal/x402/chains.go (3cc2e7e) — ResolveChainInfo now accepts both legacy and CAIP-2; please confirm no other code path expects strict legacy-only.
  • internal/defaults/defaults.go (1efbaab) — combo-form regex; the test in internal/defaults/defaults_test.go (5764ad4) exercises all four pin shapes (longest-first alternation matters).
  • flows/flow-10-anvil-facilitator.sh (86588aa) — anvil flag change; reusing-existing-anvil branch now rejects any anvil that carries --prune-history.
  • flows/release-smoke.sh runner refactor (refactor(flows): collapse per-flow env injection in release-smoke runner #476, 128f544) — verify env "$env_name=$env_value" bash $flow is semantically identical to the prior inline VAR=val bash $flow form. Receipt-dir population end-to-end on spark1 proves it.
  • flows/flow-07-sell-verify.sh + flows/flow-08-buy.sh 402-body retry (b46f5d9) — confirms first-request flake on freshly-deployed verifier is absorbed by the 12×5s loop.

bussyjd and others added 30 commits May 12, 2026 16:05
run_flow now accepts optional env_name/env_value args, replacing three
hand-coded `if name == flow-XX` branches. env "VAR=val" bash $flow is
semantically identical to the inline `VAR=val bash $flow` form for
child-only export. Validated end-to-end on spark1: flow-11/13/14
each populated their FLOW{11,13,14}_ARTIFACT_DIR receipt subdirs.

Also:
- Defensive ${var:-0} on FAIL/SKIP grep counters so a missing log
  can't crash the [ "$var" -eq 0 ] arithmetic under set -e.
- Inlined the one-shot cleanup_default_stack_before_dual helper.
- Brief comment on why flow-10 sits between flow-07 and flow-08
  (flow-08 header explicitly requires flow-10's Anvil facilitator).
…terns

lib.sh's run_step_grep and poll_step_grep helpers both use `grep -qE`
(extended regex) — comment in the helper explicitly says callers may
use ERE quantifiers. But several callsites passed BRE alternation
(`\|`), which in ERE is a literal backslash-pipe and never matches.
The smoke run on spark1 surfaced this as cascading "pattern not found"
FAILs in flow-01 / flow-02 / flow-05 / flow-06 / flow-10, even though
the expected substrings were present in the CLI output.

Convert all 8 BRE call sites to ERE `|`. Also widen the patterns where
the CLI output has drifted since they were written:

  - `obol network list` now prints `Local Nodes:` / `Remote RPCs:` /
    `mainnet … chain=…` instead of network shortnames; the old
    `ethereum|aztec` pattern would never have matched the current
    output even with correct ERE syntax.
  - `obol network status` now prints `eRPC Gateway Status` / `Pod:` /
    upstream rows; added `Running` for completeness.
  - flow-10 facilitator-pricing check now also accepts the
    `configmap/x402-pricing configured` line that `obol sell pricing`
    actually emits.
flow-08:
  Drop the hard "agent wallet == deterministic Bob" assertion. Bob
  pre-seeding is the flow-11/13/14 dual-stack pattern; single-stack
  flow-08 uses whatever wallet `obol agent init` generated, and every
  downstream funding/signing step already uses $AGENT_WALLET directly.
  Keep BOB_WALLET derivation when REMOTE_SIGNER_PRIVATE_KEY is set and
  report a match for transparency, but PASS either way.

flow-07 + flow-10 x402-verifier readiness:
  Replace pod-counting loops with `kubectl rollout status
  deployment/x402-verifier`. The old loops counted every pod in the
  x402 namespace, so when stuck old ReplicaSets or the unrelated
  serviceoffer-controller sat in Pending (real condition observed on
  spark1 under host load), the loops never converged. rollout status
  is the authoritative readiness signal and only tracks the latest
  ReplicaSet. Bumped the timeout to 180s in both spots to absorb
  image-pull jitter on cold caches.
Two cold-start races surfaced on spark1's smoke run:

1. `obol stack up` returns once the k3s API responds, but the k3d
   node can take another few seconds to appear in `kubectl get nodes`.
   Old `run_step_grep "Nodes ready" "Ready" kubectl get nodes` raced
   that window and FAILed on "No resources found". Switched to
   `poll_step_grep` with 12×5s = 60s ceiling. Also tightened the
   pattern to ` Ready ` (surrounding spaces) so the word "NotReady"
   in the status column does not satisfy the match.

2. eRPC's HTTP listener becomes reachable before its upstream pool
   has fully resolved every alias. A one-shot GET /rpc moments after
   pods report Running often returns a partial list missing
   `base-sepolia`, which then cascades into the chains-OK and
   JSON-RPC checks. Poll the first eRPC assertion until base-sepolia
   appears (or 60s elapses); the subsequent assertions then have a
   stable list to reason about.
PR #470 removed the hardcoded "0.001" default from the --price flag so
that --per-mtok / --per-request can take effect when --price is not
passed. The flag-defaults table test still asserted the old default
"0.001" and broke on merge. Update it to expect empty string, matching
the post-#470 behavior.
Switch the frontend image reference from tag-only ("v0.1.23") to
tag+digest ("v0.1.23@sha256:950b887e1cbaca9f928ff7b449b5602ed9777b629b4ee1b9c4c91fac2d74c2f2").
The tag stays for human readability; the digest is authoritative.

Eliminates the mutable-tag attack surface flagged as a non-blocking
follow-up by the supply-chain review of v0.10.0-rc2. Multi-arch index
digest covers linux/amd64 and linux/arm64.

Renders to a valid OCI reference via the obol/obol-app chart
"obol-app.image" helper (verified locally with helm template).
…tion' into integration/release-smoke-hardening-20260512
…n' into integration/release-smoke-hardening-20260512
…ifier-readiness' into integration/release-smoke-hardening-20260512
…art-polling' into integration/release-smoke-hardening-20260512
…-align' into integration/release-smoke-hardening-20260512
…1.23' into integration/release-smoke-hardening-20260512
…tes are on

flow-11, flow-14, and flow-13 each enforce a preflight that rejects the
run unless OBOL_LLM_ENDPOINT points at an OpenAI-compatible vLLM /
llama.cpp endpoint. Without this guard the runner spends ~30 min on the
baseline flows before those three FAIL on their preflights, which makes
release-smoke results look worse than they are.

Add a fail-fast guard in the runner: when RELEASE_SMOKE_INCLUDE_OBOL or
RELEASE_SMOKE_INCLUDE_OBOL_FORK is true, OBOL_LLM_ENDPOINT must be set.
Otherwise exit 2 with a copy-pasteable env block pointing at the QA
references.

Surfaced by a spark1 dry-run where the dual-stack flows guaranteed FAIL
because the runner inherited a host-Ollama context (no QA vLLM) — three
flows wasted setup work to fail at their preflight.
Anvil's default bind is 127.0.0.1, which makes it reachable from the
host (so flow-10's own health check passes) but NOT from inside the
k3d cluster via host.k3d.internal:8545 (Connection refused).

That mismatch was the root cause of the "Payment verification failed
503" we saw cascading through flow-08 / flow-11 / flow-14 / flow-13
in every smoke run on every host:

  1. flow-10 starts anvil at 127.0.0.1:8545 + reports PASS (host-local
     health check works).
  2. eRPC's `custom-84532-0` upstream points at host.k3d.internal:8545
     and silently times out on every base-sepolia call from cluster pods.
  3. `buy.py` inside the obol-agent pod calls eRPC to read the buyer
     wallet's balance; the eth_call times out with eRPC's
     "network 'evm:84532' for project 'rpc'" error.
  4. buy.py exits before creating the PurchaseRequest CR, so the
     serviceoffer-controller never hot-adds paid/<model> into LiteLLM.
  5. The smoke's paid inference step requests model
     `paid/qwen3.5:9b` (or `paid/qwen36-fast`), LiteLLM has no such
     model_group, and returns 503 wrapped as
     `ServiceUnavailableError: Payment verification failed` — which
     looks like a verifier rejection but is really "no model in
     LiteLLM".

Validation:
  - Anvil bound 0.0.0.0:8545 (confirmed via ss -tlnp).
  - eth_chainId from host.k3d.internal:8545 inside an in-cluster pod
    returned 0x14a34 (84532) — was Connection refused without the flag.

This single character + flag flip is what was breaking the entire
paid-commerce side of the smoke. Five flow FAILs trace back to it.
…verlay

All 12 references to `ghcr.io/x402-rs/x402-facilitator:1.4.7` updated to
`ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.9`.

This is the ObolNetwork-published fork of x402-rs 1.4.9 with the
Prometheus metrics overlay applied. arm64+amd64 manifest available
(verified via `docker manifest inspect`).

Updated in:
  - flows/lib.sh (release-smoke shared image const)
  - internal/testutil/facilitator_real.go (integration-test image)
  - flows/flow-10-anvil-facilitator.sh
  - flows/flow-12-obol-payment.sh
  - flows/flow-13-dual-stack-obol.sh
  - flows/README.md (operator docs)
  - internal/openclaw/monetize_integration_test.go (4 doc comments)

Surfaced while root-causing the cluster Payment-verification-503 cascade
across flow-08/11/14/13: the locally-built x402-buyer:latest image was
5 days stale (pre-PR #377 nested-format support), and pinning the
facilitator to the current fork keeps the verify/settle path consistent
with the buyer code as we re-validate the smoke.
…preflight

Two changes that close the loop on the release-smoke 503 investigation:

1. Pin x402-buyer image to a sha256 digest (multi-arch manifest list,
   amd64+arm64) in base/templates/llm.yaml. Tag-only pinning let the
   local-build path silently reuse a 5-day-old `:latest` image; that
   stale buyer pre-dated PR #377's nested-format support and serialized
   X-PAYMENT with empty authorization fields → facilitator /verify 400
   → 503 cascade across flow-08 / flow-11 / flow-14 / flow-13. Same
   pattern the verifier and serviceoffer-controller already use (they
   were @sha256-pinned before commit e157b10 dropped them to short-SHA
   tags).

2. Add a fail-fast preflight in flow-10 that probes anvil from inside
   the k3d cluster via host.k3d.internal:8545. Host-local health checks
   passing is necessary but not sufficient: default anvil binds to
   127.0.0.1 only and pods get Connection refused, which silently broke
   every paid-flow run on every host until the spark1 investigation
   surfaced it. Now the next operator that forgets `--host 0.0.0.0`
   sees a clear failure naming the root cause instead of chasing a
   misleading 503 through buy.py / verifier / facilitator.
Upstream ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.9
ships an amd64 binary inside its arm64 manifest variant (cross-build
packaging bug in ObolNetwork/x402-rs). Pulling on arm64 hosts produces
'exec format error' on first ENTRYPOINT exec, which manifests as flow-10
FAIL + 503 cascade across flow-08/11/14/13.

Add X402_FACILITATOR_SKIP_PULL=true + X402_FACILITATOR_IMAGE override
so QA hosts can build the overlay locally and run release-smoke without
touching the broken registry image. The upstream Dockerfile fix is in
flight; remove these knobs once the registry image is correct for arm64.
Free-tier Base Sepolia RPCs (drpc.org, sepolia.base.org) routinely
return HTTP 408 'Request timeout on the free tier' under release-smoke
load. Symptoms in recent runs:
- flow-11 step 8: 'Could not read Alice starting USDC balance'
  (single cast call, no retry, hit transient rate limit)
- flow-13 step 9: 'Facilitator did not become reachable'
  (anvil fork origin throttled while syncing chain state)
- flow-14 balance reads intermittently empty

The deterministic Bob wallet also gets drained across smoke runs that
complete any paid commerce step (each successful purchase spends from
Bob), so flow-14 'Bob signer OBOL balance 0 wei is below threshold'
required a manual top-up between runs.

Four small changes:

1. base_sepolia_rpc_candidates() honors ALCHEMY_BASE_SEPOLIA_API_KEY
   and prepends https://base-sepolia.g.alchemy.com/v2/<key> when set.
   The full URL is only emitted to the RPC call itself; logs use the
   existing redact_url_for_log helper.

2. release-smoke runner adds warn_unpaid_base_sepolia_rpc() preflight
   that loudly prints which steps are likely to flake when OBOL/OBOL_FORK
   gates are on without a paid RPC.

3. fund_bob_from_alice_if_needed() in lib.sh: ERC-20 transfer
   (required - balance) from Alice (SIGNER_KEY) to Bob when Bob's token
   balance is below the per-flow required threshold. Token-agnostic;
   wired for USDC in flow-11 and OBOL in flow-14.

4. flow-11 wraps the Alice starting-USDC cast call in the same 5-retry
   pattern Bob USDC + Alice ETH already had.
The runner now pipes every flow's stdout+stderr through scrub_secrets
before both the terminal and the on-disk artifact log. Caught today
when 'obol network add base-sepolia --endpoint $BASE_SEPOLIA_RPC'
echoed the full RPC URL with the embedded Alchemy API key into
flow-11-dual-stack.log:

  ==> Adding custom RPC for base-sepolia (chain ID: 84532):       https://base-sepolia.g.alchemy.com/v2/<KEY>

Patching individual call sites would have left every other CLI line
with a similar bug a leak; the runner-level filter is single-place and
covers anything new.

Patterns redacted:
  - alchemy.com/v2/<key>
  - infura.io/v3/<key>
  - quiknode.pro/<token>
  - any URL query string ?dkey=<token>      (drpc paid)
  - PRIVATE_KEY=0x<64 hex>                  (anchored on the literal)

The patterns are additive — anything we ever want to keep out of QA
logs goes here.
Add lb.drpc.live/<network>/<token> redaction pattern alongside the
existing Alchemy/Infura/QuickNode patterns. drpc paid endpoints carry
the token in the URL path (not as ?dkey= query) so the existing
?dkey= scrubber would miss them.
…acilitator

The public facilitator at x402.gcp.obol.tech ships /supported with two
distinct entries for Base Sepolia:

  {"x402Version":1, "network":"base-sepolia"}     (legacy name)
  {"x402Version":2, "network":"eip155:84532"}     (CAIP-2)

Our verifier sends the /verify body with X402Version=2 (line 239 of
forwardauth.go) but was passing offer.Spec.Payment.Network straight
through as RouteRule.Network — i.e. the literal string the operator
typed on the CLI ("base-sepolia"). The mismatch (v2 envelope +
v1 network string) hit the v2 strict-match path in the facilitator
and returned HTTP 400 invalid_format:

  x402: facilitator verify error: facilitator verify failed (400): invalid_format

Local facilitators (single-chain anvil fork) were lenient enough to
accept the loose form, which is why flow-08 PASS but flow-11/14 FAIL.

Fix: normalize through the existing public x402.NormalizeNetworkID
helper (chains.go:151) before stamping the RouteRule. That helper
returns CAIP-2 form for known names and passes already-CAIP-2 values
through unchanged, so this is no-op for operators who already use
"eip155:84532" but rescues every operator who used the friendly name.
Temporary instrumentation to surface the exact wire payload during
the post-pivot 503 investigation. Public x402.gcp.obol.tech 1.4.9
overlay returns 400 invalid_format even after the CAIP-2 network
normalization fix (fd95dc5), so we need to see the actual JSON the
verifier sends to find what else the facilitator rejects.

Also adds TestRoutesFromStore_NetworkCAIP2Normalization which confirms
the fd95dc5 fix correctly maps legacy network names to CAIP-2
('base-sepolia' -> 'eip155:84532') in the published RouteRule.

Remove both log.Printf calls once the root cause is identified.
Before bash trap teardown wipes the k3d clusters, dump the Alice/Bob
verifier + buyer logs and the ServiceOffer/PurchaseRequest CR state
into FLOW11_ARTIFACT_DIR/flow11-step43-debug/. Without this snapshot,
kubectl logs returns 'connection refused' immediately after the FAIL
line lands and the /verify outbound body + facilitator response are
lost.
Surface which field (network/payTo/asset/amount/auths) rejects the
payment requirement so we can pinpoint why the buyer falls through
to forwarding-without-X-PAYMENT after the CAIP-2 normalization fix.

Remove once root cause identified.
The buyer was pinned in llm.yaml as 'x402-buyer:b13254e@sha256:446d...'.
The old dev-rewrite regex matched only the ':b13254e' part, leaving the
'@sha256:<digest>' suffix intact. Docker honors the digest over the tag
on a pull, so the local-build path was silently bypassed — every
'OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=x402-buyer' run was a no-op against
the registry image.

Effect: source changes to internal/x402/buyer/* never reached the running
sidecar in OBOL_DEVELOPMENT=true mode (root cause of the missing CanSign
DENY logs during flow-11 step 43 chase). Verifier+controller weren't
affected because llm.yaml was the only file using the combo form.

Fix: prepend a ':<tag>@sha256:<digest>' alternative to the regex so the
engine consumes the whole pin before falling through to the shorter
':<tag>' or '@sha256:<digest>' alternatives. Verified against all four
pin shapes (combo / digest-only / tag-only / :latest no-op).
The CLI was echoing the full --endpoint URL when adding a custom RPC,
which leaked Alchemy/Infura/drpc/QuickNode API keys into release-smoke
logs (flow-11 captured 'https://lb.drpc.live/base-sepolia/<token>' in
plaintext today). The release-smoke runner has a scrub_secrets filter
that catches this at log-write time, but standalone flow invocations
and direct interactive use don't go through that filter.

Move redaction into the CLI itself so the leak surface narrows to:
- the underlying eRPC ConfigMap (cluster-internal, RBAC-gated)
- the dev's shell history of the original 'obol network add ...' call

Patterns covered (mirror flows/lib.sh scrub_secrets so coverage stays
in lockstep across the two layers).
To pinpoint why the previous run's verifier log shows only middleware
WARNING entries (no /verify outbound body, no /verify response status).
Adds a startup-of-handler log so we can tell whether the buyer is
forwarding requests without X-PAYMENT (middleware short-circuits to 402
before reaching the facilitator) or whether the verifier never sees the
request at all.

Remove once root cause is identified.
Locks in the fix from 1efbaab so a future contributor doesn't
accidentally regress the rewrite back to the shorter ':<hex>'-only
regex. The combo form 'image:tag@sha256:digest' must be rewritten to a
clean ':latest' with NO orphan '@sha256:...' suffix, otherwise Docker
honors the digest and silently bypasses the local build (root cause of
the missing debug-log saga during flow-11 step 43 chase, May 2026).
…EVELOPMENT

EnsureVerifier read the embedded x402.yaml verbatim and 'kubectl apply'-ed
it, overwriting whatever helmfile sync had deployed with the embedded
':b13254e' image pins. Under OBOL_DEVELOPMENT=true this silently
bypassed the local-build path: the rewrite-to-:latest that the defaults
pipeline performs on the on-disk file copy doesn't reach the embedded
bytes EnsureVerifier loads.

Effect on flow-11 step 43 chase: Alice's verifier pod ran ':b13254e'
(stale registry image) despite the on-disk x402.yaml saying ':latest'
and OBOL_FORCE_REBUILD_LOCAL_DEV_IMAGES=true producing a fresh local
binary. The fresh debug logs I added to forwardauth/verifier source
were in the local binary but never reached the deployed pod, which is
why hours of debug-instrumentation runs produced zero new diagnostic
output. The cluster was running 5-day-old code.

Fix: apply the same rewrite-to-:latest patterns to the in-memory
manifest before kubectl-apply, gated by OBOL_DEVELOPMENT=true. The
patterns mirror internal/defaults.rewriteDevDigestPins (duplicated
locally to avoid an import cycle); a TestX402Manifest_DevModeRewritesPins
regression test locks the parity.

Production behavior unchanged: without OBOL_DEVELOPMENT=true the
embedded immutable pins are kubectl-applied verbatim.
The previous strict match for 'remaining=5' fails when the buyer's
auth pool ends up at 10 (controller merge on rerun of the
PurchaseRequest, or pre-existing auths from a prior flow-14 attempt).
The semantic we actually want to verify is 'buy provisioned at least
the requested count', not 'exactly the requested count'.
bussyjd added 4 commits May 13, 2026 04:24
fd95dc5 normalized RouteRule.Network to CAIP-2 form (eip155:84532)
before this resolver was taught to map CAIP-2 ids back to ChainInfo.
The pre-resolution at verifier.go:53 then failed silently:
ResolveChainInfo('eip155:84532') returned 'unsupported chain', the
chain registry never gained an entry for the CAIP-2 key, and
matchPaidRouteFull returned (..., false) on every paid request — 404
where the smoke test expected 402. flow-11 step 21 "Alice: 402 gate"
went from passing to failing.

Add each CAIP-2 id as an additional alias in the case arms so both
legacy names and CAIP-2 ids resolve to the same ChainInfo. Also extend
the unsupported-chain error message to mention CAIP-2 acceptability.
The previous '--prune-history 1000000' was added under the
misunderstanding that it requested 1M blocks of state retention. anvil's
docs state the opposite:

  --prune-history [<PRUNE_HISTORY>]
      Don't keep full chain history. If a number argument is specified,
      at most this number of [states are retained].

So passing the flag ENABLES pruning. With it set, anvil pruned the
fork-block state and the local x402-rs facilitator's eth_getStorageAt
for the EIP-3009 nonce slot returned HTTP 500
'state at block #N is pruned', failing flow-08 step 12 paid-inference
with a misleading 'Payment verification failed' 503.

Drop the flag entirely. Without it, anvil keeps full history from the
fork block onward, which is what flow-08's facilitator path expects.
Update the existing-anvil reuse check + the historical-state assertion
message to match.
The first POST to /services/<name>/* immediately after the verifier
deployment becomes Ready can return an empty body / Bad Gateway from
Traefik because the in-cluster HTTPRoute is wired but the verifier's
serviceoffer-source watcher hasn't yet loaded the route into its
in-memory map. Subsequent requests are fine.

Wrap both step assertions in a 12x5s retry loop that breaks the
moment the response parses as JSON.
Outcome of the multi-day chase that took the smoke from 'flow-11 step 43
broken across the board' to 13/13 PASS on spark1. Documents the eight
distinct root causes, the matching commits in this branch, and the
out-of-scope follow-ups (upstream x402-rs multi-arch fix, debug-log
cleanup, OBOL_DEVELOPMENT deployment-story doc).
Comment thread cmd/obol/network.go Fixed
Comment thread cmd/obol/network.go Fixed
Comment thread internal/x402/verifier.go Fixed
bussyjd added 5 commits May 13, 2026 12:38


Three CodeQL findings introduced by this branch:

1. internal/x402/verifier.go:165 (go/log-injection, error)
   Debug log printed r.URL.Path (user-controlled) without sanitization.
   The log was added during the EnsureVerifier root-cause hunt; the
   root cause is now identified and fixed in 5a10fb8, so the log can
   come out cleanly.

2-3. cmd/obol/network.go:585-586 (go/regex/missing-regexp-anchor x2)
   redactRPCURL used unanchored substring regexes (alchemy.com/v2/...,
   infura.io/v3/..., quiknode.pro/..., lb.drpc.live/.../..., dkey=...)
   to find tokens to scrub. CodeQL flagged this as risky because an
   attacker-controlled URL like https://evil/?u=https://alchemy.com/v2/key
   would only redact the embedded form, not the host context.

   Refactored to:
   - Parse the input with net/url
   - Match against u.Host (host anchor, not substring)
   - Substitute the redaction in-place on the raw string so the
     original [REDACTED] characters and query parameter order are
     preserved (avoids u.URL.String() percent-encoding [/])

   TestRedactRPCURL still passes against the same eight cases.
Per follow-up review: the previous redaction surfaced the full path
prefix (e.g. https://base-sepolia.g.alchemy.com/v2/[REDACTED]) which
still leaks the network selector and provider plan tier in logs.

Tighten both the CLI redactor and the runner-level scrubber so paid-RPC
URLs collapse all the way down to the provider's registered domain:

  https://base-sepolia.g.alchemy.com/v2/abc123
    -> https://[REDACTED].alchemy.com/[REDACTED]

  https://lb.drpc.org/ogrpc?network=base-sepolia&dkey=XYZ
    -> https://[REDACTED].drpc.org/[REDACTED]?[REDACTED]

Subdomain, path, query, and fragment are all redacted; scheme + TLD
are preserved so the operator can still tell which provider they
pointed at without leaking which network, instance, or api key.

Recognized provider TLDs: alchemy.com, infura.io, quiknode.pro,
drpc.live, drpc.org. Hosts that don't match are returned unchanged.

cmd/obol/network.go::redactRPCURL and flows/lib.sh::scrub_secrets
stay in lockstep — both updated, with matching test coverage in
cmd/obol/redact_rpc_url_test.go.
Cuts net 748 lines (1494 deletions, 746 insertions) by collapsing
overlap and deleting OpenClaw-era deep dives that no longer apply now
that Hermes is the default runtime.

Restructure
- SKILL.md now leads with a 9-row "Hard-Won Lessons" lookup table for
  the release-smoke 2026-05-13 root causes (dev-rewrite bypass, CAIP-2
  resolver mismatch, anvil flag misuse, combo-form regex, anvil bind
  scope, public facilitator pivot, free-tier RPC throttling, first-
  request flake, x402-rs arm64 packaging). Each row points at the
  symptom, the file, and the commit that fixed it.

References (collapsed 10 → 7)
- dev.md            ← merged dev-environment + obol-cli, condensed CLI surface
- llm-routing.md    ← merged litellm-routing + qa-model-envs + overlay-generation,
                      cut OpenClaw-specific overlay walkthroughs
- paid-flows.md     ← merged live-obol-qa + paid-commerce, sharpened success
                      criteria + Bob derivation
- release-smoke-debugging.md (NEW) — symptom → root-cause catalog with the
  "first steps when smoke goes red" checklist
- remote-qa.md      ← unchanged
- integration-testing.md  ← trimmed prose, kept matrix + timing budgets
- troubleshooting.md ← cut OpenClaw stack traces, added CA-bundle and
  PurchaseRequest-stuck recipes

Scripts (NEW)
- scripts/derive-bob.sh             — Bob deterministic 2nd-derived wallet + balance
- scripts/check-deployed-images.sh  — first diagnostic when smoke breaks
- scripts/repopulate-x402-cabundle.sh — manual CA bundle repopulate

Goal: future runs through this skill spend tokens on work, not re-deriving
facts the previous run already paid for.
The three scripts under .agents/skills/obol-stack-dev/scripts/ were
parallel implementations of logic that already lives in flows/lib.sh
or that the markdown documents inline:

- derive-bob.sh: duplicated the cast-keccak derivation already shown
  in references/paid-flows.md; flows/lib.sh::fund_bob_from_alice_if_needed
  is the version any flow actually consumes.
- repopulate-x402-cabundle.sh: the kubectl create-configmap recipe is
  already inline in references/troubleshooting.md, and obol stack up +
  obol sell http call x402verifier.PopulateCABundle automatically.
- check-deployed-images.sh: five kubectl get lines already documented
  inline in references/release-smoke-debugging.md "First Steps".

Net: -121 lines, no information lost, no external snippets to drift
out of sync with flows/lib.sh or the actual code paths.

Skill SKILL.md "Editing This Skill" updated to make the rule explicit:
inline shell snippets in markdown; do not ship parallel implementations.
Six fixes flagged during the obol-stack-dev skill rebuild:

CLI surface
- model: was "setup, status"; now lists "setup, setup custom, prefer,
  sync, list, remove, status" (setup custom is the canonical path
  for OBOL_LLM_ENDPOINT smoke flows)
- agent: add "auth --runtime <runtime> obol-agent" as the canonical
  Bearer-token command
- hermes: mark legacy "token" subcommand as not-recommended (it can
  print CLI usage text and poison the Bearer); point to `obol agent
  auth` instead

Build/test
- Differentiate the legacy local internal/openclaw integration matrix
  (Ollama-driven) from the release-gate flows 11/13/14 which require
  OBOL_LLM_ENDPOINT against vLLM/llama.cpp. Add the canonical
  release-smoke invocation.

Pitfalls (release-smoke 2026-05-13 root causes)
- Add "first diagnostic" header: confirm the deployed image with
  kubectl get deploy -o jsonpath='{...image}' before reading verifier
  code. A 503 from the verifier is almost never a real verifier bug.
- Rename pitfall 2 from openclaw-monetize-binding to the Hermes
  default obol-agent-monetize-rbac (legacy name kept as a footnote).
- New pitfall 9: EnsureVerifier overwrites helmfile under
  OBOL_DEVELOPMENT=true (the multi-hour bug — 5a10fb8).
- New pitfall 10: CAIP-2 vs legacy chain id mismatch in chains.go
  (silent 404 on every paid request).
- New pitfall 11: anvil --prune-history is enable-pruning, not
  retention; also --host 0.0.0.0 for cluster reachability.
- New pitfall 12: combo-form image-pin regex (:tag@sha256:digest must
  be the longest alternation arm).
- New pitfall 13: free-tier Base Sepolia RPC throttling under
  release-smoke load; BASE_SEPOLIA_RPC / ALCHEMY_BASE_SEPOLIA_API_KEY
  env support; secret scrubbing collapses paid-RPC URLs to TLD only.
- New pitfall 14: first-request flake on freshly-deployed verifier
  (12x5s retry in flow-07/08).
- Pointer to the fuller catalog at
  .agents/skills/obol-stack-dev/references/release-smoke-debugging.md.

Personal-data hygiene
- Replace hardcoded /Users/bussyjd/... checkout paths in "Related
  Codebases" with repository slugs + env override variables. Mention
  CLAUDE.local.md for users who want absolute paths checked in
  locally.
@bussyjd bussyjd merged commit 6f1f9ed into main May 13, 2026
6 checks passed
bussyjd added a commit that referenced this pull request May 13, 2026
Phase 7 of the post-#490 integration plan. Documents five attempts of
the new flows/buy-external.sh harness against the production seller at
https://inference.v1337.org/services/aeon as Bob (the deterministic
2nd-derived wallet from .env REMOTE_SIGNER_PRIVATE_KEY).

Outcome:
- Probe + 402 + accepts validation: green
- Permit2 authorization signing: green (1 auth for 0.023 OBOL)
- PurchaseRequest CR creation in hermes-obol-agent: green
- Controller reconciliation: BLOCKED — observedGeneration never advanced
  past 0; kubectl exec session SIGKILLed (exit 137). Hypothesis: the
  serviceoffer-controller does not reconcile PRs whose endpoint does
  not match any in-cluster ServiceOffer.spec.upstream (external-seller
  mode is unimplemented).
- Settlement: not attempted (gated on PR Ready). Bob OBOL balance
  unchanged across the test (4.978 OBOL).

Surfaced and fixed four bugs along the way:
- k3d 32-char cluster-name cap (7554b5e)
- CAIP-2 vs legacy chain id mismatch in the harness probe (3d8e231)
- Cloudflare WAF blocking Python-urllib UA in buy.py (c2dddc1)
- Stale .build/obol vs freshly-rebuilt .workspace/bin/obol (operator-
  level fix, not committed code — follow-up #2 in the report)

Includes follow-ups for the controller-side external-seller gap, the
harness binary-path normalization, a CF-WAF UA note for the
troubleshooting reference, and adding a KEEP_CLUSTER_ON_FAIL knob so
the next external-seller failure has live diagnostic state to
inspect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants