chore(buy-external): preserve cluster on FAIL, normalize obol-bin path, document CF-WAF UA #493
Merged
bussyjd merged 5 commits on May 15, 2026
…ot on FAIL

When `flows/buy-external.sh` fails (typically at step 14, the `buy.py buy` invocation), the existing `external_cleanup` immediately tears the cluster down, destroying the only places that record why the PurchaseRequest never advanced (controller logs, PR `status.conditions[]`, sidecar `/status`).

This commit:

- Adds `external_snapshot_on_fail()`: best-effort capture of controller logs (current + `--previous`), PurchaseRequest YAML across all namespaces, buyer sidecar `/status` (via `kubectl exec ... python3` against the litellm container, because the buyer container is distroless), `cluster-pods.txt`, and recent `cluster-events.txt`. All commands are wrapped in `|| true` so a single failure doesn't abort the bundle. Empty/failed files are removed.
- Calls the snapshot from `external_cleanup` BEFORE any teardown, on the failure path only; clean exits keep the existing fast-cleanup behavior.
- Honors `KEEP_CLUSTER_ON_FAIL=1` (default unset): when set, skips `bob stack down` after the snapshot bundle is written and prints the preserved stack id + artifact dir + manual cleanup hint.

Unblocks investigation of v1337-style external-seller failures documented in plans/inference-v1337-buy-report-20260514.md.
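
The best-effort capture described above could be sketched as follows. This is a Python illustration of the pattern only, not the shell helper from this commit: `snapshot_on_fail`, its parameters, and the `EXAMPLE_CAPTURES` mapping are hypothetical names, and the two `kubectl` commands are assumed stand-ins for the real capture list.

```python
import subprocess
from pathlib import Path

# Hypothetical capture list: output file name -> argv. The real helper
# captures more (controller logs, PurchaseRequest YAML, sidecar /status).
EXAMPLE_CAPTURES = {
    "cluster-pods.txt": ["kubectl", "get", "pods", "-A", "-o", "wide"],
    "cluster-events.txt": ["kubectl", "get", "events", "-A"],
}


def snapshot_on_fail(artifact_dir, captures, timeout=30):
    """Run each capture best-effort and return the names of non-empty outputs.

    A failing, missing, or hanging command is swallowed, which plays the
    role of the `|| true` wrapping in the shell version.
    """
    out_dir = Path(artifact_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for name, argv in captures.items():
        target = out_dir / name
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            if result.returncode == 0:
                target.write_text(result.stdout)
        except (OSError, subprocess.TimeoutExpired):
            pass  # one failed capture must not abort the rest of the bundle
        # Remove empty outputs so the bundle only contains real data.
        if target.exists() and target.stat().st_size == 0:
            target.unlink()
        if target.exists():
            kept.append(name)
    return kept
```

Calling `snapshot_on_fail("artifacts", EXAMPLE_CAPTURES)` before any teardown would leave only the files whose commands succeeded and produced output.
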
…otstrap

`bootstrap_flow_workspace` previously copied unconditionally from the caller-supplied path (always `$OBOL_ROOT/.build/obol`). When iterating on embedded skill content (e.g. `internal/embed/skills/buy-x402/scripts/buy.py`) it's easy to rebuild one of the two binaries and forget the other, silently baking pre-fix files into the cluster PVC via `syncObolSkills`. Burned six hours during the v1337 live-buy investigation (attempt 5 in plans/inference-v1337-buy-report-20260514.md).

Now: stat both paths, pick the one with the larger mtime, and emit a 5-line WARN to stderr when the two differ by more than 5 minutes: header + both paths-with-mtimes + which one was picked + a one-line rebuild nudge.

Cross-OS stat is handled via `stat -c %Y` with a `stat -f %m` fallback. The date is formatted with `date -r <file>` (BSD/macOS friendly), with a GNU `date -u -d "@<epoch>"` fallback.

Contract preserved (no return value, copies into `$dir/bin/obol`).
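
The selection rule above can be illustrated with a short Python sketch. It is not the `flows/lib.sh` implementation: `pick_freshest` and `STALE_THRESHOLD_SECONDS` are hypothetical names, the WARN wording is invented, and the handling of a missing candidate is an assumption this PR does not describe.

```python
import os
import sys
import time

STALE_THRESHOLD_SECONDS = 5 * 60  # the 5-minute gap from the commit message


def pick_freshest(build_path, workspace_path, threshold=STALE_THRESHOLD_SECONDS):
    """Return the existing path with the larger mtime; warn on large gaps."""
    candidates = []
    for path in (build_path, workspace_path):
        try:
            candidates.append((os.stat(path).st_mtime, path))
        except OSError:
            continue  # assumption: a missing candidate is simply skipped
    if not candidates:
        raise FileNotFoundError("neither candidate binary exists")
    _, picked = max(candidates)
    if len(candidates) == 2:
        gap = abs(candidates[0][0] - candidates[1][0])
        if gap > threshold:
            # Five lines: header, both paths with mtimes, the pick, a nudge.
            print("WARN: obol binaries differ in age", file=sys.stderr)
            for mtime, path in candidates:
                stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(mtime))
                print(f"WARN:   {path} (mtime {stamp})", file=sys.stderr)
            print(f"WARN: using {picked}", file=sys.stderr)
            print("WARN: rebuild the older binary to keep both in sync",
                  file=sys.stderr)
    return picked
```

Using `os.stat` sidesteps the `stat -c` versus `stat -f` split that the shell version has to handle.
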
Adds entry #10 to the release-smoke debugging reference covering the HTTP 403 + Cloudflare error 1010 we hit on v1337 attempts 3–4: managed WAF rules block the default `Python-urllib/X.Y` UA. Documents the buy.py fix (commit c2dddc1) plus the unconfirmed-but-likely Go-side follow-up at internal/serviceoffercontroller/purchase.go:183, where Go's `http.Client` defaults to `User-Agent: Go-http-client/1.1` and may hit the same WAF block on the controller probe.
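
For illustration, a minimal sketch of overriding the default `Python-urllib/X.Y` User-Agent with the standard library. The UA string and the helper names `build_request` and `fetch` are hypothetical; the value used by the actual buy.py fix in c2dddc1 is not shown in this PR.

```python
import urllib.request

# Hypothetical UA string, chosen only for this example.
USER_AGENT = "example-buyer/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch `url` with the explicit User-Agent and return the body bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Whether a given UA string passes a particular managed WAF rule set depends on that rule set, so the value should be checked against the endpoint in question.
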
Re-ran the v1337 buy with the new KEEP_CLUSTER_ON_FAIL=1 knob (commit b749f95). The controller reconciled the PurchaseRequest in 55 seconds through Probed → AuthsLoaded → Configured → Ready, against the same external endpoint the original report failed on.

The original report's central technical claim, "serviceoffer-controller does not reconcile PurchaseRequests for external sellers", is false. The controller is endpoint-agnostic by design (verified by code review of internal/serviceoffercontroller/purchase.go). Attempt 5's reconcile-hang was almost certainly a kubectl-exec session SIGKILL (exit 137), not a controller bug; the likely cause is harness-side run_with_timeout firing while buy.py was still polling normally.

Today's run did surface a real but unrelated quirk: LiteLLM's POST /model/new fails with EROFS because /etc/litellm/config.yaml is mounted read-only as a Kubernetes ConfigMap volume; the controller catches this and falls back to ConfigMap reload, which works fine. Pre-existing, worth one line in paid-flows.md so the next debugger isn't startled.

Step 18 (paid request) failed for an operator-error reason: I picked qwen3.6-27b as the upstream model id, but v1337's vLLM serves under a different name. Bob's 0.023 OBOL was NOT consumed (LiteLLM 404'd before the buyer sidecar could settle).

Companion to plans/inference-v1337-buy-report-20260514.md. Retracts follow-up #1 of that report.
…_FAIL knob

Replaces the opt-in KEEP_CLUSTER_ON_FAIL=1 env knob (added in b749f95) with an unconditional rule: cleanup happens iff every step passes. On FAIL, snapshot the diagnostic bundle and preserve the cluster, every time, with no env override needed.

Also inverts the prior success-side default. The previous design left the cluster up on success "so the operator can poke around"; in practice operators re-ran the harness from scratch when they wanted fresh state, and the leftover cluster mostly leaked across runs. With the new gate, a green run leaves a clean machine.

Net behavior:

- success → bob stack down (clean state for next run)
- failure → snapshot + preserve (operator pays one manual teardown when done diagnosing)

The diagnostic snapshot helper from b749f95 is unchanged; only the preservation gate moved from an env knob to the implicit pass/fail state.
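
The green-only gate can be sketched as a small Python function. This is an illustration of the rule, not the harness code: `run_flow`, the `(name, step)` pairs, and the `snapshot`/`teardown` callables are hypothetical stand-ins for the flow steps, `external_snapshot_on_fail`, and `bob stack down`.

```python
def run_flow(steps, snapshot, teardown):
    """Run steps in order; tear down only if every step passes.

    `steps` is a list of (name, callable) pairs where each callable
    returns True on success. An exception counts as a failure.
    """
    for name, step in steps:
        try:
            ok = bool(step())
        except Exception:
            ok = False
        if not ok:
            try:
                snapshot()  # best-effort diagnostics before anything else
            except Exception:
                pass
            # No teardown on this path: the cluster stays up for diagnosis.
            return {"passed": False, "failed_step": name, "cluster": "preserved"}
    teardown()  # green run: leave a clean machine for the next run
    return {"passed": True, "failed_step": None, "cluster": "removed"}
```

Because the gate is the pass/fail result itself, there is no environment variable to forget on the run that matters.
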
Summary
Phase-1 polish + investigation follow-ups from `plans/inference-v1337-buy-report-20260514.md`. Net result: the report's central technical claim ("controller doesn't reconcile external-seller PurchaseRequests") is retracted; re-running with the new diagnostic gate proved the controller is endpoint-agnostic by design.

Stacks on #492 (folds in #487 + #489 + post-490 cleanups). Targets `integration/post-490-cleanups` so the diff shows only the 4 commits this PR adds.

What's in this branch
- `feat(buy-external): KEEP_CLUSTER_ON_FAIL knob + diagnostic snapshot on FAIL` (b749f95): adds `external_snapshot_on_fail()`. On FAIL, snapshots controller logs (current + `--previous`), PR YAML across all namespaces, buyer sidecar `/status` (via `kubectl exec ... python3` against the litellm container, because the buyer is distroless), `cluster-pods.txt`, and recent `cluster-events.txt` to the artifact dir before any teardown.
- `fix(flows): pick freshest of .build/obol vs .workspace/bin/obol` (eb13055): `bootstrap_flow_workspace` in `flows/lib.sh` now stats both paths, picks the larger mtime, and emits a 5-line WARN when they differ by >5 minutes. Removes the silent-stale-binary footgun documented as v1337 attempt 5.
- `docs(skill): document Cloudflare-WAF UA pitfall` (849cd93): entry #10 in `release-smoke-debugging.md` covering HTTP 403 + Cloudflare error 1010 from the default `Python-urllib` UA, the `c2dddc1` buy.py fix, and the (unconfirmed) follow-up about Go's `http.Client` defaults at `purchase.go:183`.
- `docs(plans): retract v1337 controller-gap hypothesis` (82108c3): `plans/inference-v1337-followup-20260514.md`, companion to the original report. Re-run on spark1 showed the controller reconciles in 55s through `Probed → AuthsLoaded → Configured → Ready`. The Go-side probe was NOT WAF-blocked.
- `refactor(buy-external): green-only cleanup gate, drop KEEP_CLUSTER_ON_FAIL knob` (df5fcff): replaces the opt-in env knob with an unconditional rule: cleanup happens iff every step passes. Also inverts the prior success-side default (which left the cluster up on success "so the operator can poke around"; in practice operators re-ran from scratch and the leftover cluster mostly leaked).

Net cleanup behavior
- success: `bob stack down`, clean state for next run
- failure: snapshot + preserve; operator runs `bob stack down` when done

No env knob; the pass/fail exit code is the gate.
Side findings (worth knowing)
The captured controller log surfaces a pre-existing LiteLLM hot-add quirk: LiteLLM's `/model/new` API tries to write back to the ConfigMap volume (read-only by Kubernetes default). The controller catches the 400 and falls back to the ConfigMap-reload path, which works. Not external-seller specific. Worth a one-liner in `paid-flows.md` so it stops surprising the next debugger.

Test plan
- `bash -n flows/buy-external.sh && bash -n flows/lib.sh`: clean
- … `chore/buy-external-followups` against `https://inference.v1337.org/services/aeon`: `Ready=True` after 55s, `observedGeneration: 1`, `paid/qwen3.6-27b` published, `remaining: 1`, `spent: 0`
- … `qwen3.6-27b` ≠ v1337's actual model id); Bob's 0.023 OBOL pre-signed auth was NOT consumed
- … `lib.sh::bootstrap_flow_workspace` touched among shared code; signature preserved, contract unchanged)