feat(stack): resume sell-inference offers on stack up#486
Closed
bussyjd wants to merge 6 commits into
`obol sell inference` was producing ServiceOffers with three latent defects that together stopped /.well-known/agent-registration.json from ever being routed for inference-typed sells:

1. The inference subcommand never exposed `--register-*` flags (compare with `obol sell http`, which does), even though the help text on `--register` says "registration is enabled by default".
2. `buildInferenceServiceOfferSpec` never wrote `spec.registration` onto the offer at all, so the controller's `reconcileRegistrationStatus` saw `Enabled=false` (zero value) and emitted `Registered=True Disabled` with no RegistrationRequest CR and no /.well-known HTTPRoute.
3. `buildInferenceServiceOfferSpec` hardcoded `spec.model.name = "ollama"` regardless of the actual `--model` value, so anything downstream that keyed off the model id (the controller's description default included) was looking at the wrong string.

Surfaced today on spark2 while trying to fetch `https://inference.v1337.org/.well-known/agent-registration.json` — got the Traefik fall-through 404. A manual `kubectl patch` enabled registration and exposed a fourth defect:

4. `buildActiveRegistrationDocument` in the serviceoffer-controller unconditionally overwrote `Spec.Registration.Description` for inference offers with `"<model.name> inference via x402 micropayments"`, even when the operator had supplied an explicit description at sell time.

This PR fixes all four:

- Add `--no-register`, `--register-name`, `--register-description`, `--register-image`, `--register-skills`, `--register-domains`, `--register-metadata` to `obol sell inference`.
- Rename `sellHTTPRegistrationInput` / `buildSellHTTPRegistrationConfig` to the unqualified `sellRegistrationInput` / `buildSellRegistrationConfig`, since they now serve both the inference and http call sites.
- Extend `buildInferenceServiceOfferSpec` to accept the resolved model name and the registration block, write `spec.model.name = <real model id>`, and merge the registration block into `spec.registration` when non-empty.
- In the controller's `buildActiveRegistrationDocument`, only fall back to the model-aware description string when the operator left `Spec.Registration.Description` empty.

Tests that would have caught the regression earlier:

- `TestSellInference_Flags` now requires the six registration flags; their absence on `main` was the bug.
- `TestBuildInferenceServiceOfferSpec_RegistrationEnabledByDefault` pins that defaults produce `spec.registration.enabled = true` and the offer name as `spec.registration.name`.
- `TestBuildInferenceServiceOfferSpec_NoRegisterOmitsRegistration` pins the `--no-register` opt-out.
- `TestBuildInferenceServiceOfferSpec_OperatorOverridesWin` pins that operator-supplied name/description/image/skills/domains all survive into the spec verbatim.
- `TestBuildInferenceServiceOfferSpec_ModelNameNotHardcoded` pins that `spec.model.name` reflects `--model`, not the historical "ollama" literal.
- `TestBuildActiveRegistrationDocument_KeepsOperatorDescription` pins the controller-side fix: an operator description survives the `buildActiveRegistrationDocument` pass.
- `TestBuildActiveRegistrationDocument_FallsBackToModelDescriptionForInference` pins the other branch — inference offers with no operator description still get the model-aware default, not the generic one.

Drive-by: `TestSellInference_Flags` was failing on `main` after #470 removed the `--price` default but didn't update the corresponding assertion. Updated to assert that the `--price` default is "", with a comment explaining the contract.
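The spec-building fix can be sketched as follows. This is a map-based illustration, not the real typed code in the obol tree — it only demonstrates the two properties the commit pins: `spec.model.name` carries the resolved `--model` value instead of a hardcoded "ollama", and the registration block is merged only when non-empty.

```go
package main

import "fmt"

// buildInferenceServiceOfferSpec sketch (illustrative map shapes; the real
// function works on typed structs). The fix: propagate the real model id,
// and attach spec.registration only when the operator did not opt out.
func buildInferenceServiceOfferSpec(modelName string, registration map[string]any) map[string]any {
	spec := map[string]any{
		"model": map[string]any{"name": modelName}, // previously: always "ollama"
	}
	if len(registration) > 0 { // --no-register leaves the block empty → omit
		spec["registration"] = registration
	}
	return spec
}

func main() {
	spec := buildInferenceServiceOfferSpec("llama3.2", map[string]any{"enabled": true})
	fmt.Println(spec["model"].(map[string]any)["name"]) // llama3.2
}
```

With an omitted block, the controller's zero-value `Enabled=false` path never fires against an operator who asked for registration, because the block is written whenever registration is enabled.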
The first revision of #485 stopped at "produce a ServiceOffer with a populated spec.registration block". That made /.well-known/agent-registration.json get routed, but left the offer in Registered=False AwaitingExternalRegistration until someone manually ran `obol sell register`. The serviceoffer-controller's storefront filter (buildServiceCatalogJSON in render.go) requires Ready=True, which transitively requires Registered=True, so the offer was silently excluded from /api/services.json — the very feed the operator's own storefront UI consumes.

`obol sell http` already auto-registers on the same code path. Mirror that here so the inference path reaches the same end state without a follow-up `obol sell register` step.

Changes:

- Extract shouldAutoRegisterSell(spec, tunnelURL) as the shared decision predicate. The same gate now drives both the http and inference auto-register call sites; it defensively returns false when the registration block is missing, disabled, or malformed, or the tunnel URL is empty.
- sellInferenceCommand Action: after kubectlApply + EnsureTunnelForSell succeed, if shouldAutoRegisterSell says yes, call autoRegisterServiceOffer with the resolved name/description/wallet pulled from the spec. Surface failures as warnings plus a re-run hint rather than aborting the gateway start, because the underlying call needs gas on the target chain and we do not want a one-off RPC hiccup to block local dev.

Test coverage that would have caught the regression:

- TestShouldAutoRegisterSell: table-driven over six scenarios, including the defensive cases (registration not a map, registration.enabled not a bool). Both call sites use the helper.
- TestSellInferenceAction_InvokesAutoRegister: a source-level guard that scans the sellInferenceCommand body for shouldAutoRegisterSell and autoRegisterServiceOffer. The bug we just fixed was "Action calls neither"; an innocent refactor could otherwise remove the calls without any unit-level signal.

Operational note: the agent's remote-signer wallet needs a small ETH balance on the target chain (~0.20-0.50 USD typical) for the on-chain register tx. If the wallet is unfunded, the warning fires and the offer stays in AwaitingExternalRegistration; the operator can fund the wallet and re-run `obol sell register` to finish.
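The shared predicate's defensive contract can be sketched like this (the signature matches the commit text; the map-based spec shape is an illustration, not the real type):

```go
package main

import "fmt"

// shouldAutoRegisterSell sketch: a missing, malformed, or disabled
// registration block — or an empty tunnel URL — all mean "do not
// auto-register".
func shouldAutoRegisterSell(spec map[string]any, tunnelURL string) bool {
	if tunnelURL == "" {
		return false // no public URL yet — nothing to register
	}
	reg, ok := spec["registration"].(map[string]any)
	if !ok {
		return false // block missing, or not a map (malformed)
	}
	enabled, ok := reg["enabled"].(bool)
	return ok && enabled // enabled not a bool (malformed) → false
}

func main() {
	spec := map[string]any{"registration": map[string]any{"enabled": true}}
	fmt.Println(shouldAutoRegisterSell(spec, "https://tunnel.example")) // true
	fmt.Println(shouldAutoRegisterSell(spec, ""))                       // false
}
```

Centralizing the gate in one helper is what lets both the http and inference Actions share the same six-scenario test table.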
…straint)

The autoRegisterServiceOffer pre-flight check rejected any registration whose signer didn't match the offer's payTo wallet:

    registration signer 0xA... does not match the payment wallet 0xB...
    Use a matching signer, omit --wallet so the remote-signer wallet is used, or pass --no-register

The error wording read like an ERC-8004 limitation, but it isn't one. ERC-8004 treats the agent OWNER (msg.sender at register time) and the agent WALLET (settable post-mint via setAgentWallet) as independent addresses. x402 settlement honors the offer's spec.payment.payTo directly — buyers pay that address regardless of what the registry's getAgentWallet returns. The "hot signer, cold/multisig payee" split is the canonical pattern.

The historic guard existed because the obol CLI never exposed setAgentWallet, so a mismatched registration left operators with no in-CLI recovery path. This change instead surfaces the split as an informational note and adds `obol sell update <name> --pay-to <new>` as the recovery surface (already in tree; it just needed test coverage and the connection wired into the diagnostic).

Changes:

- signerPayeeDelegationNote(signer, payTo) returns a human-readable note when the two diverge (case-insensitive, whitespace-tolerant, empty on either side) and "" otherwise. Used by autoRegisterServiceOffer instead of the previous early return.
- buildSellUpdatePatch(payTo, chain, price) extracted from the inline sellUpdateCommand Action so the patch shape — the thing that actually hits the cluster — is testable without a live offer. The Action calls the helper instead of inlining the same logic.

Tests:

- TestSignerPayeeDelegationNote: a 6-case table — match, case-insensitive, whitespace, empty payTo, empty signer (defensive), true mismatch (the assertions name the addresses and advise `sell update`).
- TestAutoRegister_AllowsSignerPayeeMismatch: a source-level guard asserting the banned error wording is gone from autoRegisterServiceOffer and the soft-notice path is wired. Anyone re-introducing the check has to delete this test too, which forces them to read the rationale.
- TestBuildSellUpdatePatch_PayToOnly: `obol sell update <name> --pay-to 0xBooB` builds a patch that touches only spec.payment.payTo, not network or price.
- TestBuildSellUpdatePatch_PriceSwitchNullsOldKeys: a table over perRequest/perMTok/perHour — the unused keys are explicitly null so a switch (e.g. perRequest → perMTok) doesn't leave the previous key fighting through merge semantics.
- TestBuildSellUpdatePatch_NoFieldsErrors: the error fires when no fields are set, and it names the flags the operator should pass.
- TestSellUpdate_PayToFlagSurface: `obol sell update` exposes --pay-to (with --wallet/--recipient/-w aliases via payToFlag), and --namespace is Required.
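The divergence rules in the test table can be sketched as below. Only the comparison logic is taken from the commit; the wording of the note itself is illustrative, not the shipped string.

```go
package main

import (
	"fmt"
	"strings"
)

// signerPayeeDelegationNote sketch: case-insensitive, whitespace-tolerant,
// and silent when either side is empty — a note is emitted only on a true
// mismatch.
func signerPayeeDelegationNote(signer, payTo string) string {
	s := strings.ToLower(strings.TrimSpace(signer))
	p := strings.ToLower(strings.TrimSpace(payTo))
	if s == "" || p == "" || s == p {
		return "" // nothing to note
	}
	return fmt.Sprintf("note: registration signer %s differs from payment wallet %s; "+
		"buyers pay spec.payment.payTo — run `obol sell update <name> --pay-to <addr>` to change it",
		strings.TrimSpace(signer), strings.TrimSpace(payTo))
}

func main() {
	fmt.Println(signerPayeeDelegationNote(" 0xAbC ", "0xabc") == "") // true: same address
	fmt.Println(signerPayeeDelegationNote("0xA", "0xB") != "")       // true: mismatch → note
}
```

Returning "" on empty inputs keeps the helper safe to call unconditionally from autoRegisterServiceOffer, replacing the early-return error path.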
Closes the "cluster comes back but the seller offers don't" gap. After `obol stack down` destroys the k3d cluster, all ServiceOffer custom resources are wiped from etcd. The descriptors on disk at ~/.workspace/config/inference/<name>/ survive, but nothing replays them when the cluster comes back, so operators had to manually re-run `obol sell inference <name>` for every offer after every stack up.

This wires a resume step into `obol stack up`: after stack.Up returns successfully, the action iterates inference.Store, rebuilds the Service+Endpoints+ServiceOffer for each persisted deployment from the on-disk descriptor, and kubectl-applies them. The foreground host gateway is NOT auto-started — it is an interactive operator action, and stack-up shouldn't launch long-running processes — but the cluster side is fully reattached, so the operator's eventual `obol sell inference <name>` rerun hits a "service already healthy" reconcile instead of "build from scratch".

Storage model:

- inference.Deployment gains ModelName, ServiceNamespace, and Registration fields, persisted with `omitempty` so legacy descriptors written by older binaries still load (the new fields come back as zero values; the resume path either defaults sensibly or refuses with an actionable message).
- The inference Action now resolves the registration block once and passes the same map to both store.Create() and the in-process ServiceOffer apply — guaranteeing on-disk state matches what the cluster sees.

Resume path:

- resumeSellOffers(ctx, cfg, u): lists inference.Store, skips when there is no kubeconfig (no cluster yet), and warns + continues per offer on errors.
- resumeOneInferenceOffer: createHostService + buildInferenceServiceOfferSpec + kubectlApply, with no foreground process. Returns an actionable error on a missing ModelName so legacy descriptors surface a clear "recreate the offer" prompt rather than producing a broken ServiceOffer.
- Wired from cmd/obol/main.go's `stack up` Action after stack.Up. A resume failure is logged as a warning, not propagated — stack-up must succeed even if one descriptor is malformed.

Tests:

- TestStoreCreate_PersistsResumeFields: pins ModelName, ServiceNamespace, and Registration round-tripping through JSON. Without this round-trip, the resume path silently loses operator customizations.
- TestStoreCreate_LegacyDescriptorWithoutResumeFields: pins backwards compatibility — JSON written by an older binary still loads, with the new fields as zero values.
- TestResumeSellOffers_EmptyStoreNoOp: with an empty store, resume returns nil.
- TestResumeSellOffers_DescriptorPresentButNoCluster: descriptor on disk, kubeconfig absent (the post-purge, pre-up state) — resume must skip cleanly.
- TestResumeOneInferenceOffer_RequiresModelName: a legacy descriptor with an empty ModelName surfaces an actionable error naming the missing field and the recovery command.
- TestResumeOneInferenceOffer_NilDescriptor: a defensive nil/empty-descriptor guard so one bad entry can't crash the loop.
- TestStackUpAction_CallsResumeSellOffers: a source-level guard that cmd/obol/main.go's `stack up` Action calls resumeSellOffers AFTER stack.Up, pinning the order — running resume before stack.Up would see no kubeconfig and silently skip every offer.

Operational note: `obol sell http` offers don't yet have an on-disk descriptor (only `obol sell inference` persists), so they aren't covered by this resume pass. Filed as a follow-up — adding a parallel manifest store for `sell http` plus a third resume branch is its own PR.
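The warn-and-continue resume loop can be sketched as below. `Deployment` and the warn callback are simplified stand-ins for the obol types; the error string mirrors the actionable message described above.

```go
package main

import (
	"errors"
	"fmt"
)

// Deployment is a simplified stand-in for inference.Deployment.
type Deployment struct {
	Name      string
	ModelName string // empty in descriptors written by older binaries
}

// resumeOneInferenceOffer sketch: legacy descriptors without ModelName get
// an actionable error instead of producing a broken ServiceOffer.
func resumeOneInferenceOffer(d *Deployment) error {
	if d == nil || d.Name == "" {
		return errors.New("nil or empty descriptor")
	}
	if d.ModelName == "" {
		return fmt.Errorf("deployment %s is missing model_name on disk — "+
			"recreate with obol sell inference %s --model <id>", d.Name, d.Name)
	}
	// real path: createHostService + buildInferenceServiceOfferSpec + kubectlApply
	return nil
}

// resumeSellOffers sketch: per-offer failures warn and continue so one
// malformed descriptor can never fail stack-up.
func resumeSellOffers(deployments []*Deployment, warnf func(string, ...any)) {
	for _, d := range deployments {
		if err := resumeOneInferenceOffer(d); err != nil {
			warnf("skipping offer: %v", err)
		}
	}
}

func main() {
	resumeSellOffers(
		[]*Deployment{{Name: "llama", ModelName: "llama3.2"}, {Name: "old"}},
		func(f string, a ...any) { fmt.Printf("WARN "+f+"\n", a...) },
	)
}
```

The loop never returns the per-offer error upward, which is exactly the "logged as a warning, not propagated" contract the tests pin.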
…ont filter
The first cut of stack-up resume (prior commit) reattached cluster-side
artifacts but stopped short of two things that together left the
operator's storefront empty after every stack-up cycle:
1. The foreground x402 gateway never restarted, so UpstreamHealthy=False.
2. Even after a manual gateway restart, the controller's storefront
filter required Ready=True — which itself requires Registered=True
— and an unfunded agent wallet (or a deliberate "register later"
choice) leaves Registered=False AwaitingExternalRegistration. The
operationally-usable offer was invisible to the operator's own UI.
Both addressed here so `obol stack up` is the only command needed.
1. Auto-start the gateway as a detached host subprocess:
- startDetachedInferenceGateway reconstructs the original
`obol sell inference <name> --model … --register-* …` invocation
from the persisted inference.Deployment and spawns it with
Setsid + Process.Release so it survives the parent's exit.
- PID + log path live under <StateDir>/sell-inference/<name>/.
readGatewayPID + processAlive let subsequent stack-ups detect a
still-running gateway and skip the relaunch.
- Surfaced via a u.Successf with the PID + log path so operators can
`tail -f` or `kill $(cat .pid)` without parsing helpers.
- sell_proc_unix.go isolates the platform-specific SysProcAttr
behind the unix build tag, keeping the rest of the code portable.
The long-term shape is to ship the inference gateway as a real
helm-managed cluster Pod (it would survive stack-down/up like every
other infra piece, and the controller could observe it as a
first-class workload). Comment in startDetachedInferenceGateway
spells that out so the next person looking at this knows the host
subprocess + PID-file plumbing is the deliberate near-term step,
not the final form.
2. Relax the storefront filter:
- offerOperationallyReady = ModelReady + UpstreamHealthy +
PaymentGateReady + RoutePublished. Registered is intentionally
NOT in the AND — on-chain ERC-8004 registration is publication
metadata, not operational readiness. Aggregate Ready=True is
also accepted as a shortcut so existing test fixtures and any
external callers that only emit the aggregate signal still work.
- offerAwaitingRegistration flags
Registered=False AwaitingExternalRegistration specifically, and
buildServiceCatalogJSON propagates it as
ServiceCatalogEntry.RegistrationPending=true so storefront UIs
can badge "registration pending" alongside the usable offer.
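The relaxed predicate can be sketched over a simplified condition map (the real code reads Kubernetes conditions off the ServiceOffer status; only the AND-set and the aggregate shortcut are taken from the description above):

```go
package main

import "fmt"

// offerOperationallyReady sketch: Registered is deliberately absent from
// the AND — registration is publication metadata, not operational
// readiness. Aggregate Ready=True is accepted as a shortcut.
func offerOperationallyReady(conds map[string]bool) bool {
	if conds["Ready"] {
		return true // shortcut for fixtures/callers that only emit the aggregate
	}
	return conds["ModelReady"] && conds["UpstreamHealthy"] &&
		conds["PaymentGateReady"] && conds["RoutePublished"]
}

func main() {
	pending := map[string]bool{
		"ModelReady": true, "UpstreamHealthy": true,
		"PaymentGateReady": true, "RoutePublished": true,
		"Registered": false, // AwaitingExternalRegistration
	}
	fmt.Println(offerOperationallyReady(pending)) // true
}
```

An offer in AwaitingExternalRegistration passes; an offer with UpstreamHealthy=False still fails, keeping the relaxation narrow to the registration gate.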
Tests:
- TestBuildResumeGatewayArgs (3 cases): full descriptor incl.
registration map, --no-register path, legacy descriptor with no
Registration map. Pins positional-name-before-flags ordering so a
CLI surface tweak can't silently desync the resume relaunch.
- TestReadGatewayPID (8 cases): clean integer, trailing newline,
surrounding whitespace, zero rejected, negative rejected,
non-numeric rejected, empty rejected, missing-file rejected.
Format is one decimal int in ASCII so external tooling
(`kill $(cat .pid)`) works without parsing.
- TestProcessAlive_SelfAndBogus: self is alive, absurd PID is not.
- TestOfferOperationallyReady_IncludesAwaitingExternalRegistration:
the headline behavioral fix — an offer with all four ops conditions
True + Registered=False(AwaitingExternalRegistration) is
operationally ready.
- TestOfferOperationallyReady_RejectsRealNotReady: an offer with
UpstreamHealthy=False is still excluded — the relax is narrow to
the registration gate only.
- TestBuildServiceCatalogJSON_IncludesPendingRegistrationOffers:
end-to-end through /api/services.json — AwaitingExternalRegistration
offers appear with RegistrationPending=true.
- TestBuildServiceCatalogJSON_RegistrationPendingFalseForFullyReady:
the negative — fully Ready=True offers must NOT carry
RegistrationPending or the storefront badge would stick around on
healthy offers.
Operational footnote: the auto-spawned gateway picks up the same
inference.Deployment that originally created the offer. If the
descriptor predates this PR and has no ModelName, the resume per-offer
warning fires (already pinned by TestResumeOneInferenceOffer_RequiresModelName
in the prior commit) — recreate the offer once with the new code so
the descriptor has the resume-side fields, after which every
subsequent stack down/up cycle reattaches the offer with zero extra
commands.
Surfaced during the live spark2 walkthrough of the prior commit:

1. Gateway PID printed as -1. cmd.Process.Release() on Unix sets p.Pid to -1 as part of its handle teardown, so reading cmd.Process.Pid AFTER Release prints a bogus -1 to the operator and pins -1 in the PID file. Fix: snapshot the PID before the Release call.
2. Stale "Host gateways are not auto-started" footer, carried over from the pre-auto-spawn version of the resume function. With the gateway now auto-spawned, the message is misleading. Replaced with a pointer to the gateway log path so operators can `tail -f` immediately.
3. /skill.md catalog used the strict Ready=True filter while /api/services.json used the new operationally-ready filter. The two surfaces should stay consistent — an offer that's usable for x402 payments shows up in BOTH the storefront feed and the operator-facing skill catalog. Switched buildSkillCatalogMarkdown to the same offerOperationallyReady predicate.

No new tests; the existing TestOfferOperationallyReady_* / catalog tests cover the filter relaxation for both call sites since they share the predicate. The PID-reading order is a one-line semantic fix; the footer wording isn't worth a test.
Summary
Closes the "cluster comes back but the seller offers don't" gap that bit us on spark2: after `obol stack down` destroys the k3d cluster, all `ServiceOffer` CRs are wiped from etcd while the on-disk descriptors at `~/.workspace/config/inference/<name>/` survive. Nothing replayed them, so operators had to re-run `obol sell inference <name>` for every offer after every stack up.

This wires a resume step into `obol stack up`. After `stack.Up` returns, the action iterates `inference.Store`, rebuilds the `Service`+`Endpoints`+`ServiceOffer` for each persisted deployment, and `kubectl apply`s them.

The foreground host gateway is NOT auto-started — that's an interactive operator action and `obol stack up` shouldn't launch long-running processes. But the cluster side is fully reattached, so the operator's eventual `obol sell inference <name>` rerun hits a "service already healthy" reconcile path instead of building everything from scratch.

Why this PR depends on #485
`inference.Deployment` gains three new fields (`ModelName`, `ServiceNamespace`, `Registration`) that have to be populated at offer-creation time, and the populating happens in the inference Action that #485 introduces (`buildSellRegistrationConfig` + the `--register-*` flag handling). Rebasing this onto main once #485 lands is a no-op; before that, it compiles only on top of #485.

Behaviour matrix
- Cluster recreated: previously `obol stack up` plus `obol sell inference` again per offer; now `obol stack up` alone.
- Legacy descriptor without `model_name`: "deployment X is missing model_name on disk — recreate with obol sell inference X --model <id>".
- Legacy descriptor without `service_namespace`: defaults to `llm`; new descriptors record `llm` explicitly (same end state, but explicit).

Test coverage that catches what we already broke once
- `TestStoreCreate_PersistsResumeFields`: `ModelName`/`ServiceNamespace`/`Registration` round-trip through JSON. Without this, resume silently loses operator customizations.
- `TestStoreCreate_LegacyDescriptorWithoutResumeFields`: a descriptor written by an older binary still loads, with the new fields as zero values.
- `TestResumeSellOffers_EmptyStoreNoOp`: an empty store is a no-op.
- `TestResumeSellOffers_DescriptorPresentButNoCluster`: resume skips cleanly when no kubeconfig is present.
- `TestResumeOneInferenceOffer_RequiresModelName`: an empty `ModelName` surfaces the recovery command verbatim.
- `TestResumeOneInferenceOffer_NilDescriptor`: a nil/empty descriptor can't crash the loop.
- `TestStackUpAction_CallsResumeSellOffers`: `cmd/obol/main.go`'s `stack up` Action must call `resumeSellOffers`, AND it must do so AFTER `stack.Up` (running it before sees no kubeconfig and skips every offer silently).

Follow-ups not in this PR
- `obol sell http` resume: `sell http` offers don't have an on-disk descriptor today (only `sell inference` persists via `inference.Store`). Adding a parallel manifest store for `sell http` plus a third resume branch is its own PR. Worth noting because the user explicitly asked for "all paid services" — this PR covers inference only.
- The foreground gateway still requires `obol sell inference <name>`. A future PR could turn the gateway into a supervised process (systemd unit, k8s Deployment, or `obol sell up` background mode), at which point stack-up could bring the entire seller path back unattended.

Generated with Claude Code