Context
`obol stack up` today has three different lifecycles bolted together. Infrastructure and the Hermes agent come up declaratively via helmfile, but seller offers (`obol sell http`, `obol sell inference`) are imperative — created by direct `kubectl apply`/CR writes from the CLI, and re-hydrated on stack-up by a bespoke `resumeSellOffers` function (see PR #487).

This issue captures the longer-term direction: bring seller offers under the same declarative helmfile pass that already manages infrastructure and Hermes, so `stack up` is one mechanism applied uniformly.
Today (overcomplicated)

```
                  obol stack up
                        │
    ┌───────────────────┼───────────────────┐
    │                   │                   │
    ▼                   ▼                   ▼
Infrastructure    Agent (Hermes)      Seller offers
(Traefik, eRPC,  via hermes.Sync()   (sell inference,
 LiteLLM,                             sell http)
 x402-verifier,                           │
 serviceoffer-                            ▼
 controller,                        NOT MANAGED BY
 cloudflared, ...)                      STACK UP
    │                   │                 │
    │                   │        ┌────────┴────────┐
    ▼                   ▼        ▼                 ▼
 helmfile           helmfile  sell inference   sell http
(declarative)    (declarative)    ↓                ↓
                            descriptor on    YAML manifest
                            disk (JSON)      on disk (PR #487)
                                  │                │
                                  ▼                ▼
                            host gateway     in-cluster only
                            foreground process
                            (#487 spawns detached;
                             TODO: helm chart)

     ──── resume gap filled with bespoke code ────
          (cmd/obol/sell.go::resumeSellOffers)
```
Three problems with this shape
**Foreground host process asymmetry.** Only `sell inference` runs a host-side gateway. That's why resume has to fork-and-detach for inference but not for http, and why `startDetachedInferenceGateway` + PID files exist at all.

**Resume function exists only because seller offers aren't first-class infra.** Every recovery scenario (`stack down`/`stack up`, `stack purge`, host reboot, agent reset) has to re-implement the recovery path. The controller and Hermes don't need any of that — they come back the same way infrastructure comes back.
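To make the asymmetry concrete, here is a minimal Go sketch of the shape any bespoke resume path is forced into; the `Offer` struct, the offer names, and the action strings are illustrative, not the actual `resumeSellOffers` code.

```go
package main

import "fmt"

// Offer is a hypothetical on-disk seller-offer descriptor.
type Offer struct {
	Kind string // "inference" or "http"
	Name string
}

// resumeAction mirrors the asymmetry described above: inference offers
// need a host-side gateway process re-spawned in addition to their CR,
// while http offers only need a manifest re-applied in-cluster.
func resumeAction(o Offer) string {
	switch o.Kind {
	case "inference":
		return "spawn detached gateway, then apply CR"
	case "http":
		return "apply manifest"
	default:
		return "unknown"
	}
}

func main() {
	for _, o := range []Offer{{"inference", "llama3"}, {"http", "my-api"}} {
		fmt.Printf("%s/%s -> %s\n", o.Kind, o.Name, resumeAction(o))
	}
}
```

Every recovery scenario has to branch like this; nothing in the infrastructure path ever does.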
Proposed end state

```
              obol stack up
                    │
    ┌───────────────┼───────────────┐
    │               │               │
    ▼               ▼               ▼
Infrastructure    Agent       Seller offers
    │               │               │
    └───────────────┴───────────────┘
                    │
                    ▼
            SAME MECHANISM:
          a helmfile pass over
         declarative sources of
              truth on disk
                    │
     ┌──────────────┴───────────────┐
     ▼                              ▼
infra/*.yaml             applications/
(already exists)         ├── hermes/<id>/            (exists)
                         ├── sell-http/<name>/       (new)
                         └── sell-inference/<name>/  (new)
                                        │
                                        ▼
                                helmfile.yaml +
                                values-*.yaml per offer
```
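The bottom of the diagram can be made concrete with a sketch of what the helmfile releases might look like. Everything here — chart paths, release naming, namespace — is an assumption about the eventual layout, not the current code:

```yaml
# helmfile.yaml (sketch; names, charts and namespace are illustrative)
releases:
  - name: hermes-agent-1
    namespace: obol
    chart: ./charts/hermes
    values:
      - applications/hermes/agent-1/values.yaml
  - name: sell-http-my-api
    namespace: obol
    chart: ./charts/sell-http
    values:
      - applications/sell-http/my-api/values.yaml
  - name: sell-inference-llama3
    namespace: obol
    chart: ./charts/sell-inference
    values:
      - applications/sell-inference/llama3/values.yaml
```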
In this shape:

- `obol sell http` / `obol sell inference` become "edit the descriptor on disk + helmfile sync the slice". No imperative `kubectl apply`, no foreground process spawn.
- `obol stack down`/`up` is "helmfile destroy/sync the whole tree" — agents and offers come back the same way the controller comes back.
- The inference gateway becomes an in-cluster Deployment, so the host-side foreground process disappears entirely.
- `resumeSellOffers`, `startDetachedInferenceGateway`, PID files, gateway logs — none of those need to exist. They are scaffolding around the asymmetry.
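Under this model, "the descriptor on disk" for an offer could be nothing more than a values file in its slice. The field names below are hypothetical, sketched for a sell-http offer:

```yaml
# applications/sell-http/my-api/values.yaml (hypothetical fields)
offer:
  name: my-api
  upstream: http://my-service.default.svc.cluster.local:8080
  pricePerRequest: "0.0001"
```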
Migration path

1. Build the inference gateway as a Pod image.
   - Replaces `startDetachedInferenceGateway` and the PID-file plumbing.
   - The host-side subprocess is the only blocker to symmetric lifecycle.
2. Move sell-inference / sell-http to helmfile-managed slices under `applications/sell-{http,inference}/<name>/`.
   - Replaces `inference.Store` and the sell-http YAML manifest store with a single declarative format.
   - One walker, one parser, one test surface.
3. `obol stack up` becomes a single helmfile pass over everything.
   - Replaces `resumeSellOffers` entirely.
   - Recovery becomes "whatever's on disk is what's running".

Where PR #487 fits

PR #487 is the deliberate near-term step:

- Necessary today because seller offers are still imperative — without it, `stack up` doesn't bring back paid services and the spark2 dev cluster needs manual replay after every restart.
- Ships now, unblocks spark2 and any other persistent dev cluster.
- Superseded once the declarative model lands. The `startDetachedInferenceGateway` comment already points at the helm-chart future, and `resumeSellOffers` is explicitly scaffolding.
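The "one walker, one parser" in step 2 could look like the following Go sketch, which maps descriptor directories under the proposed `applications/` tree to helmfile release names. The naming scheme is an assumption for illustration, not the shipped layout:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// releaseFor maps a descriptor directory under applications/ to a
// helmfile release name, e.g. applications/sell-http/my-api ->
// sell-http-my-api. The convention is hypothetical.
func releaseFor(dir string) (string, bool) {
	rest, ok := strings.CutPrefix(path.Clean(dir), "applications/")
	if !ok {
		return "", false // infra/*.yaml etc. are handled elsewhere
	}
	parts := strings.Split(rest, "/") // e.g. ["sell-http", "my-api"]
	if len(parts) != 2 {
		return "", false
	}
	return parts[0] + "-" + parts[1], true
}

func main() {
	for _, d := range []string{
		"applications/hermes/agent-1",
		"applications/sell-http/my-api",
		"applications/sell-inference/llama3",
	} {
		if name, ok := releaseFor(d); ok {
			fmt.Println(d, "->", name)
		}
	}
}
```

One function like this, driven by a directory walk, is the whole resume story: the same pass covers Hermes agents and both offer kinds.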
Acceptance criteria for closing this issue

- Inference gateway runs as an in-cluster Deployment built from a Pod image, not a host subprocess.
- Sell-http and sell-inference offers are represented on disk as helmfile-managed slices under `applications/`.
- `obol stack up` requires no resume-specific code path for seller offers — the same helmfile pass that brings up infra and Hermes also brings up offers.
- `resumeSellOffers`, `startDetachedInferenceGateway`, the PID file plumbing, and the two persistence stores are deleted.
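For the first criterion, the in-cluster replacement for the host-side gateway is an ordinary Deployment rendered by the offer's chart. Name, labels, image, and port below are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sell-inference-gateway   # placeholder name
  namespace: obol
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sell-inference-gateway
  template:
    metadata:
      labels:
        app: sell-inference-gateway
    spec:
      containers:
        - name: gateway
          image: obol/inference-gateway:dev   # placeholder image
          ports:
            - containerPort: 8080
```

Once the gateway lives here, `stack up` recovers it the same way it recovers every other workload, and the PID-file plumbing has nothing left to track.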
Out of scope

- Buyer-side resume. Buyer state is already cluster-resident (`PurchaseRequest` CRs + sidecar config).
- Migration of existing on-disk descriptors written by older CLIs — a one-shot importer can live alongside the new layout if needed.