Background
ServiceOffer should support paid inference capacity without turning Obol into a Kubernetes scheduler or trying to manage the inference runtime.
For the first implementation, assume the OpenAI-compatible/vLLM-like inference endpoint is outside the Obol stack cluster. Obol has exactly one upstream endpoint to work with and zero control over scaling. This matches host-local vLLM / llama.cpp / DGX Spark style setups where the serving process is already running and the cluster only needs to expose it safely through x402.
This is intentionally the minimal-surface issue. In-stack multi-GPU inference servers, GPU device plugins, KEDA/HPA, RuntimeClasses, and Kubernetes-managed vLLM deployments are separate work. See #430.
Goals
- Add a minimal capacity/admission envelope to ServiceOffer for external inference endpoints.
- Use Kubernetes/Gateway primitives we already have instead of inventing a scheduler.
- Keep the first version compatible with the current Traefik + Gateway API + x402 ForwardAuth architecture.
- Treat GPU count, not generic cloud CPU/replica scaling, as the relevant capacity dimension for inference.
- Avoid any autoscaling semantics for this phase: one external endpoint, static seller-declared capacity.
- Prevent buyers from paying into an obviously saturated service.
Non-goals
- Do not run vLLM/llama.cpp inside the Obol cluster in this issue.
- Do not create or manage Deployment, StatefulSet, HPA, KEDA ScaledObject, RuntimeClass, or GPU device plugin resources here.
- Do not scale replicas based on CPU/memory metrics.
- Do not create one Kubernetes object per request/reservation.
- Do not implement full token metering/per-token settlement here; this is admission/capacity protection around the existing x402 route.
Proposed ServiceOffer shape
Keep this small and seller-facing:
apiVersion: obol.org/v1alpha1
kind: ServiceOffer
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  type: inference
  upstream:
    service: qwen36-external
    namespace: llm
    port: 8000
    healthPath: /health
  capabilities:
    protocol: openai-chat
    streaming: true
    contextWindowTokens: 262144
  requestLimits:
    maxInputTokens: 8192
    maxOutputTokens: 1024
    maxBodyBytes: 10485760
  capacity:
    mode: external-static
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    maxQueueDepth: 0
  payment:
    network: base-sepolia
    asset: usdc   # usdc | obol | 0x...
    payTo: "0x..."
    scheme: exact
    price:
      perRequest: "0.001"
      perMTok: "0.50"
Notes:
- capacity.mode: external-static means Obol does not control scaling.
- gpuCount is capacity metadata and preset input, not a Kubernetes scheduling instruction.
- maxInFlight is the hard admission ceiling enforced at the gateway/proxy layer.
- maxQueueDepth: 0 should be the safe default for paid inference: if saturated, return 429 and do not request payment.
- perBuyerMaxInFlight is desirable, but can be implemented after global maxInFlight if Traefik/header-ordering makes it non-trivial (see the sketch after these notes).
- payment.asset should be explicit because Obol supports both USDC and OBOL-token settlement. Default to usdc for backwards compatibility, and resolve symbolic assets (usdc, obol) to the network-specific token contract/decimals when writing x402 payment requirements.
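If per-buyer limiting does land in this phase, one way to express it with primitives we already use is a second Traefik inFlightReq middleware keyed on a buyer-identity header via sourceCriterion. This is only a sketch: X-Obol-Buyer is a hypothetical header name for whatever buyer identity the x402 layer can surface, and it only works if that header is populated before the limiter runs, which is exactly the ordering question flagged above.

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-inflight-per-buyer
  namespace: llm
spec:
  inFlightReq:
    amount: 4                           # from spec.capacity.perBuyerMaxInFlight
    sourceCriterion:
      requestHeaderName: X-Obol-Buyer   # hypothetical buyer-identity header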
Kubernetes/Gateway primitives to use
Existing resources
Continue generating/using:
- HTTPRoute for /services/<offer>
- Traefik Middleware for x402 ForwardAuth
- x402 pricing ConfigMap route entry
- owner references for garbage collection
- ServiceOffer.status.conditions
Add generated capacity middleware
For Traefik, generate an additional middleware from spec.capacity.maxInFlight:
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-inflight
  namespace: llm
spec:
  inFlightReq:
    amount: 16
Attach it to the same route as the x402 middleware. The desired behavior is:
- If capacity is available: unpaid request reaches the x402 verifier and gets a 402 Payment Required quote.
- If capacity is saturated: request gets 429 Too Many Requests and should not be quoted for payment.
Implementation detail to verify: middleware ordering must avoid asking the buyer to pay when the global in-flight limit is already saturated.
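As a concrete sketch of that ordering, assuming the generated route is a Gateway API HTTPRoute and that Traefik applies ExtensionRef filters as middlewares in the order listed (that ordering guarantee is the detail to verify); the gateway and x402 middleware names below are placeholders:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  parentRefs:
    - name: obol-gateway               # placeholder Gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /services/qwen36-fast
      filters:
        # capacity limit first: saturated requests get 429 before any quote
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-inflight
        # x402 ForwardAuth second: only admitted requests are quoted/verified
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-x402     # placeholder name for the existing x402 middleware
      backendRefs:
        - name: qwen36-external
          port: 8000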
Optional request-shape guard
If easy, enforce maxBodyBytes with gateway/proxy middleware. Token-shape limits (maxInputTokens, maxOutputTokens) may initially be status/discovery metadata unless the x402 verifier/proxy parses OpenAI requests.
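A minimal sketch for the body-size guard, assuming Traefik's buffering middleware is acceptable on this route (it buffers the full request body, which may be undesirable for streaming uploads):

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-bodylimit
  namespace: llm
spec:
  buffering:
    maxRequestBodyBytes: 10485760   # from spec.requestLimits.maxBodyBytes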
Admission behavior
buyer request
  -> Gateway/HTTPRoute
  -> capacity middleware / admission check
       saturated: 429, Retry-After, no payment quote
  -> x402 ForwardAuth
       missing payment: 402 quote
       valid payment: pass
  -> upstream Service pointing at external endpoint
For v1, we can avoid reservations entirely. A reservation layer can come later if we see buyer UX issues between quote and paid retry.
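At the HTTP level, the intended contract looks roughly like this. Illustrative only: the /v1/chat/completions path, the Retry-After value, and the exact 402 body all depend on how the x402 verifier and proxy are wired.

# capacity available, no payment attached: request is quoted
$ curl -i https://<gateway>/services/qwen36-fast/v1/chat/completions -d '{...}'
HTTP/1.1 402 Payment Required
...x402 payment requirements...

# capacity saturated: rejected before any quote is produced
$ curl -i https://<gateway>/services/qwen36-fast/v1/chat/completions -d '{...}'
HTTP/1.1 429 Too Many Requests
Retry-After: <seconds>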
Status fields
Add enough status to make obol sell status and discovery honest:
status:
  capacity:
    mode: external-static
    scalingControlled: false
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    inFlight: 7
    state: Available   # Available | Saturated | Unknown
  endpoint: https://.../services/qwen36-fast
inFlight can come from Traefik/x402 metrics when available; otherwise report Unknown rather than pretending.
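For example, when no in-flight metric is available, a sketch of the honest form drops the counter and reports Unknown:

status:
  capacity:
    mode: external-static
    scalingControlled: false
    gpuCount: 1
    maxInFlight: 16
    state: Unknown   # no Traefik/x402 in-flight metric available
  endpoint: https://.../services/qwen36-fast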
CLI UX
Minimal explicit form:
obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --namespace llm \
  --port 8000 \
  --runtime openai \
  --payment-asset usdc \
  --price 0.001 \
  --gpu-count 1 \
  --max-inflight 16 \
  --per-buyer-max-inflight 4 \
  --max-input-tokens 8192 \
  --max-output-tokens 1024
Optional preset form later:
obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --runtime openai \
  --capacity-preset external-vllm-1gpu-balanced
The preset should expand into static gateway limits only. It should not imply autoscaling.
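As an illustration only (the preset name comes from the example above; the expanded values are placeholders, not a defined preset), external-vllm-1gpu-balanced might expand to nothing more than static fields on the ServiceOffer:

# hypothetical expansion of --capacity-preset external-vllm-1gpu-balanced
capacity:
  mode: external-static
  gpuCount: 1
  maxInFlight: 16
  perBuyerMaxInFlight: 4
  maxQueueDepth: 0
requestLimits:
  maxInputTokens: 8192
  maxOutputTokens: 1024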
Acceptance criteria
- ServiceOffer CRD accepts a minimal external-static capacity envelope.
- The controller generates Traefik inFlightReq middleware for spec.capacity.maxInFlight.
- Saturated requests get 429/Retry-After, not 402.
- Unpaid requests within capacity still get 402 terms.
- obol sell status shows capacity mode, scalingControlled=false, GPU count, max in-flight, and current/unknown in-flight state.
- Existing obol sell http / sell inference behavior keeps working when capacity fields are omitted.
- Sellers can choose the settlement asset (payment.asset / --payment-asset) with backwards-compatible default usdc.
Open questions
- Should gpuCount be required for type: inference capacity presets, or optional metadata only?
- Do we need per-buyer concurrency in v1, or is global max-in-flight enough for the first cut?
- Can Traefik middleware ordering give us 429 before 402 cleanly, or do we need a tiny x402-aware admission shim later?
- Should maxQueueDepth be omitted entirely for v1 and default to no queue?
- Should multi-asset offers use a future payment.accepts[] list, or should one ServiceOffer advertise exactly one settlement asset for minimal v1?
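For context on the last question, a future multi-asset shape could look something like the following. This is purely illustrative and not proposed for v1; the OBOL-denominated price is left as a placeholder.

# hypothetical future shape, not part of this issue
payment:
  network: base-sepolia
  payTo: "0x..."
  scheme: exact
  accepts:
    - asset: usdc
      price:
        perRequest: "0.001"
    - asset: obol
      price:
        perRequest: "..."   # OBOL-denominated price, placeholder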