feat: minimal external inference capacity envelope for ServiceOffer #439

@bussyjd

Description

Background

ServiceOffer should support paid inference capacity without turning Obol into a Kubernetes scheduler or trying to manage the inference runtime.

For the first implementation, assume the OpenAI-compatible/vLLM-like inference endpoint is outside the Obol stack cluster. Obol has exactly one upstream endpoint to work with and zero control over scaling. This matches host-local vLLM / llama.cpp / DGX Spark style setups where the serving process is already running and the cluster only needs to expose it safely through x402.

This is intentionally the minimal-surface issue. In-stack multi-GPU inference servers, GPU device plugins, KEDA/HPA, RuntimeClasses, and Kubernetes-managed vLLM deployments are separate work. See #430.

Goals

  • Add a minimal capacity/admission envelope to ServiceOffer for external inference endpoints.
  • Use Kubernetes/Gateway primitives we already have instead of inventing a scheduler.
  • Keep the first version compatible with the current Traefik + Gateway API + x402 ForwardAuth architecture.
  • Treat GPU count as the relevant capacity dimension for inference, not generic cloud CPU scaling.
  • Avoid any autoscaling semantics for this phase: one external endpoint, static seller-declared capacity.
  • Prevent buyers from paying into an obviously saturated service.

Non-goals

  • Do not run vLLM/llama.cpp inside the Obol cluster in this issue.
  • Do not create or manage Deployment, StatefulSet, HPA, KEDA ScaledObject, RuntimeClass, or GPU device plugin resources here.
  • Do not scale replicas based on CPU/memory metrics.
  • Do not create one Kubernetes object per request/reservation.
  • Do not implement full token metering/per-token settlement here; this is admission/capacity protection around the existing x402 route.

Proposed ServiceOffer shape

Keep this small and seller-facing:

apiVersion: obol.org/v1alpha1
kind: ServiceOffer
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  type: inference

  upstream:
    service: qwen36-external
    namespace: llm
    port: 8000
    healthPath: /health

  capabilities:
    protocol: openai-chat
    streaming: true
    contextWindowTokens: 262144

  requestLimits:
    maxInputTokens: 8192
    maxOutputTokens: 1024
    maxBodyBytes: 10485760

  capacity:
    mode: external-static
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    maxQueueDepth: 0

  payment:
    network: base-sepolia
    asset: usdc # usdc | obol | 0x...
    payTo: "0x..."
    scheme: exact
    price:
      perRequest: "0.001"
      perMTok: "0.50"

Notes:

  • capacity.mode: external-static means Obol does not control scaling.
  • gpuCount is capacity metadata and preset input, not a Kubernetes scheduling instruction.
  • maxInFlight is the actual hard admission ceiling enforced at the gateway/proxy layer.
  • maxQueueDepth: 0 should be the safe default for paid inference: if saturated, return 429 and do not request payment.
  • perBuyerMaxInFlight is desirable, but can land after global maxInFlight if keying Traefik's in-flight limit on a buyer-identifying header (e.g. inFlightReq.sourceCriterion) proves non-trivial.
  • payment.asset should be explicit because Obol supports both USDC and OBOL-token settlement. Default to usdc for backwards compatibility, and resolve symbolic assets (usdc, obol) to the network-specific token contract/decimals when writing x402 payment requirements.
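To make the symbolic-asset resolution concrete, here is a hypothetical sketch of what a resolved route entry in the x402 pricing ConfigMap could look like. The field names loosely follow x402 payment-requirement terms; the exact ConfigMap schema is whatever the reconciler already emits, and the addresses are placeholders, not real contracts:

```yaml
# Hypothetical resolved entry: the seller wrote asset: usdc, the reconciler
# resolved it to the network-specific token contract and decimals.
routes:
  /services/qwen36-fast:
    scheme: exact
    network: base-sepolia
    payTo: "0x..."
    asset: "0x..."            # resolved ERC-20 contract for USDC on base-sepolia
    assetSymbol: usdc         # preserved so buyers see the settlement asset
    maxAmountRequired: "1000" # perRequest 0.001 USDC at 6 decimals
```

The key point is that both the resolved contract address and the original symbol survive into the payment terms, so buyers can tell whether they are settling in USDC or OBOL.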

Kubernetes/Gateway primitives to use

Existing resources

Continue generating/using:

  • HTTPRoute for /services/<offer>
  • Traefik Middleware for x402 ForwardAuth
  • x402 pricing ConfigMap route entry
  • owner references for garbage collection
  • ServiceOffer.status.conditions

Add generated capacity middleware

For Traefik, generate an additional middleware from spec.capacity.maxInFlight:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-inflight
  namespace: llm
spec:
  inFlightReq:
    amount: 16

Attach it to the same route as the x402 middleware. The desired behavior is:

  • If capacity is available: unpaid request reaches x402 verifier and gets a 402 Payment Required quote.
  • If capacity is saturated: request gets 429 Too Many Requests and should not be quoted for payment.

Implementation detail to verify: middleware ordering must avoid asking the buyer to pay when the global in-flight limit is already saturated.
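One way to express that ordering is to list both middlewares as ExtensionRef filters on the HTTPRoute, capacity first. This is a sketch, assuming Traefik applies ExtensionRef filters in list order (to be verified per the note above); the gateway name and the x402 middleware name are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  parentRefs:
    - name: traefik-gateway          # assumed gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /services/qwen36-fast
      filters:
        # capacity limit first: a saturated route 429s before any 402 quote
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-inflight
        # then x402 ForwardAuth (name assumed from existing reconciler output)
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-x402
      backendRefs:
        - name: qwen36-external
          port: 8000
```

If filter order alone does not guarantee 429-before-402 in practice, that is the signal to consider the x402-aware admission shim raised in the open questions.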

Optional request-shape guard

If easy, enforce maxBodyBytes with gateway/proxy middleware. Token-shape limits (maxInputTokens, maxOutputTokens) may initially be status/discovery metadata unless the x402 verifier/proxy parses OpenAI requests.
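For the body-size guard, Traefik's buffering middleware can enforce a request-body ceiling directly from spec.requestLimits.maxBodyBytes. A minimal sketch:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-bodylimit
  namespace: llm
spec:
  buffering:
    maxRequestBodyBytes: 10485760   # from spec.requestLimits.maxBodyBytes
```

Note that buffering holds the request body before forwarding, which is fine for chat requests but worth checking against the streaming: true capability before enabling response-side limits.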

Admission behavior

buyer request
  -> Gateway/HTTPRoute
  -> capacity middleware / admission check
     saturated: 429, Retry-After, no payment quote
  -> x402 ForwardAuth
     missing payment: 402 quote
     valid payment: pass
  -> upstream Service pointing at external endpoint

For v1, we can avoid reservations entirely. A reservation layer can come later if we see buyer UX issues between quote and paid retry.

Status fields

Add enough status to make obol sell status and discovery honest:

status:
  capacity:
    mode: external-static
    scalingControlled: false
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    inFlight: 7
    state: Available # Available | Saturated | Unknown
  endpoint: https://.../services/qwen36-fast

inFlight can come from Traefik/x402 metrics when available; otherwise report state: Unknown rather than publishing a stale or guessed number.
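The same honesty applies to ServiceOffer.status.conditions. A hypothetical sketch of how capacity could surface there (condition types and reasons are illustrative, not an existing schema):

```yaml
status:
  conditions:
    - type: CapacityConfigured        # hypothetical condition type
      status: "True"
      reason: InFlightMiddlewareReady
      message: inFlightReq middleware qwen36-fast-inflight attached to route
    - type: CapacityObserved          # hypothetical condition type
      status: "False"
      reason: MetricsUnavailable
      message: no in-flight metric source; capacity.state reported as Unknown
```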

CLI UX

Minimal explicit form:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --namespace llm \
  --port 8000 \
  --runtime openai \
  --payment-asset usdc \
  --price 0.001 \
  --gpu-count 1 \
  --max-inflight 16 \
  --per-buyer-max-inflight 4 \
  --max-input-tokens 8192 \
  --max-output-tokens 1024

Optional preset form later:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --runtime openai \
  --capacity-preset external-vllm-1gpu-balanced

The preset should expand into static gateway limits only. It should not imply autoscaling.
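As a sketch, the hypothetical external-vllm-1gpu-balanced preset could expand into nothing more than the static fields already defined above (values illustrative):

```yaml
# Hypothetical preset expansion: static gateway limits only, no autoscaling.
capacity:
  mode: external-static
  gpuCount: 1
  maxInFlight: 16
  perBuyerMaxInFlight: 4
  maxQueueDepth: 0
requestLimits:
  maxInputTokens: 8192
  maxOutputTokens: 1024
```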

Acceptance criteria

  • ServiceOffer CRD accepts a minimal external-static capacity envelope.
  • Reconciler creates a Traefik inFlightReq middleware for spec.capacity.maxInFlight.
  • HTTPRoute attaches capacity middleware and x402 ForwardAuth in an order that avoids payment quotes when saturated.
  • Saturated offer returns 429/Retry-After, not 402.
  • Unsaturated unpaid request still returns normal x402 402 terms.
  • Paid request forwards to the external inference endpoint as before.
  • obol sell status shows capacity mode, scalingControlled=false, GPU count, max in-flight, and current/unknown in-flight state.
  • Existing obol sell http / sell inference behavior keeps working when capacity fields are omitted.
  • Payment asset is explicit in the payment terms (payment.asset / --payment-asset) with backwards-compatible default usdc.
  • x402 pricing/config output preserves the selected settlement asset so buyers know whether they are paying USDC or OBOL.
  • Docs explicitly state that external-static capacity is seller-declared and Obol does not autoscale the endpoint.

Open questions

  • Should gpuCount be required for type: inference capacity presets, or optional metadata only?
  • Do we need per-buyer concurrency in v1, or is global max-in-flight enough for the first cut?
  • Can Traefik middleware ordering give us 429 before 402 cleanly, or do we need a tiny x402-aware admission shim later?
  • Should maxQueueDepth be omitted entirely for v1 and default to no queue?
  • Should multi-asset offers use a future payment.accepts[] list, or should one ServiceOffer advertise exactly one settlement asset for minimal v1?
