feat: minimal external inference capacity envelope for ServiceOffer #439

@bussyjd

Description

Background

ServiceOffer should support paid inference capacity without turning Obol into a Kubernetes scheduler or trying to manage the inference runtime.

For the first implementation, assume the OpenAI-compatible/vLLM-like inference endpoint is outside the Obol stack cluster. Obol has exactly one upstream endpoint to work with and zero control over scaling. This matches host-local vLLM / llama.cpp / DGX Spark style setups where the serving process is already running and the cluster only needs to expose it safely through x402.

This is intentionally the minimal-surface issue. In-stack multi-GPU inference servers, GPU device plugins, KEDA/HPA, RuntimeClasses, and Kubernetes-managed vLLM deployments are separate work. See #430.

Goals

  • Add a minimal capacity/admission envelope to ServiceOffer for external inference endpoints.
  • Use Kubernetes/Gateway primitives we already have instead of inventing a scheduler.
  • Keep the first version compatible with the current Traefik + Gateway API + x402 ForwardAuth architecture.
  • Treat GPU count as the relevant capacity dimension for inference, not generic cloud CPU scaling.
  • Avoid any autoscaling semantics for this phase: one external endpoint, static seller-declared capacity.
  • Prevent buyers from paying into an obviously saturated service.

Non-goals

  • Do not run vLLM/llama.cpp inside the Obol cluster in this issue.
  • Do not create or manage Deployment, StatefulSet, HPA, KEDA ScaledObject, RuntimeClass, or GPU device plugin resources here.
  • Do not scale replicas based on CPU/memory metrics.
  • Do not create one Kubernetes object per request/reservation.
  • Do not implement full token metering/per-token settlement here; this is admission/capacity protection around the existing x402 route.

Proposed ServiceOffer shape

Keep this small and seller-facing:

apiVersion: obol.org/v1alpha1
kind: ServiceOffer
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  type: inference

  upstream:
    service: qwen36-external
    namespace: llm
    port: 8000
    healthPath: /health

  capabilities:
    protocol: openai-chat
    streaming: true
    contextWindowTokens: 262144

  requestLimits:
    maxInputTokens: 8192
    maxOutputTokens: 1024
    maxBodyBytes: 10485760

  capacity:
    mode: external-static
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    maxQueueDepth: 0

  payment:
    network: base-sepolia
    asset: usdc # usdc | obol | 0x...
    payTo: "0x..."
    scheme: exact
    price:
      perRequest: "0.001"
      perMTok: "0.50"

Notes:

  • capacity.mode: external-static means Obol does not control scaling.
  • gpuCount is capacity metadata and preset input, not a Kubernetes scheduling instruction.
  • maxInFlight is the actual hard admission ceiling enforced at the gateway/proxy layer.
  • maxQueueDepth: 0 should be the safe default for paid inference: if saturated, return 429 and do not request payment.
  • perBuyerMaxInFlight is desirable, but can land after global maxInFlight if keying Traefik's in-flight limit on a buyer-identifying header (e.g. inFlightReq.sourceCriterion) proves non-trivial.
  • payment.asset should be explicit because Obol supports both USDC and OBOL-token settlement. Default to usdc for backwards compatibility, and resolve symbolic assets (usdc, obol) to the network-specific token contract/decimals when writing x402 payment requirements.
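To make the symbolic-asset resolution concrete, here is a hypothetical sketch of what a resolved route entry in the x402 pricing ConfigMap could look like. The field names loosely follow x402 payment-requirement terms; the exact ConfigMap schema is whatever the reconciler already emits, and the addresses are placeholders, not real contracts:

```yaml
# Hypothetical resolved entry: the seller wrote asset: usdc, the reconciler
# resolved it to the network-specific token contract and decimals.
routes:
  /services/qwen36-fast:
    scheme: exact
    network: base-sepolia
    payTo: "0x..."
    asset: "0x..."            # resolved ERC-20 contract for USDC on base-sepolia
    assetSymbol: usdc         # preserved so buyers see the settlement asset
    maxAmountRequired: "1000" # perRequest 0.001 USDC at 6 decimals
```

The key point is that both the resolved contract address and the original symbol survive into the payment terms, so buyers can tell whether they are settling in USDC or OBOL.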

Kubernetes/Gateway primitives to use

Existing resources

Continue generating/using:

  • HTTPRoute for /services/<offer>
  • Traefik Middleware for x402 ForwardAuth
  • x402 pricing ConfigMap route entry
  • owner references for garbage collection
  • ServiceOffer.status.conditions

Add generated capacity middleware

For Traefik, generate an additional middleware from spec.capacity.maxInFlight:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-inflight
  namespace: llm
spec:
  inFlightReq:
    amount: 16

Attach it to the same route as the x402 middleware. The desired behavior is:

  • If capacity is available: unpaid request reaches x402 verifier and gets a 402 Payment Required quote.
  • If capacity is saturated: request gets 429 Too Many Requests and should not be quoted for payment.

Implementation detail to verify: middleware ordering must avoid asking the buyer to pay when the global in-flight limit is already saturated.
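One way to express that ordering is to list both middlewares as ExtensionRef filters on the HTTPRoute, capacity first. This is a sketch, assuming Traefik applies ExtensionRef filters in list order (to be verified per the note above); the gateway name and the x402 middleware name are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  parentRefs:
    - name: traefik-gateway          # assumed gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /services/qwen36-fast
      filters:
        # capacity limit first: a saturated route 429s before any 402 quote
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-inflight
        # then x402 ForwardAuth (name assumed from existing reconciler output)
        - type: ExtensionRef
          extensionRef:
            group: traefik.io
            kind: Middleware
            name: qwen36-fast-x402
      backendRefs:
        - name: qwen36-external
          port: 8000
```

If filter order alone does not guarantee 429-before-402 in practice, that is the signal to consider the x402-aware admission shim raised in the open questions.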

Optional request-shape guard

If easy, enforce maxBodyBytes with gateway/proxy middleware. Token-shape limits (maxInputTokens, maxOutputTokens) may initially be status/discovery metadata unless the x402 verifier/proxy parses OpenAI requests.
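For the body-size guard, Traefik's buffering middleware can enforce a request-body ceiling directly from spec.requestLimits.maxBodyBytes. A minimal sketch:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-bodylimit
  namespace: llm
spec:
  buffering:
    maxRequestBodyBytes: 10485760   # from spec.requestLimits.maxBodyBytes
```

Note that buffering holds the request body before forwarding, which is fine for chat requests but worth checking against the streaming: true capability before enabling response-side limits.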

Admission behavior

buyer request
  -> Gateway/HTTPRoute
  -> capacity middleware / admission check
     saturated: 429, Retry-After, no payment quote
  -> x402 ForwardAuth
     missing payment: 402 quote
     valid payment: pass
  -> upstream Service pointing at external endpoint

For v1, we can avoid reservations entirely. A reservation layer can come later if we see buyer UX issues between quote and paid retry.

Status fields

Add enough status to make obol sell status and discovery honest:

status:
  capacity:
    mode: external-static
    scalingControlled: false
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    inFlight: 7
    state: Available # Available | Saturated | Unknown
  endpoint: https://.../services/qwen36-fast

inFlight can come from Traefik/x402 metrics when available; otherwise report state: Unknown rather than publishing a stale or guessed number.
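The same honesty applies to ServiceOffer.status.conditions. A hypothetical sketch of how capacity could surface there (condition types and reasons are illustrative, not an existing schema):

```yaml
status:
  conditions:
    - type: CapacityConfigured        # hypothetical condition type
      status: "True"
      reason: InFlightMiddlewareReady
      message: inFlightReq middleware qwen36-fast-inflight attached to route
    - type: CapacityObserved          # hypothetical condition type
      status: "False"
      reason: MetricsUnavailable
      message: no in-flight metric source; capacity.state reported as Unknown
```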

CLI UX

Minimal explicit form:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --namespace llm \
  --port 8000 \
  --runtime openai \
  --payment-asset usdc \
  --price 0.001 \
  --gpu-count 1 \
  --max-inflight 16 \
  --per-buyer-max-inflight 4 \
  --max-input-tokens 8192 \
  --max-output-tokens 1024

Optional preset form later:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --runtime openai \
  --capacity-preset external-vllm-1gpu-balanced

The preset should expand into static gateway limits only. It should not imply autoscaling.
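As a sketch, the hypothetical external-vllm-1gpu-balanced preset could expand into nothing more than the static fields already defined above (values illustrative):

```yaml
# Hypothetical preset expansion: static gateway limits only, no autoscaling.
capacity:
  mode: external-static
  gpuCount: 1
  maxInFlight: 16
  perBuyerMaxInFlight: 4
  maxQueueDepth: 0
requestLimits:
  maxInputTokens: 8192
  maxOutputTokens: 1024
```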

Acceptance criteria

  • ServiceOffer CRD accepts a minimal external-static capacity envelope.
  • Reconciler creates a Traefik inFlightReq middleware for spec.capacity.maxInFlight.
  • HTTPRoute attaches capacity middleware and x402 ForwardAuth in an order that avoids payment quotes when saturated.
  • Saturated offer returns 429/Retry-After, not 402.
  • Unsaturated unpaid request still returns normal x402 402 terms.
  • Paid request forwards to the external inference endpoint as before.
  • obol sell status shows capacity mode, scalingControlled=false, GPU count, max in-flight, and current/unknown in-flight state.
  • Existing obol sell http / sell inference behavior keeps working when capacity fields are omitted.
  • Payment asset is explicit in the payment terms (payment.asset / --payment-asset) with backwards-compatible default usdc.
  • x402 pricing/config output preserves the selected settlement asset so buyers know whether they are paying USDC or OBOL.
  • Docs explicitly state that external-static capacity is seller-declared and Obol does not autoscale the endpoint.

Open questions

  • Should gpuCount be required for type: inference capacity presets, or optional metadata only?
  • Do we need per-buyer concurrency in v1, or is global max-in-flight enough for the first cut?
  • Can Traefik middleware ordering give us 429 before 402 cleanly, or do we need a tiny x402-aware admission shim later?
  • Should maxQueueDepth be omitted entirely for v1 and default to no queue?
  • Should multi-asset offers use a future payment.accepts[] list, or should one ServiceOffer advertise exactly one settlement asset for minimal v1?
