
[Enhancement]: Add OpenClaw LLM execution middleware for provider calls #78

@mnajafian-nv

Description


Affected area

Plugins, Observability or exporters, Third-party integration patches

Problem or opportunity

TL;DR

This issue requests that OpenClaw support a single invocation-scoped LLM execution middleware around each provider/model call. OpenClaw already has most of the needed data internally; the ask is to expose it through a stable public provider-call boundary so NeMo Flow can build authoritative Phoenix/OpenInference spans for security, optimization, and observability without patching OpenClaw internals or guessing from separate hooks.

The middleware should be additive, provider-neutral, privacy-aware, and backward-compatible with existing plugin hooks. Observability should be fail-open by default; blocking, rewriting, routing, or annotation should be explicit policy-controlled behavior.

Related

Problem

The NeMo Flow OpenClaw plugin can produce useful Phoenix/OpenInference traces from current public hooks, but those traces are not always authoritative optimization evidence.

Current hooks expose related signals separately:

  • model_call_started / model_call_ended: callId, provider/model/API/transport, duration, TTFB, byte counts, and error category
  • llm_input: prompt, system prompt, history snapshot, and image count
  • llm_output: assistant text and accumulated usage
  • message/tool hooks: assistant/tool side effects after provider output has been mapped into session messages
  • trajectory metadata: run-level session, agent, provider, model API, config, plugins, redaction policy, final usage, and prompt-cache artifacts

Those are useful pieces, but they do not form a single provider-call contract. In multi-step loops such as LLM -> tool -> LLM -> tool -> LLM, one run can contain several provider calls and tool calls. Pairing request, response, usage, timing, tool-call metadata, and final output by ordering alone is ambiguous in the presence of streaming, retries, fallbacks, compaction retries, or concurrent tool activity.

PR #67 is intentionally conservative: when timing cannot be paired safely, the plugin emits diagnostic marks instead of inventing latency. That is correct for a hook-based integration, but it means token, cache, cost, latency, TTFB, retry/fallback state, and model-emitted tool-call metadata do not always stay attached to the provider invocation that produced them.

Proposed enhancement

OpenClaw already has most of the required data internally:

  • model-call diagnostics create callId and record timing, TTFB, byte counts, provider/model/API/transport, and failure category
  • provider transports build normalized requests and parse response events, finish reasons, response ids, tool calls, usage, cache counters, and cost
  • prompt-cache observability and trajectory metadata capture request-shape, run/session/agent/config/plugin, and redaction context
  • run attempt metadata and failover logging represent retries, profile rotation, fallback decisions, and error/status details
  • PR #67 (feat: add OpenClaw observability plugin) adds bounded correlation, placeholder replay, ambiguity/unpaired timing marks, fail-open replay handling, and session-end draining

The NeMo Flow eval patches show why provider-call fidelity matters:

  • Provider cache evidence is API-surface specific. OpenAI-compatible routes expose cache reuse through usage.prompt_tokens_details.cached_tokens or usage.input_tokens_details.cached_tokens; Anthropic Messages routes expose usage.cache_read_input_tokens and usage.cache_creation_input_tokens.
  • Cache mode follows the provider API surface, not only the model family string. A routed Anthropic model on an OpenAI-compatible endpoint needs OpenAI-style cache evidence, while a native Anthropic Messages route needs Anthropic-style cache evidence.
  • The patched codec path had to emit provider-native usage for openai_chat, openai_responses, and anthropic_messages; otherwise Phoenix/OpenInference output could not prove provider cache behavior.
  • ACG and tool-policy optimization need stable request-shape identifiers, effective tool-schema evidence, and a defined telemetry completion point. Volatile task text is not a reliable key and may be redacted.
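The API-surface split above can be sketched as a small normalizer. This is an illustrative sketch, not OpenClaw code: the function name and output shape are invented here; only the usage field paths come from the issue text.

```typescript
// Normalized cache evidence, regardless of which API surface produced it.
type NormalizedCache = { cacheReadTokens: number; cacheWriteTokens: number };

// Select cache fields by API surface, not by model family string:
// a routed Anthropic model on an OpenAI-compatible endpoint takes the
// OpenAI-style branch, a native Anthropic Messages route takes the other.
function extractCacheUsage(
  apiSurface: string,
  usage: Record<string, any> | undefined,
): NormalizedCache {
  switch (apiSurface) {
    case "openai_chat":
      return {
        cacheReadTokens: usage?.prompt_tokens_details?.cached_tokens ?? 0,
        cacheWriteTokens: 0, // no cache-write counter on this surface
      };
    case "openai_responses":
      return {
        cacheReadTokens: usage?.input_tokens_details?.cached_tokens ?? 0,
        cacheWriteTokens: 0,
      };
    case "anthropic_messages":
      return {
        cacheReadTokens: usage?.cache_read_input_tokens ?? 0,
        cacheWriteTokens: usage?.cache_creation_input_tokens ?? 0,
      };
    default:
      return { cacheReadTokens: 0, cacheWriteTokens: 0 };
  }
}
```

Keeping the raw usage object alongside this normalized view (rather than replacing it) is what lets Phoenix/OpenInference output still prove provider-native cache behavior.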

The concrete request is a public LLM execution middleware in OpenClaw that wraps each provider/model invocation.

The middleware should be invocation-scoped, not only post-hoc. It should allow a plugin to observe the request before execution and the response or failure after execution. Where OpenClaw policy allows, the same shape should support blocking, rewriting, routing, or annotating the call.

The proposed middleware should compose OpenClaw's existing internal provider-call data into one invocation-scoped public record. It should not expose trajectory metadata wholesale as a plugin API; trajectory metadata is broader and run-scoped.

Runtime contract and binding impact

Control semantics should be explicit:

  • before: called after provider/model/API/transport resolution and normalized request construction, before dispatch
  • chunk: optional streaming callback with sanitized provider/native chunk information or normalized chunk metadata
  • after: called once for a successful invocation with final response, usage, cost, timing, finish reason, and model-emitted tool-call metadata
  • error: called once for a failed invocation with error/status metadata, elapsed timing, retry/fallback metadata, and any known partial usage/cost

The same callId must be present across all phases, and each dispatched invocation should emit exactly one terminal phase: after or error. If routing changes provider/model/API, OpenClaw should rebuild the provider request before dispatch.
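The phase contract above can be sketched as a small dispatch driver. All names here are hypothetical, not OpenClaw's actual API; the point is the invariant: before runs once, chunks are optional, and exactly one terminal phase (after or error) fires per dispatched invocation, all with the same callId. A streaming/async variant would be shaped the same way.

```typescript
// Hypothetical middleware surface; every phase receives the same callId.
interface LlmMiddleware {
  before?(callId: string): void;
  chunk?(callId: string, chunk: string): void;
  after?(callId: string, finalText: string): void;
  error?(callId: string, err: Error): void;
}

// Wraps one provider invocation, enforcing exactly one terminal phase.
function dispatchWithMiddleware(
  callId: string,
  mw: LlmMiddleware,
  invoke: (onChunk: (c: string) => void) => string,
): string {
  mw.before?.(callId); // after provider/model/API resolution, before dispatch
  try {
    const finalText = invoke((c) => mw.chunk?.(callId, c));
    mw.after?.(callId, finalText); // terminal phase: after
    return finalText;
  } catch (err) {
    mw.error?.(callId, err as Error); // terminal phase: error (never both)
    throw err;
  }
}
```

A fail-open observability binding would additionally catch and swallow exceptions thrown by the middleware callbacks themselves; a policy-enabled blocking plugin would be allowed to let its before-phase exception abort the dispatch.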

For NeMo Flow to treat plugin traces as authoritative, the middleware context needs:

  • stable invocation id / callId
  • optional logical call id when retries/fallbacks belong to one higher-level agent request
  • retry/fallback attempt metadata
  • run/session/agent context
  • provider, model, codec/API/transport, and request-surface metadata
  • effective tool schema/inventory metadata or a stable fingerprint
  • normalized provider request before execution
  • final normalized LLM response envelope after execution, including streaming chunks or an accumulated final response
  • provider-native usage before normalization, including OpenAI-compatible and Anthropic Messages cache fields
  • normalized input/output/total tokens, cache read/write counters, and cost when available
  • start/end timing, latency, and TTFB when available
  • finish reason and model-emitted tool-call metadata
  • failure/error metadata for provider exceptions
  • sanitized raw payloads where allowed by OpenClaw privacy policy

Design constraints:

  • Keep the public API provider-neutral while preserving provider-native usage under a structured field.
  • Expose stable request-shape metadata for optimization without requiring volatile prompt text.
  • Treat raw request/response payloads as policy-gated diagnostic data.
  • Keep existing hooks backward-compatible.
  • Run at the transport/provider boundary, not final assistant-message replay.
  • Preserve fail-open behavior for observability plugins unless a plugin is explicitly configured for blocking/security behavior.
  • Define a telemetry completion point so short-lived runs can export final provider-call evidence before shutdown.

Binding impact:

  • No required change to existing plugin hooks if this is added as a new middleware capability.
  • Existing plugins can ignore this middleware and continue using current hooks.
  • NeMo Flow would bind to the middleware and map each provider call directly to one Phoenix/OpenInference LLM span.
  • This should reduce or remove the current best-effort correlation logic in the NeMo Flow OpenClaw plugin.
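The direct mapping can be sketched as one function from a completed provider call to LLM span attributes. The attribute keys follow OpenInference semantic conventions as commonly documented; the input record shape is hypothetical, and a real binding would also attach cache counters, cost, finish reason, and tool-call metadata.

```typescript
// One provider call -> one set of Phoenix/OpenInference LLM span attributes.
// No message-order or timing-candidate heuristics are needed: the middleware
// already delivered everything scoped to a single callId.
function toLlmSpanAttributes(call: {
  model: string;
  provider: string;
  inputTokens: number;
  outputTokens: number;
}): Record<string, string | number> {
  return {
    "openinference.span.kind": "LLM",
    "llm.model_name": call.model,
    "llm.provider": call.provider,
    "llm.token_count.prompt": call.inputTokens,
    "llm.token_count.completion": call.outputTokens,
    "llm.token_count.total": call.inputTokens + call.outputTokens,
  };
}
```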

Alternatives considered

Acceptance criteria

  • OpenClaw exposes a public LLM execution middleware around provider/model invocations.
  • Each dispatched invocation has a stable callId; retries/fallbacks are distinguishable and can share an optional logical call id.
  • Middleware can observe normalized provider requests before execution and final normalized response envelopes after execution for streaming and non-streaming calls.
  • The middleware provides one provider-call boundary that is sufficient for security, optimization, and observability use cases without requiring separate provider-call lifecycle hooks.
  • Completion data preserves provider-native usage before normalization, including OpenAI-compatible cached-token fields and Anthropic Messages cache read/write fields.
  • Completion data exposes normalized tokens, cache counters, cost when available, finish reason, latency, TTFB, and model-emitted tool-call metadata.
  • Failure data exposes error type/status, elapsed timing, retry/fallback metadata, and known usage/cost for failed attempts.
  • Payloads follow OpenClaw privacy/redaction policy and do not expose secrets.
  • Observation failures are isolated from model execution by default.
  • Existing plugin hooks remain backward-compatible.
  • A NeMo Flow plugin can map each provider call directly to one Phoenix/OpenInference LLM span without message-order or timing-candidate heuristics.
  • A multi-step agent loop can produce an accurate LLM -> tool -> LLM -> tool -> LLM trace with correct token/cache/cost attribution per LLM span.

Metadata

Labels

Improvement (improvement to existing functionality)
