Skip to content

Handler-level deadline propagation + enforcement for WASM integrations #147

@rita-aga

Description

@rita-aga

Problem

`state_timeout` (ADR-0049) fires on wall-clock budget for a state, regardless of whether a WASM integration for that entity is actively running. Observed in production 2026-04-18: two Katagami sessions hit `Session.TimeoutFail` from state `Executing` at exactly 300s while `Session.ProcessToolCalls.integrations` was still executing — one single run_tools invocation ran for 372s. The watchdog killed the session mid-handler.

The spec tried to guard this with `reset_on = ["Heartbeat", "CheckpointToolBatch"]`, but:

  • Across 148 `ProcessToolCalls` events in the incident window, 0 `Heartbeat` and 0 `CheckpointToolBatch` events were emitted.
  • The burden of wiring progress signals is on every integration author, per-state, per-consumer. Forgetting it silently ships a trap state.
  • Even if authors wire it correctly, "handler is running" is not actually proof that the entity isn't stuck — a handler can hang on a slow HTTP call, loop on bad input, or block on a downstream ask that is itself stuck. Coupling state-timer to handler liveness just relocates the trap state.

The real design question is: how does the platform distinguish "slow but making progress" from "hung"? That requires two separate deadlines, not one.

The missing primitive

Today:

  • `TEMPER_ACTION_TIMEOUT_SECS=5` is an ask-reach timeout (actor accepts the message). Not a handler-execution deadline.
  • Once the actor begins dispatching the integration, the WASM handler runs asynchronously with no runtime-enforced wall-clock budget.
  • `state_timeout` fires on the state machine from outside, killing legitimate work.

We need:

  • Handler-level deadline, propagated into every WASM integration, enforced by the host. If the handler exceeds it, the host kills the guest and surfaces a deterministic `HandlerDeadlineExceeded` → the state machine consumes a normal `HandlerTimedOut` transition.
  • State-level deadline goes back to its simple meaning: "no action dispatched at all for N seconds" — pure idle detection, last resort.
  • `reset_on` disappears for normal intra-handler progress. It stays only for genuine cross-entity progress signals (another entity pinging this one).

With both layers, a hung handler is the runtime's problem to terminate, not something `state_timeout` guesses at.

Design sketch

  • Wasmtime epoch-based interruption in `temper-runtime/src/wasm/` — epoch ticker incremented on a dedicated thread; the guest yields at every host-function boundary and at instruction-level intervals.
  • `DispatchDeadline` carried in the dispatch context, propagated through host-function boundaries.
  • Host-function ABI: every host function checks deadline before proceeding, returns `Err(DeadlineExceeded)` cleanly.
  • Deterministic replay story: deadlines are clock-derived during live execution but recorded in the event log so DST replay produces the same timeout decisions. Non-determinism here would break the property-testing harness.
  • New error variant `ActorError::HandlerDeadlineExceeded`. Classifier: transient from dispatch's view (operator can retry with a larger budget if desired) but the state machine sees a distinct normal transition, not a retry.
  • Span linkage: the killed handler span and the resulting `HandlerTimedOut` state transition share `trace_id` so Datadog shows them in one trace.

Datadog instrumentation (must land with the primitive)

  • `temper_handler_deadline_remaining_ms` — gauge at dispatch start, by `entity_type`, `action`. See how tight budgets are.
  • `temper_handler_deadline_exceeded_total` — counter tagged `entity_type`, `action`, `dying_span` (which host function was running when the handler died). Without `dying_span` this metric is useless — it is what tells us whether web_search, provider_caller, or temper.read is the hang source.
  • `temper_wasm_epoch_tick_interval_ms` — histogram. If ticker drifts, deadlines become imprecise.
  • Dashboards + monitors added to `openpaw-overview.json` / `openpaw-monitors.json` under a new `Handler Liveness` group.

Acceptance criteria

  • A WASM integration that sleeps indefinitely is killed deterministically inside its declared budget, and the state machine observes a normal transition (not a dangling actor).
  • Killed handlers leave no lingering CPU (continuous profiler confirms).
  • DST replay produces identical `HandlerTimedOut` decisions given the recorded event log.
  • After consumer migration, `reset_on` can be removed from all intra-handler-progress declarations in OpenPaw specs without increasing false-positive `state_timeout` fires.

Related

  • Incident 2026-04-18: Katagami regeneration burst, 4/21 sessions failed (2 from this issue, 2 from #(dispatch contention issue — link on create)).
  • ADR-0049 (state-entry timeouts) — this issue adds the handler-deadline layer underneath it. A new ADR should document both primitives together.
  • OpenPaw `session.ioa.toml` — `reset_on` declarations become obsolete for normal handler activity once this lands.

Out of scope

  • Raising `Executing`'s `after_seconds` from 300.
  • Teaching `monty_repl` to emit `CheckpointToolBatch` more aggressively.

Both are workarounds for the missing handler-deadline primitive.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions