Problem
`state_timeout` (ADR-0049) fires on wall-clock budget for a state, regardless of whether a WASM integration for that entity is actively running. Observed in production 2026-04-18: two Katagami sessions hit `Session.TimeoutFail` from state `Executing` at exactly 300s while `Session.ProcessToolCalls.integrations` was still executing — one single run_tools invocation ran for 372s. The watchdog killed the session mid-handler.
The spec tried to guard this with `reset_on = ["Heartbeat", "CheckpointToolBatch"]`, but:
- Across 148 `ProcessToolCalls` events in the incident window, 0 `Heartbeat` and 0 `CheckpointToolBatch` events were emitted.
- The burden of wiring progress signals is on every integration author, per-state, per-consumer. Forgetting it silently ships a trap state.
- Even if authors wire it correctly, "handler is running" is not actually proof that the entity isn't stuck — a handler can hang on a slow HTTP call, loop on bad input, or block on a downstream ask that is itself stuck. Coupling state-timer to handler liveness just relocates the trap state.
The real design question is: how does the platform distinguish "slow but making progress" from "hung"? That requires two separate deadlines, not one.
The missing primitive
Today:
- `TEMPER_ACTION_TIMEOUT_SECS=5` is an ask-reach timeout (actor accepts the message). Not a handler-execution deadline.
- Once the actor begins dispatching the integration, the WASM handler runs asynchronously with no runtime-enforced wall-clock budget.
- `state_timeout` fires on the state machine from outside, killing legitimate work.
We need:
- Handler-level deadline, propagated into every WASM integration, enforced by the host. If the handler exceeds it, the host kills the guest and surfaces a deterministic `HandlerDeadlineExceeded` → the state machine consumes a normal `HandlerTimedOut` transition.
- State-level deadline goes back to its simple meaning: "no action dispatched at all for N seconds" — pure idle detection, last resort.
- `reset_on` disappears for normal intra-handler progress. It stays only for genuine cross-entity progress signals (another entity pinging this one).
With both layers, a hung handler is the runtime's problem to terminate, not something `state_timeout` guesses at.
Design sketch
- Wasmtime epoch-based interruption in `temper-runtime/src/wasm/` — epoch ticker incremented on a dedicated thread; the guest yields at every host-function boundary and at instruction-level intervals.
- `DispatchDeadline` carried in the dispatch context, propagated through host-function boundaries.
- Host-function ABI: every host function checks deadline before proceeding, returns `Err(DeadlineExceeded)` cleanly.
- Deterministic replay story: deadlines are clock-derived during live execution but recorded in the event log so DST replay produces the same timeout decisions. Non-determinism here would break the property-testing harness.
- New error variant `ActorError::HandlerDeadlineExceeded`. Classifier: transient from dispatch's view (operator can retry with a larger budget if desired) but the state machine sees a distinct normal transition, not a retry.
- Span linkage: the killed handler span and the resulting `HandlerTimedOut` state transition share `trace_id` so Datadog shows them in one trace.
Datadog instrumentation (must land with the primitive)
- `temper_handler_deadline_remaining_ms` — gauge at dispatch start, by `entity_type`, `action`. See how tight budgets are.
- `temper_handler_deadline_exceeded_total` — counter tagged `entity_type`, `action`, `dying_span` (which host function was running when the handler died). Without `dying_span` this metric is useless — it is what tells us whether web_search, provider_caller, or temper.read is the hang source.
- `temper_wasm_epoch_tick_interval_ms` — histogram. If ticker drifts, deadlines become imprecise.
- Dashboards + monitors added to `openpaw-overview.json` / `openpaw-monitors.json` under a new `Handler Liveness` group.
Acceptance criteria
- A WASM integration that sleeps indefinitely is killed deterministically inside its declared budget, and the state machine observes a normal transition (not a dangling actor).
- Killed handlers leave no lingering CPU (continuous profiler confirms).
- DST replay produces identical `HandlerTimedOut` decisions given the recorded event log.
- After consumer migration, `reset_on` can be removed from all intra-handler-progress declarations in OpenPaw specs without increasing false-positive `state_timeout` fires.
Related
- Incident 2026-04-18: Katagami regeneration burst, 4/21 sessions failed (2 from this issue, 2 from #(dispatch contention issue — link on create)).
- ADR-0049 (state-entry timeouts) — this issue adds the handler-deadline layer underneath it. A new ADR should document both primitives together.
- OpenPaw `session.ioa.toml` — `reset_on` declarations become obsolete for normal handler activity once this lands.
Out of scope
- Raising `Executing`'s `after_seconds` from 300.
- Teaching `monty_repl` to emit `CheckpointToolBatch` more aggressively.
Both are workarounds for the missing handler-deadline primitive.
Problem
`state_timeout` (ADR-0049) fires on wall-clock budget for a state, regardless of whether a WASM integration for that entity is actively running. Observed in production 2026-04-18: two Katagami sessions hit `Session.TimeoutFail` from state `Executing` at exactly 300s while `Session.ProcessToolCalls.integrations` was still executing — one single run_tools invocation ran for 372s. The watchdog killed the session mid-handler.
The spec tried to guard this with `reset_on = ["Heartbeat", "CheckpointToolBatch"]`, but:
The real design question is: how does the platform distinguish "slow but making progress" from "hung"? That requires two separate deadlines, not one.
The missing primitive
Today:
We need:
With both layers, a hung handler is the runtime's problem to terminate, not something `state_timeout` guesses at.
Design sketch
Datadog instrumentation (must land with the primitive)
Acceptance criteria
Related
Out of scope
Both are workarounds for the missing handler-deadline primitive.