Handler-level deadline propagation + enforcement for WASM integrations

## Problem

\`state_timeout\` (ADR-0049) fires on wall-clock budget for a state, regardless of whether a WASM integration for that entity is actively running. Observed in production 2026-04-18: two Katagami sessions hit \`Session.TimeoutFail\` from state \`Executing\` at exactly 300s while \`Session.ProcessToolCalls.integrations\` was still executing — one single run_tools invocation ran for 372s. The watchdog killed the session mid-handler.

The spec tried to guard this with \`reset_on = [\"Heartbeat\", \"CheckpointToolBatch\"]\`, but:
- Across 148 \`ProcessToolCalls\` events in the incident window, **0 \`Heartbeat\` and 0 \`CheckpointToolBatch\` events were emitted**.
- The burden of wiring progress signals is on every integration author, per-state, per-consumer. Forgetting it silently ships a trap state.
- Even if authors wire it correctly, \"handler is running\" is not actually proof that the entity isn't stuck — a handler can hang on a slow HTTP call, loop on bad input, or block on a downstream ask that is itself stuck. Coupling state-timer to handler liveness just relocates the trap state.

The real design question is: **how does the platform distinguish \"slow but making progress\" from \"hung\"?** That requires two separate deadlines, not one.

## The missing primitive

Today:
- \`TEMPER_ACTION_TIMEOUT_SECS=5\` is an **ask-reach** timeout (actor accepts the message). Not a handler-execution deadline.
- Once the actor begins dispatching the integration, the WASM handler runs asynchronously with **no runtime-enforced wall-clock budget**.
- \`state_timeout\` fires on the state machine from outside, killing legitimate work.

We need:
- **Handler-level deadline**, propagated into every WASM integration, enforced by the host. If the handler exceeds it, the host kills the guest and surfaces a deterministic \`HandlerDeadlineExceeded\` → the state machine consumes a normal \`HandlerTimedOut\` transition.
- **State-level deadline** goes back to its simple meaning: \"no action dispatched at all for N seconds\" — pure idle detection, last resort.
- \`reset_on\` disappears for normal intra-handler progress. It stays only for genuine cross-entity progress signals (another entity pinging this one).

With both layers, a hung handler is the runtime's problem to terminate, not something \`state_timeout\` guesses at.

## Design sketch

- Wasmtime epoch-based interruption in \`temper-runtime/src/wasm/\` — epoch ticker incremented on a dedicated thread; the guest yields at every host-function boundary and at instruction-level intervals.
- \`DispatchDeadline\` carried in the dispatch context, propagated through host-function boundaries.
- Host-function ABI: every host function checks deadline before proceeding, returns \`Err(DeadlineExceeded)\` cleanly.
- Deterministic replay story: deadlines are clock-derived during live execution but **recorded in the event log** so DST replay produces the same timeout decisions. Non-determinism here would break the property-testing harness.
- New error variant \`ActorError::HandlerDeadlineExceeded\`. Classifier: transient from dispatch's view (operator can retry with a larger budget if desired) but **the state machine sees a distinct normal transition**, not a retry.
- Span linkage: the killed handler span and the resulting \`HandlerTimedOut\` state transition share \`trace_id\` so Datadog shows them in one trace.

## Datadog instrumentation (must land with the primitive)

- \`temper_handler_deadline_remaining_ms\` — gauge at dispatch start, by \`entity_type\`, \`action\`. See how tight budgets are.
- \`temper_handler_deadline_exceeded_total\` — counter tagged \`entity_type\`, \`action\`, \`dying_span\` (which host function was running when the handler died). **Without \`dying_span\` this metric is useless** — it is what tells us whether web_search, provider_caller, or temper.read is the hang source.
- \`temper_wasm_epoch_tick_interval_ms\` — histogram. If ticker drifts, deadlines become imprecise.
- Dashboards + monitors added to \`openpaw-overview.json\` / \`openpaw-monitors.json\` under a new \`Handler Liveness\` group.

## Acceptance criteria

- A WASM integration that sleeps indefinitely is killed deterministically inside its declared budget, and the state machine observes a normal transition (not a dangling actor).
- Killed handlers leave no lingering CPU (continuous profiler confirms).
- DST replay produces identical \`HandlerTimedOut\` decisions given the recorded event log.
- After consumer migration, \`reset_on\` can be removed from all intra-handler-progress declarations in OpenPaw specs without increasing false-positive \`state_timeout\` fires.

## Related

- Incident 2026-04-18: Katagami regeneration burst, 4/21 sessions failed (2 from this issue, 2 from #(dispatch contention issue — link on create)).
- ADR-0049 (state-entry timeouts) — this issue adds the handler-deadline layer underneath it. A new ADR should document both primitives together.
- OpenPaw \`session.ioa.toml\` — \`reset_on\` declarations become obsolete for normal handler activity once this lands.

## Out of scope

- Raising \`Executing\`'s \`after_seconds\` from 300.
- Teaching \`monty_repl\` to emit \`CheckpointToolBatch\` more aggressively.

Both are workarounds for the missing handler-deadline primitive.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handler-level deadline propagation + enforcement for WASM integrations #147

Problem

The missing primitive

Design sketch

Datadog instrumentation (must land with the primitive)

Acceptance criteria

Related

Out of scope

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Handler-level deadline propagation + enforcement for WASM integrations #147

Description

Problem

The missing primitive

Design sketch

Datadog instrumentation (must land with the primitive)

Acceptance criteria

Related

Out of scope

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions