WASM integration race condition: stale SessionSpawned dispatch fails on retry #142

@nerdsane

Problem

When multiple CurationJobs are submitted simultaneously and some fail due to Cedar policy issues, retrying them causes a race condition:

  1. Job is Submitted → build_session_message WASM fires asynchronously
  2. WASM fails (Cedar denied http_call) → job moves to Failed
  3. Cedar policy is approved, job is Retried (Failed → Queued) and re-Submitted
  4. New build_session_message WASM fires and successfully spawns a session, dispatching SessionSpawned → job moves to Running
  5. The stale WASM from step 1 (or a previous retry) also completes and tries to dispatch SessionSpawned
  6. Entity is already at Running → 409 "Action 'SessionSpawned' not valid from state 'Running'"
  7. WASM treats the 409 as a failure and calls Fail on the job, killing the running session

This affected 3 out of 10 simultaneously submitted jobs in production.
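The timeline above can be replayed deterministically with a toy state machine. This is only a sketch: the transition table is inferred from the steps listed in this issue, and `Job`, `Conflict`, and `complete_integration` are hypothetical names, not the real runtime API.

```python
class Conflict(Exception):
    """Stand-in for the 409 state-conflict response."""


# Transition table inferred from the steps in this issue (assumption).
TRANSITIONS = {
    ("Queued", "Submit"): "Submitted",
    ("Submitted", "SessionSpawned"): "Running",
    ("Submitted", "Fail"): "Failed",
    ("Running", "Fail"): "Failed",
    ("Failed", "Retry"): "Queued",
}


class Job:
    def __init__(self):
        self.state = "Queued"

    def dispatch(self, action):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise Conflict(
                f"Action '{action}' not valid from state '{self.state}'"
            )
        self.state = TRANSITIONS[key]


def complete_integration(job):
    """Models the reported behaviour: any dispatch error leads to Fail."""
    try:
        job.dispatch("SessionSpawned")
    except Conflict:
        job.dispatch("Fail")


job = Job()
job.dispatch("Submit")      # step 1: first invocation starts
job.dispatch("Fail")        # step 2: Cedar denies http_call
job.dispatch("Retry")       # step 3: Failed -> Queued
job.dispatch("Submit")      # step 3: re-Submitted, second invocation starts
complete_integration(job)   # step 4: second invocation spawns a session
state_after_fresh = job.state
complete_integration(job)   # steps 5-7: stale completion hits 409, then Fail
state_after_stale = job.state
```

After the fresh completion the job is Running; the stale completion then drives it to Failed even though a session is live.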

Root Cause

WASM integration completions are not correlated to the specific invocation that triggered them. A stale completion from a previous Submit attempt can dispatch actions into an entity that has already moved past the expected state.

Possible Fixes

  1. Idempotent transitions: Allow SessionSpawned from Running state as a no-op (simplest, but loses the state machine's strictness)
  2. Invocation correlation: Add a nonce/correlation ID to each WASM invocation. Only accept action dispatches from the latest invocation — discard stale completions silently
  3. Guard on sequence number: WASM completion could include the entity's sequence_nr at invocation time. If the entity has advanced past that sequence, the completion is stale and should be discarded
  4. Don't Fail on 409: If the WASM's post-completion action dispatch returns 409 (state conflict), treat it as a stale completion rather than a failure — don't call Fail on the entity
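Options 2 and 3 could be combined in one completion check. The sketch below is an assumption about how that might look, not the existing integration API: `start_invocation`, `on_completion`, and `current_invocation` are hypothetical names, and only `sequence_nr` comes from the text above.

```python
import uuid


class Job:
    def __init__(self):
        self.state = "Submitted"
        self.sequence_nr = 0
        self.current_invocation = None

    def start_invocation(self):
        """Issue a fresh correlation ID and capture the current sequence."""
        self.current_invocation = uuid.uuid4().hex
        return self.current_invocation, self.sequence_nr

    def apply(self, new_state):
        """Every accepted transition advances the sequence number."""
        self.state = new_state
        self.sequence_nr += 1

    def on_completion(self, invocation_id, seq_at_invocation):
        # Option 2: only the latest invocation may dispatch.
        if invocation_id != self.current_invocation:
            return "discarded: superseded invocation"
        # Option 3: the entity must not have advanced since invocation.
        if seq_at_invocation != self.sequence_nr:
            return "discarded: entity advanced"
        self.apply("Running")  # SessionSpawned
        return "applied"


job = Job()
stale = job.start_invocation()   # first Submit
job.apply("Failed")              # Cedar denied
job.apply("Queued")              # Retry
job.apply("Submitted")           # re-Submit
fresh = job.start_invocation()

fresh_result = job.on_completion(*fresh)
stale_result = job.on_completion(*stale)
duplicate_result = job.on_completion(*fresh)
```

Here the fresh completion is applied, the stale one is dropped by the correlation check, and a duplicate delivery of the fresh completion is dropped by the sequence check. In all three cases the job stays at Running.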

Option 4 is the narrowest fix. Options 2 and 3 are the most robust.
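A minimal sketch of option 4, assuming the post-completion dispatch surfaces the HTTP status to the caller. `DispatchError`, `finish_integration`, and the `dispatch`/`fail` callables are hypothetical stand-ins for whatever the runtime actually provides.

```python
class DispatchError(Exception):
    """Hypothetical error carrying the HTTP status of a rejected dispatch."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def finish_integration(dispatch, fail):
    """Dispatch SessionSpawned; treat a 409 as stale instead of failing."""
    try:
        dispatch("SessionSpawned")
    except DispatchError as err:
        if err.status == 409:
            # State conflict: the entity already moved on, so drop quietly.
            return "stale"
        fail(str(err))
        return "failed"
    return "spawned"


def conflicting_dispatch(action):
    raise DispatchError(
        409, f"Action '{action}' not valid from state 'Running'"
    )


def broken_dispatch(action):
    raise DispatchError(500, "internal error")


fail_calls = []
stale_outcome = finish_integration(conflicting_dispatch, fail_calls.append)
error_outcome = finish_integration(broken_dispatch, fail_calls.append)
ok_outcome = finish_integration(lambda action: None, fail_calls.append)
```

Only the non-409 error reaches `fail`, so a running session is no longer killed by a stale completion. The trade-off is that a genuine state conflict is also swallowed, which is why this counts as the narrowest fix.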

Reproduction

# Create 10 CurationJobs, configure all, submit all simultaneously
# Some will race and produce the SessionSpawned-from-Running 409

Environment

  • Railway production deployment
  • CurationJob entity with build_session_message WASM integration
  • 10 concurrent job submissions
