Problem
When multiple CurationJobs are submitted simultaneously and some fail due to Cedar policy issues, retrying them causes a race condition:
1. Job is Submitted → build_session_message WASM fires asynchronously
2. WASM fails (Cedar denied http_call) → job moves to Failed
3. Cedar policy is approved, job is Retried (Failed → Queued) and re-Submitted
4. New build_session_message WASM fires and successfully spawns a session, dispatching SessionSpawned → job moves to Running
5. The stale WASM from step 1 (or a previous retry) also completes and tries to dispatch SessionSpawned
6. Entity is already at Running → 409 "Action 'SessionSpawned' not valid from state 'Running'"
7. WASM treats the 409 as a failure and calls Fail on the job, killing the running session
This affected 3 out of 10 simultaneously submitted jobs in production.
Root Cause
WASM integration completions are not correlated to the specific invocation that triggered them. A stale completion from a previous Submit attempt can dispatch actions into an entity that has already moved past the expected state.
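The sequence above can be modelled with a toy state machine. This is only an illustrative sketch, not the real entity or WASM runtime: the transition table, the intermediate state name "Submitted", and the helper complete_wasm are assumptions made for the example; only the action names and the 409 message come from this report.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 returned on an invalid transition."""


# Transitions reduced to the ones named in this issue.
# The state name "Submitted" is an assumption for illustration.
TRANSITIONS = {
    ("Queued", "Submit"): "Submitted",
    ("Submitted", "SessionSpawned"): "Running",
    ("Submitted", "Fail"): "Failed",
    ("Running", "Fail"): "Failed",
    ("Failed", "Retry"): "Queued",
}


class Job:
    def __init__(self):
        self.state = "Queued"

    def dispatch(self, action):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise Conflict(
                f"Action '{action}' not valid from state '{self.state}'"
            )
        self.state = TRANSITIONS[key]


def complete_wasm(job):
    """Completion handler with the bug: any dispatch error ends in Fail."""
    try:
        job.dispatch("SessionSpawned")
    except Conflict:
        job.dispatch("Fail")


job = Job()
job.dispatch("Submit")   # invocation A starts and stays in flight
job.dispatch("Fail")     # first attempt is marked Failed (Cedar denial)
job.dispatch("Retry")    # Failed -> Queued
job.dispatch("Submit")   # invocation B starts
complete_wasm(job)       # B completes: job moves to Running
state_after_b = job.state
complete_wasm(job)       # stale A completes: 409, then Fail kills the job
state_after_stale = job.state
```

Nothing in complete_wasm identifies which Submit it belongs to, so the second call is indistinguishable from a legitimate completion until the transition is rejected.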
Possible Fixes
1. Idempotent transitions: Allow SessionSpawned from Running state as a no-op (simplest, but loses the state machine's strictness)
2. Invocation correlation: Add a nonce/correlation ID to each WASM invocation. Only accept action dispatches from the latest invocation and silently discard stale completions
3. Guard on sequence number: WASM completion could include the entity's sequence_nr at invocation time. If the entity has advanced past that sequence, the completion is stale and should be discarded
4. Don't Fail on 409: If the WASM's post-completion action dispatch returns 409 (state conflict), treat it as a stale completion rather than a failure, and do not call Fail on the entity
Option 4 is the narrowest fix. Options 2 and 3 are the most robust.
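A minimal sketch of how options 2 and 4 could combine, under stated assumptions: the Job class, the current_invocation field, and on_wasm_success are hypothetical names invented for illustration, and the real entity and WASM host APIs may look quite different. Option 3 would follow the same shape, comparing a captured sequence_nr instead of an invocation ID.

```python
import uuid


class Conflict(Exception):
    """Stands in for the HTTP 409 returned on an invalid transition."""


class Job:
    def __init__(self):
        self.state = "Queued"
        self.current_invocation = None  # hypothetical correlation field

    def submit(self):
        # Each Submit mints a fresh ID, so older invocations become stale.
        # (State validation for Submit is omitted to keep the sketch short.)
        self.state = "Submitted"
        self.current_invocation = uuid.uuid4().hex
        return self.current_invocation

    def fail(self):
        self.state = "Failed"

    def retry(self):
        self.state = "Queued"

    def session_spawned(self, invocation_id):
        # Option 2: silently discard completions from older invocations.
        if invocation_id != self.current_invocation:
            return False
        if self.state != "Submitted":
            raise Conflict(
                f"Action 'SessionSpawned' not valid from state '{self.state}'"
            )
        self.state = "Running"
        return True


def on_wasm_success(job, invocation_id):
    """Completion handler: returns True only if this call advanced the job."""
    try:
        return job.session_spawned(invocation_id)
    except Conflict:
        # Option 4: a state conflict means the job already moved on,
        # so treat this completion as stale instead of calling Fail.
        return False


job = Job()
first = job.submit()    # invocation that is later superseded
job.fail()
job.retry()
second = job.submit()   # latest invocation

accepted = on_wasm_success(job, second)     # advances to Running
stale = on_wasm_success(job, first)         # discarded by the ID check
duplicate = on_wasm_success(job, second)    # 409 path, swallowed
final_state = job.state
```

The ID check handles completions from superseded invocations, while the 409 handling covers the remaining case of a duplicate delivery from the current invocation. In both cases the running session is left alone.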
Reproduction
# Create 10 CurationJobs, configure all, submit all simultaneously
# Some will race and produce the SessionSpawned-from-Running 409
Environment
- Railway production deployment
- CurationJob entity with build_session_message WASM integration
- 10 concurrent job submissions