[POP-3781] fix: make SQS-modification persistence flow fail-safe#2159
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 331a2577a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| self.process_message_(&message, request_type, batch_metadata) | ||
| .await?; | ||
| self.delete_message(&sqs_message).await?; |
There was a problem hiding this comment.
Handle post-processing delete failures without aborting
When DeleteMessage returns an error after process_message_ has already persisted/queued the request, this ? aborts poll_exact_messages and receive_batch; receive_batch_stream then stops on the Err before the assembled BatchQuery is sent for execution. In that transient SQS failure scenario the modification row has been inserted but will never be completed by this batch, and the still-visible SQS message can be redelivered and inserted again, leaving stale/duplicate in-progress modifications. The old ordering failed before mutating local/DB state, so this post-processing ack failure needs different handling, e.g. do not discard the batch after successful processing just because the ack failed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit 1a5af37c9
The post-processing DeleteMessage call is now best-effort: if it transiently fails after process_message_/process_batch_message has already returned Ok, we log tracing::error! + bump a new sqs.delete_failure_after_processing counter and continue, rather than propagate. The modification is already persisted at that point, and downstream appliers are idempotent (modifications are keyed by request_id and apply as full-row overwrites — pre-existing assumption from peer-modification-sync). Visibility-timeout redelivery handles the retry on the SQS side.
There was a problem hiding this comment.
Don't we actually want this case to cause a crash and restart so that modifications can be synced again?
As far as I could track, this is already what's happening: if we could persist modification into the db but fail to delete from SQS, we propagate the error up, SQS receive loop is closed, and server tears down, causing a restart on all nodes
There was a problem hiding this comment.
Exactly right — and I had it the wrong way around in my first reply on this thread. The strict ? cascade IS the trigger for the restart-cleanup path: receive_batch_stream tears down → node restart → modifications_sync GCs orphan IN_PROGRESS rows on startup → redelivered SQS message is re-ingested cleanly.
Without strict propagation, the orphan would stick around (modifications_sync only runs on startup) and the redelivered message would produce a parallel COMPLETED row plus a duplicate SNS result. So the cascade isn't just a circuit breaker — it's what causes the cleanup to fire.
Code comment block at the top of process_message is updated in the latest force-push (19a404d24) to spell this out instead of referencing the (now-cancelled) POP-3897. Thanks for the catch.
331a257 to
1a5af37
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a5af37c90
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Err(e) = self.delete_message(&sqs_message).await { | ||
| tracing::error!( | ||
| "SQS DeleteMessage failed after successful processing (type={}): {:?}", | ||
| request_type, | ||
| e | ||
| ); | ||
| metrics::counter!("sqs.delete_failure_after_processing", "type" => request_type.to_string()) | ||
| .increment(1); |
There was a problem hiding this comment.
Don't swallow post-persist SQS delete failures
When DeleteMessage fails after a request has been persisted, this path now returns Ok(()), so the same SQS message can become visible again and be processed a second time. The redelivery is not idempotent at the DB boundary: Store::insert_modification unconditionally inserts a new modifications row (iris-mpc-store/src/lib.rs:491-519), and the table has no request-id key to collapse duplicates, so a transient SQS delete failure can create duplicate modifications/results for the same enrollment/update instead of surfacing an error and stopping before another batch is accepted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
hmm that's right. Verified: Store::insert_modification at iris-mpc-store/src/lib.rs:491-519 is plain INSERT ... VALUES with no ON CONFLICT, and the modifications table schema has neither a request_id column nor a UNIQUE constraint on (serial_id, request_type). The SQS message body carries request_id but it's never threaded into the modifications table.
Resolution in commit 31d08eeb7 (force-pushed):
- Reverted the best-effort wrap —
DeleteMessageis back on strict?propagation. The cascade tear-down ofreceive_batch_streamon ack failure is now intentional, retained as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. The duplicate window narrows from "every redelivered message creates a duplicate" to "one in-flight message at the moment ack failed, then ingestion stops until the stream restarts." - Follow-up filed: POP-3897 — Make modifications-table inserts idempotent on SQS request_id. Proper fix is
request_id TEXT NOT NULL UNIQUE(orUNIQUE (serial_id, request_type)) +INSERT ... ON CONFLICT DO NOTHING+ plumbing throughStore::insert_modificationfrom the SQS body.
1a5af37 to
31d08ee
Compare
| // receive_batch breaks the spawn loop). That tear-down is | ||
| // intentional in this PR: the modifications table is not yet | ||
| // idempotent at the DB boundary (plain INSERT, no request_id | ||
| // key — see POP-3897), so swallowing the ack failure and | ||
| // continuing would let visibility-timeout redelivery pile up | ||
| // duplicate modification rows. The cascade tear-down acts as a | ||
| // circuit breaker against duplicate accumulation while SQS is | ||
| // unhealthy. Once POP-3897 lands, this comment block and the | ||
| // strict propagation can be revisited. |
There was a problem hiding this comment.
is this still valid, I see POP-3897 being cancelled? 🤔
There was a problem hiding this comment.
Rewrote the comment block in 19a404d24 to drop the POP-3897 link and replace the 'circuit breaker against duplicate accumulation' framing with the actual restart-cleanup mechanism (matches the thread above)
Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok. The previous ordering acked the SQS message before the modification was durably persisted, so a crash in that window silently dropped the message. After this change, Err from inner processing propagates without acking, so SQS visibility- timeout redelivery retries the message. DeleteMessage itself uses strict `?` propagation. If the ack fails after processing has succeeded, the Err tears down receive_batch_stream (batch.rs:116) — intentional, because the modifications table is not idempotent at the DB boundary (plain INSERT, no request_id key, see iris-mpc-store/src/lib.rs:491-519). The cascade acts as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. Proper modifications-table idempotency is tracked in POP-3897. Two adjacent hardenings on the same persistence flow: - delete_message helper: receipt_handle.unwrap() -> ok_or_else with a new ReceiveRequestError::FailedToMarkRequestAsDeleted variant - modifications_sync.rs (peer-sync apply path): panic!/unwrap() on unknown modification type, missing s3_url, and JoinHandle Result converted to typed Err returns. No happy-path behavior change. Cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main shape; the rest of #2007 remains parked on ps/cont-rerand. Linear: https://linear.app/worldcoin/issue/POP-3781 Follow-up: https://linear.app/worldcoin/issue/POP-3897
31d08ee to
19a404d
Compare
Summary
Move
delete_messageinBatchProcessor::process_messagefrom before the per-type processing branch to after it returnsOk, plus two small adjacent hardenings on the same persistence flow.The bug (POP-3781)
Today,
process_messagedeletes the SQS message immediately after parsing its message-type attribute, before dispatching to the inner processing branch. If the process crashes — or returnsErr— between that ack and the modification being durably written to the DB, the message is lost: SQS will not redeliver it, and the DB never received it.What's in this PR
Delete-after-persist ordering in
BatchProcessor::process_message— both the SNS-batching path (#[cfg(feature = "explicit-sns-batching")]) and the normal path now ack only afterprocess_batch_message?/process_message_?returnsOk. The core fix.delete_messagehelper hardening —sqs_message.receipt_handle.as_ref().unwrap()→ok_or_else(...)returning a newReceiveRequestError::FailedToMarkRequestAsDeleted(Report)variant. The unwrap was effectively safe (SQS receive always populatesreceipt_handle), but this function is now invoked at a more critical moment, so converting the panic into a typed error is warranted hygiene.modifications_sync.rshygiene (peer-modification-sync apply path) — same conceptual concern (don't take the node down on bad input):modification.clone().s3_url.unwrap()→.s3_url.clone().ok_or_else(|| eyre!(...)).await?.unwrap()on the awaited JoinHandle →.await??(propagates the innerReportinstead of panicking)panic!("Unknown modification type")→Err(eyre!(...))typed returnErrto its caller instead of taking the node down.Why strict
?propagation onDeleteMessageIf the SQS
DeleteMessagecall fails afterprocess_message_has already returnedOk, the error propagates rather than being swallowed. The cascade tear-down ofreceive_batch_stream(batch.rs:116— anyErrfromreceive_batchbreaks the spawn loop) triggers a node restart, at which pointmodifications_synccleans up any orphanIN_PROGRESSrows that didn't complete their batch. The redelivered SQS message is then re-ingested cleanly. Net: zero duplicate rows, one SNS result emitted.Swallowing the ack error (best-effort
if let Errwrap) would not be safer here —modifications_synconly runs on startup, so the orphanIN_PROGRESSrow would persist in the running node, and the redelivered message would create a second row that also completes, producing duplicateCOMPLETEDrows and duplicate SNS results.Behavior diff (
process_message)Err(DB failure mid-write)process_identity_match_check_requestsends SNS error response and returns Ok_match arm)Ok(())returned, ack happens at tailDeleteMessagefailure after successful persistErrpropagates → stream tear-down → restart-cleanup of orphan IN_PROGRESS row → redelivery → single fresh processingProvenance
Items 1, 2, and 3 are cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current
main's shape — PR 2007's pre-image had a single inlinematch request_type, whilemainhas since been refactored intoprocess_message→process_message_/process_batch_message(via #2054), so this is not a clean cherry-pick.The rest of #2007 (rerand.rs, s3_coordination.rs, epoch.rs, continuous_rerand.rs, new migrations, deploy values, spec doc, e2e tests) remains parked on
ps/cont-rerand.Test plan
Local against
main(origin/mainat16872f075), Rust 1.95.0 — matches CI:cargo fmt --all -- --check— cleancargo clippy --workspace --all-targets --all-features -- -D warnings --no-deps— cleancargo check -p iris-mpc— cleancargo check -p iris-mpc --features explicit-sns-batching— cleanCI will run the same lints + unit tests + integration tests.
Risk
Low — 3 files, +38 / -7. Behavior changes are strict improvements: "errors no longer eat messages" (item 1), "bad peer-sync input no longer panics the node" (item 3). The cascade tear-down on transient
DeleteMessagefailure is intentional — it's what triggers the cleanup path that prevents duplicate state.