[POP-3781] fix: make SQS-modification persistence flow fail-safe by carlomazzaferro · Pull Request #2159 · worldcoin/iris-mpc

carlomazzaferro · 2026-05-27T06:41:34Z

Summary

Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok, plus two small adjacent hardenings on the same persistence flow.

The bug (POP-3781)

Today, process_message deletes the SQS message immediately after parsing its message-type attribute, before dispatching to the inner processing branch. If the process crashes — or returns Err — between that ack and the modification being durably written to the DB, the message is lost: SQS will not redeliver it, and the DB never received it.

What's in this PR

Delete-after-persist ordering in BatchProcessor::process_message — both the SNS-batching path (#[cfg(feature = "explicit-sns-batching")]) and the normal path now ack only after process_batch_message? / process_message_? returns Ok. The core fix.
delete_message helper hardening — sqs_message.receipt_handle.as_ref().unwrap() → ok_or_else(...) returning a new ReceiveRequestError::FailedToMarkRequestAsDeleted(Report) variant. The unwrap was effectively safe (SQS receive always populates receipt_handle), but this function is now invoked at a more critical moment, so converting the panic into a typed error is warranted hygiene.
modifications_sync.rs hygiene (peer-modification-sync apply path) — same conceptual concern (don't take the node down on bad input):
- modification.clone().s3_url.unwrap() → .s3_url.clone().ok_or_else(|| eyre!(...))
- .await?.unwrap() on the awaited JoinHandle → .await?? (propagates the inner Report instead of panicking)
- panic!("Unknown modification type") → Err(eyre!(...)) typed return
- No behavior change on the happy path; on the unhappy path, peer-sync now returns Err to its caller instead of taking the node down.

Why strict `?` propagation on `DeleteMessage`

If the SQS DeleteMessage call fails after process_message_ has already returned Ok, the error propagates rather than being swallowed. The cascade tear-down of receive_batch_stream (batch.rs:116 — any Err from receive_batch breaks the spawn loop) triggers a node restart, at which point modifications_sync cleans up any orphan IN_PROGRESS rows that didn't complete their batch. The redelivered SQS message is then re-ingested cleanly. Net: zero duplicate rows, one SNS result emitted.

Swallowing the ack error (best-effort if let Err wrap) would not be safer here — modifications_sync only runs on startup, so the orphan IN_PROGRESS row would persist in the running node, and the redelivered message would create a second row that also completes, producing duplicate COMPLETED rows and duplicate SNS results.

Behavior diff (`process_message`)

Scenario	Before	After
Happy path (Ok)	ack before processing, then process	process, then ack
Inner returns `Err` (DB failure mid-write)	message lost (acked, then bailed)	message redelivered (not acked, error propagates)
Disabled recovery / reset check	acked, then `process_identity_match_check_request` sends SNS error response and returns Ok	unchanged — disabled path still returns Ok, ack happens at tail
Invalid message type (`_` match arm)	acked, then logged-and-Ok	unchanged — `Ok(())` returned, ack happens at tail
Process crashes between receive and persist	message lost	message redelivered
Transient `DeleteMessage` failure after successful persist	n/a (ack was first)	`Err` propagates → stream tear-down → restart-cleanup of orphan IN_PROGRESS row → redelivery → single fresh processing

Provenance

Items 1, 2, and 3 are cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main's shape — PR 2007's pre-image had a single inline match request_type, while main has since been refactored into process_message → process_message_ / process_batch_message (via #2054), so this is not a clean cherry-pick.

The rest of #2007 (rerand.rs, s3_coordination.rs, epoch.rs, continuous_rerand.rs, new migrations, deploy values, spec doc, e2e tests) remains parked on ps/cont-rerand.

Test plan

Local against main (origin/main at 16872f075), Rust 1.95.0 — matches CI:

cargo fmt --all -- --check — clean
cargo clippy --workspace --all-targets --all-features -- -D warnings --no-deps — clean
cargo check -p iris-mpc — clean
cargo check -p iris-mpc --features explicit-sns-batching — clean

CI will run the same lints + unit tests + integration tests.

Risk

Low — 3 files, +38 / -7. Behavior changes are strict improvements: "errors no longer eat messages" (item 1), "bad peer-sync input no longer panics the node" (item 3). The cascade tear-down on transient DeleteMessage failure is intentional — it's what triggers the cleanup path that prevents duplicate state.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 331a2577a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T06:44:06Z


        self.process_message_(&message, request_type, batch_metadata)
            .await?;
+        self.delete_message(&sqs_message).await?;


Handle post-processing delete failures without aborting

When DeleteMessage returns an error after process_message_ has already persisted/queued the request, this ? aborts poll_exact_messages and receive_batch; receive_batch_stream then stops on the Err before the assembled BatchQuery is sent for execution. In that transient SQS failure scenario the modification row has been inserted but will never be completed by this batch, and the still-visible SQS message can be redelivered and inserted again, leaving stale/duplicate in-progress modifications. The old ordering failed before mutating local/DB state, so this post-processing ack failure needs different handling, e.g. do not discard the batch after successful processing just because the ack failed.

Useful? React with 👍 / 👎.

Addressed in commit 1a5af37c9

The post-processing DeleteMessage call is now best-effort: if it transiently fails after process_message_/process_batch_message has already returned Ok, we log tracing::error! + bump a new sqs.delete_failure_after_processing counter and continue, rather than propagate. The modification is already persisted at that point, and downstream appliers are idempotent (modifications are keyed by request_id and apply as full-row overwrites — pre-existing assumption from peer-modification-sync). Visibility-timeout redelivery handles the retry on the SQS side.

Don't we actually want this case to cause a crash and restart so that modifications can be synced again?

As far as I could track, this is already what's happening: if we could persist modification into the db but fail to delete from SQS, we propagate the error up, SQS receive loop is closed, and server tears down, causing a restart on all nodes

Exactly right — and I had it the wrong way around in my first reply on this thread. The strict ? cascade IS the trigger for the restart-cleanup path: receive_batch_stream tears down → node restart → modifications_sync GCs orphan IN_PROGRESS rows on startup → redelivered SQS message is re-ingested cleanly.

Without strict propagation, the orphan would stick around (modifications_sync only runs on startup) and the redelivered message would produce a parallel COMPLETED row plus a duplicate SNS result. So the cascade isn't just a circuit breaker — it's what causes the cleanup to fire.

Code comment block at the top of process_message is updated in the latest force-push (19a404d24) to spell this out instead of referencing the (now-cancelled) POP-3897. Thanks for the catch.

carlomazzaferro · 2026-05-27T07:04:42Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a5af37c90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T07:07:07Z

+        if let Err(e) = self.delete_message(&sqs_message).await {
+            tracing::error!(
+                "SQS DeleteMessage failed after successful processing (type={}): {:?}",
+                request_type,
+                e
+            );
+            metrics::counter!("sqs.delete_failure_after_processing", "type" => request_type.to_string())
+                .increment(1);


Don't swallow post-persist SQS delete failures

When DeleteMessage fails after a request has been persisted, this path now returns Ok(()), so the same SQS message can become visible again and be processed a second time. The redelivery is not idempotent at the DB boundary: Store::insert_modification unconditionally inserts a new modifications row (iris-mpc-store/src/lib.rs:491-519), and the table has no request-id key to collapse duplicates, so a transient SQS delete failure can create duplicate modifications/results for the same enrollment/update instead of surfacing an error and stopping before another batch is accepted.

Useful? React with 👍 / 👎.

hmm that's right. Verified: Store::insert_modification at iris-mpc-store/src/lib.rs:491-519 is plain INSERT ... VALUES with no ON CONFLICT, and the modifications table schema has neither a request_id column nor a UNIQUE constraint on (serial_id, request_type). The SQS message body carries request_id but it's never threaded into the modifications table.

Resolution in commit 31d08eeb7 (force-pushed):

Reverted the best-effort wrap — DeleteMessage is back on strict ? propagation. The cascade tear-down of receive_batch_stream on ack failure is now intentional, retained as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. The duplicate window narrows from "every redelivered message creates a duplicate" to "one in-flight message at the moment ack failed, then ingestion stops until the stream restarts."

Follow-up filed: POP-3897 — Make modifications-table inserts idempotent on SQS request_id. Proper fix is request_id TEXT NOT NULL UNIQUE (or UNIQUE (serial_id, request_type)) + INSERT ... ON CONFLICT DO NOTHING + plumbing through Store::insert_modification from the SQS body.

eaypek-tfh · 2026-05-27T12:05:24Z

+        // receive_batch breaks the spawn loop). That tear-down is
+        // intentional in this PR: the modifications table is not yet
+        // idempotent at the DB boundary (plain INSERT, no request_id
+        // key — see POP-3897), so swallowing the ack failure and
+        // continuing would let visibility-timeout redelivery pile up
+        // duplicate modification rows. The cascade tear-down acts as a
+        // circuit breaker against duplicate accumulation while SQS is
+        // unhealthy. Once POP-3897 lands, this comment block and the
+        // strict propagation can be revisited.


is this still valid, I see POP-3897 being cancelled? 🤔

Rewrote the comment block in 19a404d24 to drop the POP-3897 link and replace the 'circuit breaker against duplicate accumulation' framing with the actual restart-cleanup mechanism (matches the thread above)

@philsippl

Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok. The previous ordering acked the SQS message before the modification was durably persisted, so a crash in that window silently dropped the message. After this change, Err from inner processing propagates without acking, so SQS visibility- timeout redelivery retries the message. DeleteMessage itself uses strict `?` propagation. If the ack fails after processing has succeeded, the Err tears down receive_batch_stream (batch.rs:116) — intentional, because the modifications table is not idempotent at the DB boundary (plain INSERT, no request_id key, see iris-mpc-store/src/lib.rs:491-519). The cascade acts as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. Proper modifications-table idempotency is tracked in POP-3897. Two adjacent hardenings on the same persistence flow: - delete_message helper: receipt_handle.unwrap() -> ok_or_else with a new ReceiveRequestError::FailedToMarkRequestAsDeleted variant - modifications_sync.rs (peer-sync apply path): panic!/unwrap() on unknown modification type, missing s3_url, and JoinHandle Result converted to typed Err returns. No happy-path behavior change. Cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main shape; the rest of #2007 remains parked on ps/cont-rerand. Linear: https://linear.app/worldcoin/issue/POP-3781 Follow-up: https://linear.app/worldcoin/issue/POP-3897

github-actions Bot added the patch label May 27, 2026

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

carlomazzaferro force-pushed the pop-3781-sqs-delete-after-persist branch from 331a257 to 1a5af37 Compare May 27, 2026 06:57

carlomazzaferro changed the title ~~[POP-3781] fix: delete SQS message only after successful processing~~ [POP-3781] fix: make SQS-modification persistence flow fail-safe May 27, 2026

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

carlomazzaferro force-pushed the pop-3781-sqs-delete-after-persist branch from 1a5af37 to 31d08ee Compare May 27, 2026 07:18

eaypek-tfh reviewed May 27, 2026

View reviewed changes

eaypek-tfh previously approved these changes May 27, 2026

View reviewed changes

carlomazzaferro dismissed eaypek-tfh’s stale review via 19a404d May 27, 2026 14:28

carlomazzaferro force-pushed the pop-3781-sqs-delete-after-persist branch from 31d08ee to 19a404d Compare May 27, 2026 14:28

Merge branch 'main' into pop-3781-sqs-delete-after-persist

13e27d2

eaypek-tfh approved these changes May 27, 2026

View reviewed changes

carlomazzaferro merged commit dfe8727 into main May 27, 2026
21 checks passed

carlomazzaferro deleted the pop-3781-sqs-delete-after-persist branch May 27, 2026 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[POP-3781] fix: make SQS-modification persistence flow fail-safe#2159

[POP-3781] fix: make SQS-modification persistence flow fail-safe#2159
carlomazzaferro merged 2 commits into
mainfrom
pop-3781-sqs-delete-after-persist

carlomazzaferro commented May 27, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Uh oh!

carlomazzaferro May 27, 2026 •

edited

Loading

Uh oh!

eaypek-tfh May 27, 2026

Uh oh!

carlomazzaferro May 27, 2026

Uh oh!

carlomazzaferro commented May 27, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Uh oh!

carlomazzaferro May 27, 2026 •

edited

Loading

Uh oh!

eaypek-tfh May 27, 2026

Uh oh!

carlomazzaferro May 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

carlomazzaferro commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The bug (POP-3781)

What's in this PR

Why strict ? propagation on DeleteMessage

Behavior diff (process_message)

Provenance

Test plan

Risk

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

carlomazzaferro May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaypek-tfh May 27, 2026

Choose a reason for hiding this comment

Uh oh!

carlomazzaferro May 27, 2026

Choose a reason for hiding this comment

Uh oh!

carlomazzaferro commented May 27, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

carlomazzaferro May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaypek-tfh May 27, 2026

Choose a reason for hiding this comment

Uh oh!

carlomazzaferro May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

carlomazzaferro commented May 27, 2026 •

edited

Loading

Why strict `?` propagation on `DeleteMessage`

Behavior diff (`process_message`)

carlomazzaferro May 27, 2026 •

edited

Loading

carlomazzaferro May 27, 2026 •

edited

Loading

carlomazzaferro May 27, 2026 •

edited

Loading