Skip to content

[POP-3781] fix: make SQS-modification persistence flow fail-safe#2159

Merged
carlomazzaferro merged 2 commits into
mainfrom
pop-3781-sqs-delete-after-persist
May 27, 2026
Merged

[POP-3781] fix: make SQS-modification persistence flow fail-safe#2159
carlomazzaferro merged 2 commits into
mainfrom
pop-3781-sqs-delete-after-persist

Conversation

@carlomazzaferro
Copy link
Copy Markdown
Collaborator

@carlomazzaferro carlomazzaferro commented May 27, 2026

Summary

Move delete_message in BatchProcessor::process_message from before the per-type processing branch to after it returns Ok, plus two small adjacent hardenings on the same persistence flow.

The bug (POP-3781)

Today, process_message deletes the SQS message immediately after parsing its message-type attribute, before dispatching to the inner processing branch. If the process crashes — or returns Err — between that ack and the modification being durably written to the DB, the message is lost: SQS will not redeliver it, and the DB never received it.

What's in this PR

  1. Delete-after-persist ordering in BatchProcessor::process_message — both the SNS-batching path (#[cfg(feature = "explicit-sns-batching")]) and the normal path now ack only after process_batch_message? / process_message_? returns Ok. The core fix.

  2. delete_message helper hardeningsqs_message.receipt_handle.as_ref().unwrap()ok_or_else(...) returning a new ReceiveRequestError::FailedToMarkRequestAsDeleted(Report) variant. The unwrap was effectively safe (SQS receive always populates receipt_handle), but this function is now invoked at a more critical moment, so converting the panic into a typed error is warranted hygiene.

  3. modifications_sync.rs hygiene (peer-modification-sync apply path) — same conceptual concern (don't take the node down on bad input):

    • modification.clone().s3_url.unwrap().s3_url.clone().ok_or_else(|| eyre!(...))
    • .await?.unwrap() on the awaited JoinHandle → .await?? (propagates the inner Report instead of panicking)
    • panic!("Unknown modification type")Err(eyre!(...)) typed return
    • No behavior change on the happy path; on the unhappy path, peer-sync now returns Err to its caller instead of taking the node down.

Why strict ? propagation on DeleteMessage

If the SQS DeleteMessage call fails after process_message_ has already returned Ok, the error propagates rather than being swallowed. The cascade tear-down of receive_batch_stream (batch.rs:116 — any Err from receive_batch breaks the spawn loop) triggers a node restart, at which point modifications_sync cleans up any orphan IN_PROGRESS rows that didn't complete their batch. The redelivered SQS message is then re-ingested cleanly. Net: zero duplicate rows, one SNS result emitted.

Swallowing the ack error (best-effort if let Err wrap) would not be safer here — modifications_sync only runs on startup, so the orphan IN_PROGRESS row would persist in the running node, and the redelivered message would create a second row that also completes, producing duplicate COMPLETED rows and duplicate SNS results.

Behavior diff (process_message)

Scenario Before After
Happy path (Ok) ack before processing, then process process, then ack
Inner returns Err (DB failure mid-write) message lost (acked, then bailed) message redelivered (not acked, error propagates)
Disabled recovery / reset check acked, then process_identity_match_check_request sends SNS error response and returns Ok unchanged — disabled path still returns Ok, ack happens at tail
Invalid message type (_ match arm) acked, then logged-and-Ok unchanged — Ok(()) returned, ack happens at tail
Process crashes between receive and persist message lost message redelivered
Transient DeleteMessage failure after successful persist n/a (ack was first) Err propagates → stream tear-down → restart-cleanup of orphan IN_PROGRESS row → redelivery → single fresh processing

Provenance

Items 1, 2, and 3 are cherry-picked equivalents of changes in #2007 (continuous-rerand branch, @philsippl), translated to current main's shape — PR 2007's pre-image had a single inline match request_type, while main has since been refactored into process_messageprocess_message_ / process_batch_message (via #2054), so this is not a clean cherry-pick.

The rest of #2007 (rerand.rs, s3_coordination.rs, epoch.rs, continuous_rerand.rs, new migrations, deploy values, spec doc, e2e tests) remains parked on ps/cont-rerand.

Test plan

Local against main (origin/main at 16872f075), Rust 1.95.0 — matches CI:

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings --no-deps — clean
  • cargo check -p iris-mpc — clean
  • cargo check -p iris-mpc --features explicit-sns-batching — clean

CI will run the same lints + unit tests + integration tests.

Risk

Low — 3 files, +38 / -7. Behavior changes are strict improvements: "errors no longer eat messages" (item 1), "bad peer-sync input no longer panics the node" (item 3). The cascade tear-down on transient DeleteMessage failure is intentional — it's what triggers the cleanup path that prevents duplicate state.

@github-actions github-actions Bot added the patch label May 27, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 331a2577a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


self.process_message_(&message, request_type, batch_metadata)
.await?;
self.delete_message(&sqs_message).await?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle post-processing delete failures without aborting

When DeleteMessage returns an error after process_message_ has already persisted/queued the request, this ? aborts poll_exact_messages and receive_batch; receive_batch_stream then stops on the Err before the assembled BatchQuery is sent for execution. In that transient SQS failure scenario the modification row has been inserted but will never be completed by this batch, and the still-visible SQS message can be redelivered and inserted again, leaving stale/duplicate in-progress modifications. The old ordering failed before mutating local/DB state, so this post-processing ack failure needs different handling, e.g. do not discard the batch after successful processing just because the ack failed.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

@carlomazzaferro carlomazzaferro May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1a5af37c9

The post-processing DeleteMessage call is now best-effort: if it transiently fails after process_message_/process_batch_message has already returned Ok, we log tracing::error! + bump a new sqs.delete_failure_after_processing counter and continue, rather than propagate. The modification is already persisted at that point, and downstream appliers are idempotent (modifications are keyed by request_id and apply as full-row overwrites — pre-existing assumption from peer-modification-sync). Visibility-timeout redelivery handles the retry on the SQS side.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we actually want this case to cause a crash and restart so that modifications can be synced again?

As far as I could track, this is already what's happening: if we could persist modification into the db but fail to delete from SQS, we propagate the error up, SQS receive loop is closed, and server tears down, causing a restart on all nodes

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly right — and I had it the wrong way around in my first reply on this thread. The strict ? cascade IS the trigger for the restart-cleanup path: receive_batch_stream tears down → node restart → modifications_sync GCs orphan IN_PROGRESS rows on startup → redelivered SQS message is re-ingested cleanly.

Without strict propagation, the orphan would stick around (modifications_sync only runs on startup) and the redelivered message would produce a parallel COMPLETED row plus a duplicate SNS result. So the cascade isn't just a circuit breaker — it's what causes the cleanup to fire.

Code comment block at the top of process_message is updated in the latest force-push (19a404d24) to spell this out instead of referencing the (now-cancelled) POP-3897. Thanks for the catch.

@carlomazzaferro carlomazzaferro force-pushed the pop-3781-sqs-delete-after-persist branch from 331a257 to 1a5af37 Compare May 27, 2026 06:57
@carlomazzaferro carlomazzaferro changed the title [POP-3781] fix: delete SQS message only after successful processing [POP-3781] fix: make SQS-modification persistence flow fail-safe May 27, 2026
@carlomazzaferro
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a5af37c90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +462 to +469
if let Err(e) = self.delete_message(&sqs_message).await {
tracing::error!(
"SQS DeleteMessage failed after successful processing (type={}): {:?}",
request_type,
e
);
metrics::counter!("sqs.delete_failure_after_processing", "type" => request_type.to_string())
.increment(1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't swallow post-persist SQS delete failures

When DeleteMessage fails after a request has been persisted, this path now returns Ok(()), so the same SQS message can become visible again and be processed a second time. The redelivery is not idempotent at the DB boundary: Store::insert_modification unconditionally inserts a new modifications row (iris-mpc-store/src/lib.rs:491-519), and the table has no request-id key to collapse duplicates, so a transient SQS delete failure can create duplicate modifications/results for the same enrollment/update instead of surfacing an error and stopping before another batch is accepted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

@carlomazzaferro carlomazzaferro May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm that's right. Verified: Store::insert_modification at iris-mpc-store/src/lib.rs:491-519 is plain INSERT ... VALUES with no ON CONFLICT, and the modifications table schema has neither a request_id column nor a UNIQUE constraint on (serial_id, request_type). The SQS message body carries request_id but it's never threaded into the modifications table.

Resolution in commit 31d08eeb7 (force-pushed):

  1. Reverted the best-effort wrapDeleteMessage is back on strict ? propagation. The cascade tear-down of receive_batch_stream on ack failure is now intentional, retained as a circuit breaker against duplicate-row accumulation while SQS is unhealthy. The duplicate window narrows from "every redelivered message creates a duplicate" to "one in-flight message at the moment ack failed, then ingestion stops until the stream restarts."
  2. Follow-up filed: POP-3897 — Make modifications-table inserts idempotent on SQS request_id. Proper fix is request_id TEXT NOT NULL UNIQUE (or UNIQUE (serial_id, request_type)) + INSERT ... ON CONFLICT DO NOTHING + plumbing through Store::insert_modification from the SQS body.

@carlomazzaferro carlomazzaferro force-pushed the pop-3781-sqs-delete-after-persist branch from 1a5af37 to 31d08ee Compare May 27, 2026 07:18
Comment on lines +440 to +448
// receive_batch breaks the spawn loop). That tear-down is
// intentional in this PR: the modifications table is not yet
// idempotent at the DB boundary (plain INSERT, no request_id
// key — see POP-3897), so swallowing the ack failure and
// continuing would let visibility-timeout redelivery pile up
// duplicate modification rows. The cascade tear-down acts as a
// circuit breaker against duplicate accumulation while SQS is
// unhealthy. Once POP-3897 lands, this comment block and the
// strict propagation can be revisited.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still valid, I see POP-3897 being cancelled? 🤔

Copy link
Copy Markdown
Collaborator Author

@carlomazzaferro carlomazzaferro May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the comment block in 19a404d24 to drop the POP-3897 link and replace the 'circuit breaker against duplicate accumulation' framing with the actual restart-cleanup mechanism (matches the thread above)

eaypek-tfh
eaypek-tfh previously approved these changes May 27, 2026
Move delete_message in BatchProcessor::process_message from before the
per-type processing branch to after it returns Ok. The previous ordering
acked the SQS message before the modification was durably persisted, so
a crash in that window silently dropped the message. After this change,
Err from inner processing propagates without acking, so SQS visibility-
timeout redelivery retries the message.

DeleteMessage itself uses strict `?` propagation. If the ack fails
after processing has succeeded, the Err tears down receive_batch_stream
(batch.rs:116) — intentional, because the modifications table is not
idempotent at the DB boundary (plain INSERT, no request_id key, see
iris-mpc-store/src/lib.rs:491-519). The cascade acts as a circuit
breaker against duplicate-row accumulation while SQS is unhealthy.
Proper modifications-table idempotency is tracked in POP-3897.

Two adjacent hardenings on the same persistence flow:
- delete_message helper: receipt_handle.unwrap() -> ok_or_else with a
  new ReceiveRequestError::FailedToMarkRequestAsDeleted variant
- modifications_sync.rs (peer-sync apply path): panic!/unwrap() on
  unknown modification type, missing s3_url, and JoinHandle Result
  converted to typed Err returns. No happy-path behavior change.

Cherry-picked equivalents of changes in #2007 (continuous-rerand
branch, @philsippl), translated to current main shape; the rest of
#2007 remains parked on ps/cont-rerand.

Linear: https://linear.app/worldcoin/issue/POP-3781
Follow-up: https://linear.app/worldcoin/issue/POP-3897
@carlomazzaferro carlomazzaferro merged commit dfe8727 into main May 27, 2026
21 checks passed
@carlomazzaferro carlomazzaferro deleted the pop-3781-sqs-delete-after-persist branch May 27, 2026 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants