Skip to content

Fix FFM-1520: cover direct first-meme nudge sends#306

Merged
ohld merged 6 commits into
productionfrom
fix/ffm-1520-first-nudge-direct-send
Jun 12, 2026
Merged

Fix FFM-1520: cover direct first-meme nudge sends#306
ohld merged 6 commits into
productionfrom
fix/ffm-1520-first-nudge-direct-send

Conversation

@ohld

@ohld ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member

Fixes FFM-1520.

Root cause

Production current-week data showed 17 delivered new users, with 3 delivered users missing first_meme_nudge assignment. The three misses were all direct-send paths that bypass the original next_message() first-meme assignment path:

  • 2 first deliveries from cold_start_explore via broadcast/direct send
  • 1 first delivery from share_link

Those paths call send_meme_to_user(), which recorded user_meme_reaction without assigning the first-meme nudge experiment for confirmed first deliveries.

Change

  • Assigns first-meme nudge cohort in send_meme_to_user() only after send_new_message_with_meme() succeeds, and before _record_delivered_meme_reaction() records the measured delivery.
  • Defers optional nudge delivery outside the broadcast 20s meme-send timeout, then waits briefly for those tasks without cancelling ambiguous in-flight Telegram sends.
  • Keeps nudge lease cleanup cancellation-safe so ambiguous send outcomes cannot duplicate first-meme nudges.
  • Shields post-Telegram delivery recording so accepted direct sends are still recorded even if the broadcast timeout cancels the per-user task.
  • Adds regression coverage for direct sender ordering, failed direct sends not assigning cohorts, broadcast nudge draining, cancellation cleanup, and retry behavior.

This is the production-targeted version of the reviewed PR #305 fix, applied on top of current origin/production without unrelated local comms/experiment workspace changes.

Verification

  • ruff check --fix src/ tests/ && ruff format src/ tests/
  • git diff --check origin/production...HEAD
  • DATABASE_URL=postgresql+asyncpg://app:app@app_db:5432/app TELEGRAM_BOT_TOKEN=123456:ABCdefghijklmnopqrstuvwxyz pytest tests/tgbot/test_meme_sender.py tests/flows/test_broadcasts_meme.py tests/tgbot/test_first_meme_nudge.py::test_send_cancellation_keeps_lease tests/tgbot/test_first_meme_nudge.py::test_delivery_helper_retries_only_existing_treatment_assignment tests/recommendations/test_pipeline.py::test_compact_diagnostics_exclude_candidate_ids

Result: 14 passed.

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: CHANGES REQUESTED — one blocking concurrency issue.

[P2] src/flows/broadcasts/meme.py:46-53 + src/tgbot/senders/popups.py:296-301: the broadcast drain cancels pending first-meme nudge tasks after 10s, and the nudge sender releases the popup lease on CancelledError. If Telegram accepted the message but the coroutine is cancelled before returning, the lease is deleted and a later retry can send the same first-meme nudge again. That breaks the popup single-fire invariant in the slow/ambiguous Telegram-send path.

Please adjust the timeout/cancellation behavior so an ambiguous in-flight send cannot release the lease and duplicate the nudge.

Verification I ran: focused local test slice passed, 12 passed (tests/tgbot/test_meme_sender.py, tests/flows/test_broadcasts_meme.py, selected first-meme nudge tests, selected pipeline diagnostics test); git diff --check origin/production...HEAD passed; no secret/token additions found in changed files.

/codex review also flagged this as [P2] Avoid cancelling in-flight nudge sends. Auto-merge was disabled before posting this change request.

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: CHANGES REQUESTED — Direct first-meme nudge cohort assignment currently happens before Telegram delivery is confirmed. If the first direct send fails before a meme is delivered, the user can still be inserted into first_meme_nudge and counted in v_experiment_results without receiving the measured first meme. Move get_first_meme_nudge_variant_to_send() to after confirmed send_new_message_with_meme() success but before _record_delivered_meme_reaction(), and add regression coverage for a failed direct send not assigning the cohort.

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: CHANGES REQUESTED — two blocking post-delivery edge cases remain.\n\n[P2] src/tgbot/senders/meme.py:74-78: after Telegram accepts a direct send, cancellation can still land while awaiting get_first_meme_nudge_variant_to_send(), before _record_delivered_meme_reaction() is reached. The shield only protects the later insert, so a delivered meme can still miss user_meme_reaction and later be duplicated or fail reaction updates. Record the delivery before cancellable nudge assignment work, or otherwise make the whole post-delivery section cancellation-safe.\n\n[P2] src/tgbot/senders/meme.py:73-78: send_new_message_with_meme() returns None on Forbidden after marking the user blocked. The new nudge path treats that as a delivered meme and can assign the first-meme experiment / popup lease for an undelivered blocked-user send. Stop before nudge/reaction work when Telegram rejected the meme.\n\nVerification: auto-merge disabled before this comment; git diff --check origin/production...HEAD passed; secret scan on changed files clean; focused pytest with explicit env passed (14 passed): tests/tgbot/test_meme_sender.py, tests/flows/test_broadcasts_meme.py, selected first_meme_nudge tests, selected pipeline diagnostics test. /codex review flagged both items as [P2].

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: CHANGES REQUESTED

Two production issues in the direct-send timeout path need fixing before this can land:

  1. src/tgbot/senders/meme.py:95-107: after Telegram accepts the meme, _complete_direct_meme_delivery() runs get_first_meme_nudge_variant_to_send() before _record_delivered_meme_reaction(). That helper performs DB reads/inserts through get_experiment_variant() / assign_experiment(). If that post-send experiment assignment fails, the accepted delivery is never inserted into user_meme_reaction, so reaction callbacks/dedupe/stats can miss a meme the user already received. Cancellation is shielded, but ordinary exceptions in the assignment step still lose the delivery row.

  2. src/flows/broadcasts/meme.py:94 plus src/tgbot/senders/meme.py:77-113: when the 20s broadcast timeout cancels _send_to_user() after Telegram accepted the meme but while the shielded delivery task is still assigning/recording, first_meme_nudge_tasks can still be empty when _drain_first_meme_nudge_tasks() runs. The delivery task can append the nudge task later, after the drain returned, leaving the nudge task untracked/unawaited by the broadcast flow. That can drop or hide the treatment nudge in the exact slow path this change is trying to cover.

Verification performed:

  • gh pr view 306 --json state,mergedAt,closedAt: OPEN
  • reviewed git diff origin/production...origin/pr-306-review
  • git diff --check origin/production...origin/pr-306-review: clean
  • focused secret/raw-SQL/security scan: no new secrets, raw SQL, auth, payments, uploads, moderator, integration, or webhook changes found
  • codex review: independently flagged the broadcast orphaned-nudge timeout race

Please keep the invariant simple: once Telegram has accepted the meme, delivery recording must be impossible to skip due to first-nudge experiment work, and any deferred nudge spawned after a timeout must still have an owner/drain/logging path.

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: CHANGES REQUESTED — Issues found:

[P1] src/tgbot/senders/meme.py:103-107 records user_meme_reaction before assigning the first-meme nudge cohort. Existing code explicitly requires assignment before the measured delivery/reaction path because v_experiment_results joins only r.reacted_at >= ea.assigned_at; this can still drop the first direct-send reaction from the experiment this PR is meant to fix. The PR body also says assignment happens before _record_delivered_meme_reaction, but the final diff does the opposite.

[P2] src/flows/broadcasts/meme.py:31-42 only attaches a callback for nudge tasks appended after the drain closes and does not retain or await them. If the broadcast flow exits soon after, those late tasks can be cancelled with the loop, dropping treatment nudges in the timeout path.

Please keep the post-Telegram-success guard, but restore the invariant that first-meme cohort assignment happens before recording the delivery row, and make late broadcast nudge tasks owned long enough to finish or fail observably.

@ohld

ohld commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

⚠️ Staff Engineer review blocked: manual review of the current head found no new code blocker and local verification passed, but the mandatory /codex review gate did not produce a verdict in this runtime.\n\nEvidence:\n- PR state precheck: OPEN, head 9e7ae0f.\n- Manual diff review: no new SQL interpolation, secrets, auth/payment/upload/moderator/integration/webhook changes, or recommendation blender weight changes found. /cso skipped because the diff did not touch its trigger surfaces.\n- Local verification: git diff --check origin/production...HEAD passed; targeted pytest slice passed (18 passed): tests/tgbot/test_meme_sender.py, tests/flows/test_broadcasts_meme.py, selected first_meme_nudge tests, and selected pipeline diagnostics test.\n- GitHub CI: lint and test are green.\n- /codex review attempt 1: timed out after 330s after bwrap namespace errors.\n- /codex review attempt 2 with sandbox_mode=danger-full-access: timed out after 330s without emitting [P1]/[P2] findings.\n\nNo approval or merge queued because /codex review is a required Staff Engineer gate. Unblock action: rerun Staff Engineer review when Codex can return a verdict, or have ohld explicitly decide to bypass the Codex gate for this PR.

@ohld

ohld commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

STAFF ENGINEER REVIEW: APPROVED — manual review clean; /codex review returned no discrete correctness issue; targeted local tests passed (18 passed); GitHub lint/test are green; no new secrets, SQL interpolation, cso-trigger surfaces, or blender weight changes found.

@ohld ohld merged commit 735becd into production Jun 12, 2026
3 checks passed
@ohld

ohld commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

✅ Approved + merged. Staff Engineer fallback approval posted because GitHub blocks self-review; /codex review returned no discrete correctness issue; CI was green; squash auto-merge merged immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant