fix: voicemail ghost call loop - IVR repeating for 49 minutes#338
Conversation
… stripe-retry, rate-limit - archiveRecordingToStorage: 21 tests covering happy path, extension detection, upload failure, signedUrl failure, call_update_failures insert path, failure insert error logging, and missing callRecord skip - health-checks.ts: 23 tests (0% -> 100%) — checkTelnyxCCA, checkTelnyxCredentialConnection, checkSupabaseReachability, EXPECTED_WEBHOOK_URL derivation, all error/timeout paths - stripe-retry.ts: 11 tests (41% -> 100%) — isStripeConnectionError, withStripeRetry happy path, connection error passthrough, 429 retry with retry-after cap, status fallback - rate-limit.ts: 12 tests added (64% -> 93%) — isRateLimited, peekRateLimit, getRateLimitRetryAfterSecs, refundRateLimit, clearRateLimit Total: 975 -> 1030 tests (+55), all passing
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe PR enhances agent presence and heartbeat system by adding visual connection status indicators, refining heartbeat transport failure handling, and adjusting the session presence restore logic. These changes improve connection state visibility and error differentiation across the presence infrastructure. Changes
Sequence DiagramsequenceDiagram
participant User as User/Client
participant Header as Header Component
participant Hook as usePresence Hook
participant Storage as Session Storage
participant Fetch as Fetch API
participant Primary as Primary Heartbeat
participant Fallback as Fallback Endpoint
participant DB as Supabase
rect rgba(100, 150, 200, 0.5)
Note over User,DB: Foreground Heartbeat with Transport Failure Handling
User->>Header: App in focus (visible)
Header->>Hook: sendHeartbeat() triggered
Hook->>Fetch: try: POST /api/agents/heartbeat
alt Transport failure
Fetch-->>Hook: throw (network error)
Hook->>Hook: primaryTransportFailed = true
Hook->>Fallback: POST /api/agents/heartbeat/fallback
Fallback->>DB: attempt agent_presence touch
DB-->>Fallback: error (500 response)
Fallback-->>Hook: error response
else Successful response
Primary->>DB: update agent_presence
DB-->>Primary: success
Primary-->>Hook: OK response
end
Hook->>Hook: Update connectionState badge
Hook->>Header: Render status (online/degraded)
end
rect rgba(150, 100, 200, 0.5)
Note over User,DB: Background Heartbeat (sendBeacon)
User->>Header: App in background (hidden)
Header->>Hook: sendHeartbeat() triggered
Hook->>Fetch: navigator.sendBeacon (no await)
Hook->>Hook: Return immediately
Note over Hook: No timestamp updates, no broadcast
end
rect rgba(150, 200, 100, 0.5)
Note over User,DB: Session-Based Presence Restore
User->>Hook: Page refresh/session restore
Hook->>Storage: Read session storage
Storage-->>Hook: Restore data + age
Hook->>Hook: Check: age < RESTORE_MAX_AGE_MS
alt Valid restore
Hook->>Hook: Restore from session (no recency gate)
Hook->>Header: connectionState = restored status
else Invalid
Hook->>Hook: connectionState = offline
end
end
rect rgba(200, 150, 100, 0.5)
Note over User,Header: User Reconnection Action
Header->>User: Display degraded/offline banner
User->>Header: Click "Click here to reconnect"
Header->>Hook: reconnectPresence()
Hook->>Primary: POST /api/agents/heartbeat
Primary-->>Hook: success
Hook->>Header: Update connectionState = online
Header->>User: Hide banner, show online badge
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a071ca6bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const stillStuck = Boolean(callRow && !callRow.ended_at && !callRow.recording_url && !callRow.bridged_at) | ||
| if (stillStuck) { | ||
| console.error('[Telnyx] Voicemail watchdog fired - no recording after prompt, force hangup:', { | ||
| callId, | ||
| callControlId, | ||
| }) | ||
| await hangupCall(callControlId).catch(() => {}) |
There was a problem hiding this comment.
Avoid terminating active voicemail recordings after 30s
The new watchdog in handleSpeakEnded treats !recording_url after 30 seconds as a stuck voicemail and force-hangs up the call, but recording_url is only populated later in handleRecordingSaved after the recording completes. In practice, any caller who speaks for longer than ~30 seconds (or any delayed recording.saved webhook) will be cut off mid-message even though record_start succeeded. This introduces a production regression for normal long voicemails.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/__tests__/health-checks-lib.test.ts`:
- Around line 146-154: The test contains a dead branch: in the it.each block
testing checkTelnyxCredentialConnection the conditional if (!response)
expect(mockFetch).not.toHaveBeenCalled() can never run because every case
supplies a truthy response; remove this unreachable assertion or instead add a
test case with response: null if the intended behavior is to skip fetch when env
is missing. Update the test around checkTelnyxCredentialConnection (and the
related mockFetch setup) to either delete the !response expectation or add an
explicit null-response case to exercise the branch.
In `@src/app/api/agents/heartbeat/fallback/route.ts`:
- Around line 26-31: The handler currently maps a failed update to
agent_presence (the admin.from('agent_presence').update(...) call) to a 400 via
touchError and NextResponse.json; change this to return a 5xx (e.g., status:
500) so server-side write failures are classified as server errors. Update the
branch that checks touchError (the touchError variable) to return
NextResponse.json({ error: 'Internal error' }, { status: 500 }) (or another
appropriate 5xx) instead of 400 so retry/alerting logic treats it as a backend
failure.
In `@src/hooks/usePresence.ts`:
- Around line 259-275: The heartbeat code in usePresence.ts currently calls the
fallback endpoint for any non-OK response; change it to only call the fallback
when the primary fetch failed due to a transport error/timeout (i.e., the fetch
threw/caught and res is null) or optionally for explicit gateway 5xx statuses
(e.g., 502/503/504) if you want to allow selected server errors; update the
logic around the variables res, fallbackRes, heartbeatOk, and lastStatus so
that: if res?.ok set heartbeatOk=true, else set lastStatus=res?.status and only
invoke fetch('/api/agents/heartbeat/fallback', ...) when res is null (or when
res.status is in the configured gateway-5xx list); do not call the fallback for
ordinary 4xx/other 5xx responses.
- Around line 28-30: The code currently treats navigator.sendBeacon() success as
a confirmed heartbeat; change this so sharedLastApiHeartbeatAt and the online
flag are only updated after a real foreground fetch returns a 2xx response.
Specifically, stop setting sharedLastApiHeartbeatAt and online when sendBeacon()
returns true; instead perform a fetch (or await visibility-appropriate request)
to the heartbeat endpoint and only on response.ok update
sharedLastApiHeartbeatAt, set online = true, and dispatch the shared DOM event
(so other instances see a confirmed heartbeat). Ensure any fallback/queueing
from sendBeacon() does not mark the app online.
In `@src/lib/webhooks/handlers.ts`:
- Around line 2913-2916: The TTL guard is using the current webhook leg's
callControlId which can be the agent leg (causing agents to be hung up) instead
of targeting the inbound call; update the enforceUnbridgedCallTtl call in the
handleSpeakEnded path to use state.inbound_call_control_id when present (fall
back to callControlId if not) so the TTL check and subsequent hangup apply to
the inbound leg; adjust the call to enforceUnbridgedCallTtl (and any derived
log/context string such as `handleSpeakEnded:${String(state.action ||
'unknown')}`) to reference state.inbound_call_control_id where available.
- Around line 1512-1522: The voicemail-marker write must be treated as critical:
check the result (or catch errors) from updateCallRecordWithRetry when called
with match: { id: callId } and context `${context}:mark_voicemail_pending`, and
if the write failed (falsy result or caught error) abort the voicemail flow
instead of continuing; e.g., log the failure and return/stop so
recoverIvrToQueueOnSpeakEnd cannot rely on an unwritten breadcrumb (do not
proceed to play the voicemail prompt or set in-memory flags when
updateCallRecordWithRetry fails).
- Around line 3008-3034: The watchdog currently assumes absence of recording_url
means a stuck call and forces hangup in the setTimeout block (the stillStuck
check using ended_at, recording_url, bridged_at), which cuts off long
recordings; either extend the timeout to exceed the configured recording
max_length plus a buffer (e.g., replace the 30000 ms delay with maxLengthMs +
buffer) or change the logic to rely on explicit
recording-start/recording-complete signals (store a
recording_started/recording_completed flag in the calls row and check that
instead of recording_url) before calling hangupCall(callControlId); update the
stillStuck computation and timeout invocation in the watchdog setTimeout
accordingly so valid long voicemails are not terminated prematurely.
In `@supabase/migrations/20260401161000_presence_stale_2x_heartbeat.sql`:
- Around line 1-2: The migration comment and threshold in
presence_stale_2x_heartbeat.sql claim a "2x heartbeat (60s)" cutoff but the cron
that invokes the sweep (src/app/api/cron/sweep-presence/route.ts) is scheduled
every 2 minutes in vercel.json, so agents will be swept ~60–179s after last
heartbeat; update either the cron or the migration to match: either change
vercel.json to run the sweep every minute (cron "* * * * *") so the SQL's 2x
heartbeat semantics are accurate, or change the migration comment and threshold
logic in presence_stale_2x_heartbeat.sql to reflect a wider cutoff (e.g., 4x
heartbeat or explicit 120s+) and ensure any code reading that constant
(sweep-presence route) uses the same value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5623d576-1062-4543-a3d5-8caf3bb4030d
📒 Files selected for processing (10)
src/__tests__/archive-recording.test.tssrc/__tests__/health-checks-lib.test.tssrc/__tests__/stripe-retry-lib.test.tssrc/app/api/agents/heartbeat/fallback/route.tssrc/app/api/agents/heartbeat/route.tssrc/components/layout/header.tsxsrc/hooks/usePresence.tssrc/lib/presence/drain-queue-trigger.tssrc/lib/webhooks/handlers.tssupabase/migrations/20260401161000_presence_stale_2x_heartbeat.sql
Findings addressed or minor - merging.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 427fdb7102
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| heartbeatOk = true | ||
| } else { | ||
| lastStatus = fallbackRes?.status ?? lastStatus | ||
| if (primaryTransportFailed) { |
There was a problem hiding this comment.
Run fallback heartbeat on non-2xx primary responses
The fallback heartbeat is now gated behind primaryTransportFailed, so it only runs when fetch throws and is skipped for normal HTTP failures like 500/503. That removes the recovery path this code previously used for server-side primary endpoint errors, causing avoidable heartbeat failures and eventually forcing agents into degraded/offline state even when /api/agents/heartbeat/fallback could still succeed.
Useful? React with 👍 / 👎.
| parsed.status !== 'offline' && | ||
| ageMs <= PRESENCE_RESTORE_MAX_AGE_MS && | ||
| heartbeatAgeMs <= HEARTBEAT_RECENCY_MS | ||
| heartbeatAgeMs <= PRESENCE_RESTORE_MAX_AGE_MS |
There was a problem hiding this comment.
Keep reload restore heartbeat window above 30s interval
The reload restore check now requires heartbeatAgeMs <= PRESENCE_RESTORE_MAX_AGE_MS (15s), but heartbeats are emitted every 30s. If the unload beacon is dropped or delayed, a normal healthy session can have a 15–30s old heartbeat and fail the restore gate, leaving the agent unexpectedly offline after refresh. The previous 60s recency window avoided this mismatch with the heartbeat cadence.
Useful? React with 👍 / 👎.
* test: full audit coverage — archiveRecordingToStorage, health-checks, stripe-retry, rate-limit - archiveRecordingToStorage: 21 tests covering happy path, extension detection, upload failure, signedUrl failure, call_update_failures insert path, failure insert error logging, and missing callRecord skip - health-checks.ts: 23 tests (0% -> 100%) — checkTelnyxCCA, checkTelnyxCredentialConnection, checkSupabaseReachability, EXPECTED_WEBHOOK_URL derivation, all error/timeout paths - stripe-retry.ts: 11 tests (41% -> 100%) — isStripeConnectionError, withStripeRetry happy path, connection error passthrough, 429 retry with retry-after cap, status fallback - rate-limit.ts: 12 tests added (64% -> 93%) — isRateLimited, peekRateLimit, getRateLimitRetryAfterSecs, refundRateLimit, clearRateLimit Total: 975 -> 1030 tests (+55), all passing * fix: address CodeRabbit test quality findings on PR #336 * fix: strengthen test assertions per CodeRabbit review on PR #336 * fix: add missing assertions in archive-recording tests * Harden agent presence heartbeat and surface offline state * fix: address CodeRabbit findings on PR #337 * fix presence stale-check error handling and codrabbit test findings * fix: final test assertion hardening per CodeRabbit * Fix ghost voicemail loops and add hangup safeguards * fix: resolve CodeRabbit findings for presence and voicemail flow
Fixes ghost call where IVR repeated for 49min with no hangup. Root cause: speak.ended recovery re-routed voicemail calls back to ACD creating infinite loop. Adds voicemail loop prevention, 30s watchdog timeout, record_start failure handling, and broader 10min TTL safety net.
Summary by CodeRabbit
New Features
Bug Fixes
Chores