fix: voicemail ghost call loop - IVR repeating for 49 minutes #338

Merged
coygg merged 12 commits into master from fix/voicemail-ghost-call
Apr 2, 2026

Conversation

coygg (Collaborator) commented Apr 2, 2026

Fixes a ghost call in which the IVR repeated for 49 minutes with no hangup. Root cause: the speak.ended recovery path re-routed voicemail calls back to the ACD, creating an infinite loop. Adds voicemail loop prevention, a 30-second watchdog timeout, record_start failure handling, and a broader 10-minute TTL safety net.
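
The loop prevention described above can be sketched as a pure decision function. This is a minimal illustration under stated assumptions, not the code in this PR: the `CallState` shape, its field names, and `decideSpeakEndedAction` are hypothetical. Only the 10-minute TTL and the rule that voicemail calls must not return to the queue come from the description.

```typescript
// Hypothetical call state; field names are assumptions for illustration only.
interface CallState {
  voicemailPending: boolean; // set before the voicemail prompt is played
  bridgedAt: number | null;  // epoch ms when the call was bridged to an agent
  startedAt: number;         // epoch ms when the inbound call started
}

type SpeakEndedAction = "requeue" | "start_recording" | "hangup" | "none";

// 10-minute safety net mentioned in the PR description.
const UNBRIDGED_TTL_MS = 10 * 60 * 1000;

function decideSpeakEndedAction(state: CallState, nowMs: number): SpeakEndedAction {
  // A bridged call is live with an agent, so there is nothing to recover.
  if (state.bridgedAt !== null) return "none";
  // Any unbridged call older than the TTL is hung up regardless of its flags.
  if (nowMs - state.startedAt > UNBRIDGED_TTL_MS) return "hangup";
  // A call already sent to voicemail must never be routed back to the queue.
  if (state.voicemailPending) return "start_recording";
  return "requeue";
}
```

Keeping the decision separate from the webhook side effects makes the loop condition easy to unit test: a voicemail call can only ever reach `start_recording` or `hangup`.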

Summary by CodeRabbit

  • New Features

    • Added connection status indicator (online/degraded/offline) in the header
    • Added reconnect button when connection is compromised
  • Bug Fixes

    • Improved heartbeat transport resilience with fallback error handling
    • Enhanced presence session restoration logic
  • Chores

    • Optimized presence cleanup background task schedule

coygg added 10 commits March 26, 2026 08:24
… stripe-retry, rate-limit

- archiveRecordingToStorage: 21 tests covering happy path, extension detection,
  upload failure, signedUrl failure, call_update_failures insert path,
  failure insert error logging, and missing callRecord skip
- health-checks.ts: 23 tests (0% -> 100%) — checkTelnyxCCA, checkTelnyxCredentialConnection,
  checkSupabaseReachability, EXPECTED_WEBHOOK_URL derivation, all error/timeout paths
- stripe-retry.ts: 11 tests (41% -> 100%) — isStripeConnectionError, withStripeRetry
  happy path, connection error passthrough, 429 retry with retry-after cap, status fallback
- rate-limit.ts: 12 tests added (64% -> 93%) — isRateLimited, peekRateLimit,
  getRateLimitRetryAfterSecs, refundRateLimit, clearRateLimit

Total: 975 -> 1030 tests (+55), all passing

coderabbitai Bot commented Apr 2, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 13948cd7-a751-4f5d-a1b2-9b8ca79783a7

📥 Commits

Reviewing files that changed from the base of the PR and between 1a071ca and 427fdb7.

📒 Files selected for processing (4)
  • src/__tests__/health-checks-lib.test.ts
  • src/app/api/agents/heartbeat/fallback/route.ts
  • src/hooks/usePresence.ts
  • vercel.json

📝 Walkthrough

Walkthrough

The PR enhances agent presence and heartbeat system by adding visual connection status indicators, refining heartbeat transport failure handling, and adjusting the session presence restore logic. These changes improve connection state visibility and error differentiation across the presence infrastructure.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Presence Heartbeat Logic: src/hooks/usePresence.ts, src/app/api/agents/heartbeat/fallback/route.ts | Modified heartbeat background behavior to use navigator.sendBeacon without updating shared timestamps; foreground heartbeats now distinguish transport failures via try/catch and only attempt fallback endpoint on transport error. Session restore logic simplified by removing heartbeat recency gate. Fallback route error response changed from 400 to 500 status. |
| Header UI & Presence: src/components/layout/header.tsx | Integrated usePresence hook to display connection status badge (online/degraded/offline variants) next to user role. Added conditional banner informing users of degraded/offline connection with "Click here to reconnect" button. Restructured JSX from single header root to fragment to accommodate new banner element. |
| Configuration & Cron: vercel.json | Updated presence sweep cron schedule from every 2 minutes (*/2 * * * *) to every 1 minute (* * * * *). |
| Test Coverage: src/__tests__/health-checks-lib.test.ts | Removed assertion verifying mockFetch is not called when test scenario lacks mock HTTP response, reducing coverage for "missing/invalid prerequisites" cases. |
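
The heartbeat behavior summarized in the table can be sketched roughly as follows. This is an illustration under assumptions, not the hook's actual code: `HeartbeatDeps`, `HeartbeatResult`, and this `sendHeartbeat` signature are hypothetical, the transports are injected so the sketch runs outside a browser, and the beacon target URL is assumed to be the primary endpoint.

```typescript
// Injected transports (all hypothetical names) so the logic is testable.
interface HeartbeatDeps {
  isHidden: () => boolean;              // e.g. document.visibilityState === "hidden"
  sendBeacon: (url: string) => boolean; // e.g. a wrapper around navigator.sendBeacon
  post: (url: string) => Promise<{ ok: boolean; status: number }>; // e.g. a fetch POST
}

interface HeartbeatResult {
  confirmed: boolean;    // true only after a real 2xx response
  usedFallback: boolean; // true when the fallback endpoint was attempted
}

async function sendHeartbeat(deps: HeartbeatDeps): Promise<HeartbeatResult> {
  if (deps.isHidden()) {
    // Background: fire and forget. A queued beacon is not a confirmed heartbeat,
    // so no timestamps or online flags should be updated from it.
    deps.sendBeacon("/api/agents/heartbeat");
    return { confirmed: false, usedFallback: false };
  }
  try {
    const res = await deps.post("/api/agents/heartbeat");
    // An HTTP error response is not a transport failure, so no fallback here.
    return { confirmed: res.ok, usedFallback: false };
  } catch {
    // Transport failure (the request threw): try the fallback endpoint once.
    try {
      const fallbackRes = await deps.post("/api/agents/heartbeat/fallback");
      return { confirmed: fallbackRes.ok, usedFallback: true };
    } catch {
      return { confirmed: false, usedFallback: true };
    }
  }
}
```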

Sequence Diagram

sequenceDiagram
    participant User as User/Client
    participant Header as Header Component
    participant Hook as usePresence Hook
    participant Storage as Session Storage
    participant Fetch as Fetch API
    participant Primary as Primary Heartbeat
    participant Fallback as Fallback Endpoint
    participant DB as Supabase

    rect rgba(100, 150, 200, 0.5)
        Note over User,DB: Foreground Heartbeat with Transport Failure Handling
        User->>Header: App in focus (visible)
        Header->>Hook: sendHeartbeat() triggered
        Hook->>Fetch: try: POST /api/agents/heartbeat
        alt Transport failure
            Fetch-->>Hook: throw (network error)
            Hook->>Hook: primaryTransportFailed = true
            Hook->>Fallback: POST /api/agents/heartbeat/fallback
            Fallback->>DB: attempt agent_presence touch
            DB-->>Fallback: error (500 response)
            Fallback-->>Hook: error response
        else Successful response
            Primary->>DB: update agent_presence
            DB-->>Primary: success
            Primary-->>Hook: OK response
        end
        Hook->>Hook: Update connectionState badge
        Hook->>Header: Render status (online/degraded)
    end

    rect rgba(150, 100, 200, 0.5)
        Note over User,DB: Background Heartbeat (sendBeacon)
        User->>Header: App in background (hidden)
        Header->>Hook: sendHeartbeat() triggered
        Hook->>Fetch: navigator.sendBeacon (no await)
        Hook->>Hook: Return immediately
        Note over Hook: No timestamp updates, no broadcast
    end

    rect rgba(150, 200, 100, 0.5)
        Note over User,DB: Session-Based Presence Restore
        User->>Hook: Page refresh/session restore
        Hook->>Storage: Read session storage
        Storage-->>Hook: Restore data + age
        Hook->>Hook: Check: age < RESTORE_MAX_AGE_MS
        alt Valid restore
            Hook->>Hook: Restore from session (no recency gate)
            Hook->>Header: connectionState = restored status
        else Invalid
            Hook->>Hook: connectionState = offline
        end
    end

    rect rgba(200, 150, 100, 0.5)
        Note over User,Header: User Reconnection Action
        Header->>User: Display degraded/offline banner
        User->>Header: Click "Click here to reconnect"
        Header->>Hook: reconnectPresence()
        Hook->>Primary: POST /api/agents/heartbeat
        Primary-->>Hook: success
        Hook->>Header: Update connectionState = online
        Header->>User: Hide banner, show online badge
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 A heartbeat skips when signals fail,
But fear not—we display a tale!
Green badge gleams when all is well,
And "reconnect" breaks the offline spell.
Connection blooms, presence restored with grace! 💚

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.31% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main problem being fixed: a ghost call issue where an IVR repeats for 49 minutes due to a voicemail loop. |



vercel Bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policyjar | Ready | Preview, Comment | Apr 2, 2026 8:12pm |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a071ca6bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/lib/webhooks/handlers.ts (Outdated)
Comment on lines +3019 to +3025

    const stillStuck = Boolean(callRow && !callRow.ended_at && !callRow.recording_url && !callRow.bridged_at)
    if (stillStuck) {
      console.error('[Telnyx] Voicemail watchdog fired - no recording after prompt, force hangup:', {
        callId,
        callControlId,
      })
      await hangupCall(callControlId).catch(() => {})

[P1] Avoid terminating active voicemail recordings after 30s

The new watchdog in handleSpeakEnded treats !recording_url after 30 seconds as a stuck voicemail and force-hangs up the call, but recording_url is only populated later in handleRecordingSaved after the recording completes. In practice, any caller who speaks for longer than ~30 seconds (or any delayed recording.saved webhook) will be cut off mid-message even though record_start succeeded. This introduces a production regression for normal long voicemails.
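
One way to avoid the cutoff described above is to key the watchdog on an explicit recording-start signal rather than on `recording_url`. The sketch below is hypothetical: the `recording_started_at` column, `isStuckAfterPrompt`, and `watchdogDelayMs` do not appear in the PR, and it assumes the handler can persist a timestamp when `record_start` succeeds.

```typescript
// Subset of the calls row used by the watchdog. recording_started_at is a
// hypothetical column introduced here for illustration; the others appear in the diff.
interface CallRow {
  ended_at: string | null;
  bridged_at: string | null;
  recording_url: string | null;
  recording_started_at: string | null;
}

// A call counts as stuck only if nothing has happened since the prompt:
// not ended, not bridged, no saved recording, and no recording in progress.
function isStuckAfterPrompt(row: CallRow | null): boolean {
  if (!row) return false;
  if (row.ended_at || row.bridged_at || row.recording_url) return false;
  return row.recording_started_at === null;
}

// Alternative: keep the recording_url check but wait longer than the
// configured maximum recording length plus a buffer for webhook delay.
function watchdogDelayMs(maxRecordingSecs: number, bufferMs: number = 15000): number {
  return maxRecordingSecs * 1000 + bufferMs;
}
```

With the first function, a caller who is still speaking at the 30-second mark has `recording_started_at` set and is left alone; only calls where recording never began are hung up.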


coderabbitai Bot previously requested changes Apr 2, 2026

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/__tests__/health-checks-lib.test.ts`:
- Around line 146-154: The test contains a dead branch: in the it.each block
testing checkTelnyxCredentialConnection the conditional if (!response)
expect(mockFetch).not.toHaveBeenCalled() can never run because every case
supplies a truthy response; remove this unreachable assertion or instead add a
test case with response: null if the intended behavior is to skip fetch when env
is missing. Update the test around checkTelnyxCredentialConnection (and the
related mockFetch setup) to either delete the !response expectation or add an
explicit null-response case to exercise the branch.

In `@src/app/api/agents/heartbeat/fallback/route.ts`:
- Around line 26-31: The handler currently maps a failed update to
agent_presence (the admin.from('agent_presence').update(...) call) to a 400 via
touchError and NextResponse.json; change this to return a 5xx (e.g., status:
500) so server-side write failures are classified as server errors. Update the
branch that checks touchError (the touchError variable) to return
NextResponse.json({ error: 'Internal error' }, { status: 500 }) (or another
appropriate 5xx) instead of 400 so retry/alerting logic treats it as a backend
failure.

In `@src/hooks/usePresence.ts`:
- Around line 259-275: The heartbeat code in usePresence.ts currently calls the
fallback endpoint for any non-OK response; change it to only call the fallback
when the primary fetch failed due to a transport error/timeout (i.e., the fetch
threw/caught and res is null) or optionally for explicit gateway 5xx statuses
(e.g., 502/503/504) if you want to allow selected server errors; update the
logic around the variables res, fallbackRes, heartbeatOk, and lastStatus so
that: if res?.ok set heartbeatOk=true, else set lastStatus=res?.status and only
invoke fetch('/api/agents/heartbeat/fallback', ...) when res is null (or when
res.status is in the configured gateway-5xx list); do not call the fallback for
ordinary 4xx/other 5xx responses.
- Around line 28-30: The code currently treats navigator.sendBeacon() success as
a confirmed heartbeat; change this so sharedLastApiHeartbeatAt and the online
flag are only updated after a real foreground fetch returns a 2xx response.
Specifically, stop setting sharedLastApiHeartbeatAt and online when sendBeacon()
returns true; instead perform a fetch (or await visibility-appropriate request)
to the heartbeat endpoint and only on response.ok update
sharedLastApiHeartbeatAt, set online = true, and dispatch the shared DOM event
(so other instances see a confirmed heartbeat). Ensure any fallback/queueing
from sendBeacon() does not mark the app online.

In `@src/lib/webhooks/handlers.ts`:
- Around line 2913-2916: The TTL guard is using the current webhook leg's
callControlId which can be the agent leg (causing agents to be hung up) instead
of targeting the inbound call; update the enforceUnbridgedCallTtl call in the
handleSpeakEnded path to use state.inbound_call_control_id when present (fall
back to callControlId if not) so the TTL check and subsequent hangup apply to
the inbound leg; adjust the call to enforceUnbridgedCallTtl (and any derived
log/context string such as `handleSpeakEnded:${String(state.action ||
'unknown')}`) to reference state.inbound_call_control_id where available.
- Around line 1512-1522: The voicemail-marker write must be treated as critical:
check the result (or catch errors) from updateCallRecordWithRetry when called
with match: { id: callId } and context `${context}:mark_voicemail_pending`, and
if the write failed (falsy result or caught error) abort the voicemail flow
instead of continuing; e.g., log the failure and return/stop so
recoverIvrToQueueOnSpeakEnd cannot rely on an unwritten breadcrumb (do not
proceed to play the voicemail prompt or set in-memory flags when
updateCallRecordWithRetry fails).
- Around line 3008-3034: The watchdog currently assumes absence of recording_url
means a stuck call and forces hangup in the setTimeout block (the stillStuck
check using ended_at, recording_url, bridged_at), which cuts off long
recordings; either extend the timeout to exceed the configured recording
max_length plus a buffer (e.g., replace the 30000 ms delay with maxLengthMs +
buffer) or change the logic to rely on explicit
recording-start/recording-complete signals (store a
recording_started/recording_completed flag in the calls row and check that
instead of recording_url) before calling hangupCall(callControlId); update the
stillStuck computation and timeout invocation in the watchdog setTimeout
accordingly so valid long voicemails are not terminated prematurely.

In `@supabase/migrations/20260401161000_presence_stale_2x_heartbeat.sql`:
- Around line 1-2: The migration comment and threshold in
presence_stale_2x_heartbeat.sql claim a "2x heartbeat (60s)" cutoff but the cron
that invokes the sweep (src/app/api/cron/sweep-presence/route.ts) is scheduled
every 2 minutes in vercel.json, so agents will be swept ~60–179s after last
heartbeat; update either the cron or the migration to match: either change
vercel.json to run the sweep every minute (cron "* * * * *") so the SQL's 2x
heartbeat semantics are accurate, or change the migration comment and threshold
logic in presence_stale_2x_heartbeat.sql to reflect a wider cutoff (e.g., 4x
heartbeat or explicit 120s+) and ensure any code reading that constant
(sweep-presence route) uses the same value.
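
The timing arithmetic behind the last finding can be checked with a small helper. This is a sketch with hypothetical names (`SweepWindow`, `sweepWindow`) and it ignores cron jitter and execution time: an agent becomes stale once the threshold elapses, and the next cron tick arrives somewhere within one interval after that.

```typescript
interface SweepWindow {
  minSecs: number;          // earliest an agent can be swept after its last heartbeat
  maxExclusiveSecs: number; // the sweep happens strictly before this many seconds
}

// The agent is eligible once staleThresholdSecs have passed; the sweep itself
// runs at the next cron tick, which is less than one interval later.
function sweepWindow(staleThresholdSecs: number, cronIntervalSecs: number): SweepWindow {
  return {
    minSecs: staleThresholdSecs,
    maxExclusiveSecs: staleThresholdSecs + cronIntervalSecs,
  };
}

// 60s threshold with a 2-minute cron: swept between 60s and just under 180s.
const everyTwoMinutes = sweepWindow(60, 120);
// 60s threshold with a 1-minute cron: swept between 60s and just under 120s.
const everyMinute = sweepWindow(60, 60);
```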

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5623d576-1062-4543-a3d5-8caf3bb4030d

📥 Commits

Reviewing files that changed from the base of the PR and between c7a59d7 and 1a071ca.

📒 Files selected for processing (10)
  • src/__tests__/archive-recording.test.ts
  • src/__tests__/health-checks-lib.test.ts
  • src/__tests__/stripe-retry-lib.test.ts
  • src/app/api/agents/heartbeat/fallback/route.ts
  • src/app/api/agents/heartbeat/route.ts
  • src/components/layout/header.tsx
  • src/hooks/usePresence.ts
  • src/lib/presence/drain-queue-trigger.ts
  • src/lib/webhooks/handlers.ts
  • supabase/migrations/20260401161000_presence_stale_2x_heartbeat.sql

Comment thread src/__tests__/health-checks-lib.test.ts
Comment thread src/app/api/agents/heartbeat/fallback/route.ts Outdated
Comment thread src/hooks/usePresence.ts
Comment thread src/hooks/usePresence.ts Outdated
Comment thread src/lib/webhooks/handlers.ts Outdated
Comment thread src/lib/webhooks/handlers.ts
Comment thread src/lib/webhooks/handlers.ts Outdated
Comment thread supabase/migrations/20260401161000_presence_stale_2x_heartbeat.sql
@coygg coygg dismissed coderabbitai[bot]’s stale review April 2, 2026 20:11

Findings addressed or minor - merging.

@coygg coygg merged commit f78147c into master Apr 2, 2026
3 of 4 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 427fdb7102


Comment thread src/hooks/usePresence.ts
      heartbeatOk = true
    } else {
      lastStatus = fallbackRes?.status ?? lastStatus
      if (primaryTransportFailed) {

[P1] Run fallback heartbeat on non-2xx primary responses

The fallback heartbeat is now gated behind primaryTransportFailed, so it only runs when fetch throws and is skipped for normal HTTP failures like 500/503. That removes the recovery path this code previously used for server-side primary endpoint errors, causing avoidable heartbeat failures and eventually forcing agents into degraded/offline state even when /api/agents/heartbeat/fallback could still succeed.
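
The two review positions on this thread (fallback on transport errors only, versus fallback on any non-2xx response) can both be expressed as one policy function with a configurable status list. The names `PrimaryOutcome`, `GATEWAY_STATUSES`, and `shouldTryFallback` are hypothetical, and the default list of 502/503/504 follows the earlier CodeRabbit suggestion rather than anything in the merged code.

```typescript
// Outcome of the primary heartbeat request.
type PrimaryOutcome =
  | { kind: "transport_error" }            // the request threw or timed out
  | { kind: "response"; status: number };  // an HTTP response was received

// Gateway-style statuses where the fallback route may still succeed.
const GATEWAY_STATUSES: ReadonlySet<number> = new Set([502, 503, 504]);

function shouldTryFallback(
  outcome: PrimaryOutcome,
  retryStatuses: ReadonlySet<number> = GATEWAY_STATUSES,
): boolean {
  if (outcome.kind === "transport_error") return true;
  // Ordinary 4xx responses and unlisted 5xx responses do not trigger the fallback.
  return retryStatuses.has(outcome.status);
}
```

Passing an empty set reproduces the transport-only behavior criticized above, while a set that also contains 500 restores the recovery path for server-side failures of the primary endpoint.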


Comment thread src/hooks/usePresence.ts
      parsed.status !== 'offline' &&
      ageMs <= PRESENCE_RESTORE_MAX_AGE_MS &&
    - heartbeatAgeMs <= HEARTBEAT_RECENCY_MS
    + heartbeatAgeMs <= PRESENCE_RESTORE_MAX_AGE_MS

[P2] Keep reload restore heartbeat window above 30s interval

The reload restore check now requires heartbeatAgeMs <= PRESENCE_RESTORE_MAX_AGE_MS (15s), but heartbeats are emitted every 30s. If the unload beacon is dropped or delayed, a normal healthy session can have a 15–30s old heartbeat and fail the restore gate, leaving the agent unexpectedly offline after refresh. The previous 60s recency window avoided this mismatch with the heartbeat cadence.
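
The mismatch described above can be reproduced with a small sketch of the restore gate. `PresenceSnapshot`, `RestoreLimits`, and `canRestorePresence` are hypothetical names; the 15s, 30s, and 60s values are the ones cited in this comment, and the condition mirrors the three clauses visible in the diff excerpt.

```typescript
interface PresenceSnapshot {
  status: string;         // last known presence status
  ageMs: number;          // age of the stored session snapshot
  heartbeatAgeMs: number; // time since the last confirmed heartbeat
}

interface RestoreLimits {
  restoreMaxAgeMs: number;
  heartbeatRecencyMs: number;
}

function canRestorePresence(s: PresenceSnapshot, limits: RestoreLimits): boolean {
  return (
    s.status !== "offline" &&
    s.ageMs <= limits.restoreMaxAgeMs &&
    s.heartbeatAgeMs <= limits.heartbeatRecencyMs
  );
}

// A healthy session refreshed 20s after its last heartbeat (interval is 30s).
const snapshot: PresenceSnapshot = { status: "available", ageMs: 5000, heartbeatAgeMs: 20000 };

// New gate: heartbeat window equals the 15s restore window, so the restore fails.
const withNewGate = canRestorePresence(snapshot, { restoreMaxAgeMs: 15000, heartbeatRecencyMs: 15000 });
// Previous gate: a 60s recency window covers the 30s cadence, so the restore succeeds.
const withOldGate = canRestorePresence(snapshot, { restoreMaxAgeMs: 15000, heartbeatRecencyMs: 60000 });
```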


coygg added a commit that referenced this pull request Jun 1, 2026
* test: full audit coverage — archiveRecordingToStorage, health-checks, stripe-retry, rate-limit

- archiveRecordingToStorage: 21 tests covering happy path, extension detection,
  upload failure, signedUrl failure, call_update_failures insert path,
  failure insert error logging, and missing callRecord skip
- health-checks.ts: 23 tests (0% -> 100%) — checkTelnyxCCA, checkTelnyxCredentialConnection,
  checkSupabaseReachability, EXPECTED_WEBHOOK_URL derivation, all error/timeout paths
- stripe-retry.ts: 11 tests (41% -> 100%) — isStripeConnectionError, withStripeRetry
  happy path, connection error passthrough, 429 retry with retry-after cap, status fallback
- rate-limit.ts: 12 tests added (64% -> 93%) — isRateLimited, peekRateLimit,
  getRateLimitRetryAfterSecs, refundRateLimit, clearRateLimit

Total: 975 -> 1030 tests (+55), all passing

* fix: address CodeRabbit test quality findings on PR #336

* fix: strengthen test assertions per CodeRabbit review on PR #336

* fix: add missing assertions in archive-recording tests

* Harden agent presence heartbeat and surface offline state

* fix: address CodeRabbit findings on PR #337

* fix presence stale-check error handling and codrabbit test findings

* fix: final test assertion hardening per CodeRabbit

* Fix ghost voicemail loops and add hangup safeguards

* fix: resolve CodeRabbit findings for presence and voicemail flow
@coygg coygg deleted the fix/voicemail-ghost-call branch June 1, 2026 07:05
1 participant