fix: reclassify Apollo network-error log severity #58

Draft
TaprootFreak wants to merge 1 commit into develop from fix/apollo-network-error-log-severity

TaprootFreak commented Jun 1, 2026

Summary

Mirror of d-EURO/api#117 per the
shared-codebase convention. Reclassify recoverable Apollo network errors
from error to warn, drop log payload fields the active formatter
silently discards, and refresh the fallback window after expiry.

Tracked in DFXServer/server#278.

Investigation

Sampled 2026-06-01 ~11:00 CEST on dfxprd (Loki):

| Container | error/60m | error/24h |
| --- | --- | --- |
| juicedollar-jdm-api-11 | 19 | 357 |
| juicedollar-jdt-api-11 | 7 | 250 |

Both PRD containers run with LOG_LEVEL=warn (compose), so the existing
info-level breadcrumbs ([Ponder] Network error detected, activating fallback) were filtered out before they ever reached Loki. That's why
the ApiApolloConfig events show up as 100% error severity in the
dashboards even though the runtime path is the recoverable one.
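
The filtering effect described above can be sketched as follows. This is a minimal illustration, assuming the npm-style level ordering that Winston uses by default (lower number means more severe); `LEVELS`, `isEmitted`, and `emitted` are hypothetical names and not taken from the codebase.

```typescript
// npm-style severity ordering (Winston's default): lower number = more severe.
const LEVELS = { error: 0, warn: 1, info: 2 } as const;
type Level = keyof typeof LEVELS;

// A line is emitted only if it is at least as severe as the configured threshold.
function isEmitted(lineLevel: Level, configuredLevel: Level): boolean {
  return LEVELS[lineLevel] <= LEVELS[configuredLevel];
}

// With LOG_LEVEL=warn the info breadcrumb is dropped; warn and error lines pass.
const emitted = (['info', 'warn', 'error'] as Level[]).filter((level) =>
  isEmitted(level, 'warn'),
);
```

Under this ordering, only the final `error` line of the recovery path was ever visible in Loki, which matches the dashboard picture described above.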

GraphQL-errors path: 0/24h on dfxprd — confirms the noise is purely
from the network-error path.

Two latent issues fixed alongside (same as in d-EURO/api#117):

  1. Metadata silently dropped. Winston formatter in api.main.ts uses
    only info.message; the { message, name, stack } second arg never
    shipped. Inlining networkError.message into the line keeps the
    signal that was being lost.
  2. activateFallback() only armed once per process lifetime.
    fallbackUntil stayed truthy after the first activation, so the guard
    if (!fallbackUntil) never passed again and the "Switching to fallback
    for 10min" log fired once per container boot. The guard now re-arms
    once the window has passed.
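
Both latent issues can be sketched in isolation. The code below is a simplified illustration under stated assumptions, not the actual contents of api.main.ts or api.apollo.config.ts: `formatLine` is a plain-function stand-in for the formatter (it does not use the Winston API), `fallbackUntil` is assumed to be a numeric timestamp with 0 meaning "unset", and `armOnce` / `armOrRefresh` are hypothetical names.

```typescript
// --- Issue 1: a formatter that reads only info.message drops all other fields ---
interface LogInfo {
  level: string;
  message: string;
  [key: string]: unknown; // extra metadata such as name or stack
}

// Simplified stand-in for the formatter; only level and message reach the output.
function formatLine(info: LogInfo): string {
  return `${info.level}: ${info.message}`;
}

// --- Issue 2: one-shot guard versus re-arming guard ---
const FALLBACK_WINDOW_MS = 10 * 60 * 1000; // the 10-minute window from the log line

// Old behaviour: arms only while fallbackUntil is still unset (0).
function armOnce(fallbackUntil: number, now: number): number {
  if (!fallbackUntil) return now + FALLBACK_WINDOW_MS;
  return fallbackUntil; // stays at the first deadline, even long after expiry
}

// New behaviour: also re-arms once the previous window has expired.
function armOrRefresh(fallbackUntil: number, now: number): number {
  if (!fallbackUntil || now >= fallbackUntil) return now + FALLBACK_WINDOW_MS;
  return fallbackUntil; // window still active, leave it untouched
}
```

With the one-shot guard, a second outage after the first window has expired leaves the stale deadline in place. The refreshed guard moves the deadline forward, so a sustained outage keeps the fallback active.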

Change

Byte-identical to the d-EURO sibling PR. Both apollo files are now in
sync (originally diverged by one line — the && CONFIG.indexerFallback
guard — which is now consolidated into both sides, benign in d-EURO
where indexerFallback has a config-level default).

Behaviour matrix

| Scenario | Severity | Retries? |
| --- | --- | --- |
| Primary fails, fallback configured & different URL, first time | warn | yes, via fallback |
| Primary fails, fallback configured & different URL, already on fallback | error | no (propagate) |
| Primary fails, no fallback URL configured (current jdm/jdt PRD) | error | no |
| GraphQL error | error | no (unchanged) |
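
The matrix can be expressed as a small pure function. This is a sketch only: `Scenario`, `Decision`, and `decide` are hypothetical names, and treating a fallback URL equal to the primary URL as "no usable fallback" is an assumption based on the "different URL" wording in the matrix.

```typescript
type Severity = 'warn' | 'error';

interface Decision {
  severity: Severity;
  retryViaFallback: boolean;
}

interface Scenario {
  kind: 'network' | 'graphql';
  primaryUrl: string;
  fallbackUrl?: string; // undefined or empty when no fallback is configured
  alreadyOnFallback: boolean;
}

function decide(s: Scenario): Decision {
  // GraphQL errors are unchanged: always error, never retried.
  if (s.kind === 'graphql') return { severity: 'error', retryViaFallback: false };

  // Assumption: a fallback only counts if it is non-empty and differs from the primary.
  const hasUsableFallback = !!s.fallbackUrl && s.fallbackUrl !== s.primaryUrl;

  // Recoverable case: first failure with a usable fallback available.
  if (hasUsableFallback && !s.alreadyOnFallback) {
    return { severity: 'warn', retryViaFallback: true };
  }

  // Nothing left to retry with: propagate as a real failure.
  return { severity: 'error', retryViaFallback: false };
}
```

For example, `decide({ kind: 'network', primaryUrl: 'https://a.example', alreadyOnFallback: false })` falls through to the final branch, which corresponds to the no-fallback row that applies to the current jdm/jdt PRD setup.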

Behaviour change for the no-fallback case

The PRD compose for jdm/jdt does not set
CONFIG_INDEXER_FALLBACK_URL (intentional: there is no second indexer
deployment). With the previous code, network errors entered the recovery
branch anyway and called forward(operation) — a same-URL retry that
cannot recover anything meaningful, since both attempts hit the same
endpoint in the same JS tick. This PR drops that retry and propagates
the error to the caller instead.

  • Error counts in Loki stay roughly the same (~25/h combined) because
    these are real failures, not noise being mis-classified.
  • Severity stays at error — the no-fallback branch logs logger.error,
    not warn. That's the correct semantic: nothing to retry with means
    it's a real failure for the client.
  • Clients see the error one round-trip earlier. The dapp and bots have
    their own polling/retry cycles, so the user-visible effect is minimal.

Once a fallback indexer is provisioned (CONFIG_INDEXER_FALLBACK_URL
becomes a non-empty string different from CONFIG_INDEXER_URL), the
recovery branch lights up automatically: warn + URL switch + retry, just
like d-EURO PRD does today.

Expected post-deploy effect on dfxprd

  • ApiApolloConfig error-level lines: stay at ~607/d combined
    (jdm ~357, jdt ~250). These are real failures that can't be hidden
    behind a non-existent fallback, so they remain visible on the
    error-rate panel.
  • One round-trip less per failure event (no useless same-URL retry).
  • Info-level [Ponder] Network error detected … breadcrumbs gone
    entirely (they were filtered by LOG_LEVEL=warn anyway, so no
    observable change).

A real noise reduction for JD requires either provisioning a separate
fallback indexer endpoint, or accepting these as real-failure signals
(my read).

Test plan

  • yarn build clean (verified locally on the branch HEAD)
  • yarn lint clean (verified locally on the branch HEAD)
  • npx prettier --check api.apollo.config.ts clean
  • After deploy to dfxprd: error-rate stays in the same band but
    no longer doubles up with retry round-trips; CI logs from the
    dapp/bots show no regression.

Drop noise from the recoverable-retry path:

- Use logger.warn (not logger.error) when the primary indexer fails and
  the fallback is about to be engaged; reserve logger.error for cases
  where no fallback is configured or the fallback itself failed.
- Drop the {message, name, stack} metadata payload — the Winston
  formatter in api.main.ts uses only info.message, so it never reached
  Loki anyway. Inline the message into the log line for actual signal.
- Collapse the redundant info-level breadcrumbs ('Network error
  detected' / '503 Service Unavailable') into the single warn line.
- Refresh the fallback window after expiry instead of arming it once
  per process lifetime, so a sustained outage keeps the fallback active.

Behaviour change for the no-fallback case (jdm/jdt PRD have no
CONFIG_INDEXER_FALLBACK_URL set): the same-URL retry via forward() is
dropped because it cannot recover anything. Errors now propagate to
the caller without an extra round-trip.

Mirrors the parallel change in d-EURO/api per the shared codebase
convention.