Skip to content

feat(worker): surface child memory state on failures, drop no-op --optimize-for-size#89

Open
aymericcousaert wants to merge 2 commits into
data-fair:masterfrom
aymericcousaert:feat/worker-memory-diagnostics
Open

feat(worker): surface child memory state on failures, drop no-op --optimize-for-size#89
aymericcousaert wants to merge 2 commits into
data-fair:masterfrom
aymericcousaert:feat/worker-memory-diagnostics

Conversation

@aymericcousaert
Copy link
Copy Markdown

@aymericcousaert aymericcousaert commented May 25, 2026

Context

While investigating a recurring nightly OOM on a customer deployment, the only diagnostic that made it back to the user was:

child process exited with code 134

That message is not actionable for an operator tuning a deployment — they can't tell whether the child hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for some other reason. It took several round-trips with the customer to collect host-level dmesg and docker inspect output before the actual cause (cgroup memory limit) was confirmed.

This PR adds three small, independent diagnostic improvements to the worker so the next operator hitting the same wall gets a self-explanatory error in the run document.

What changes

1. Memory snapshot on stderr at child start/end

// worker/src/task/index.ts
console.error(`task start mem ${formatMemoryUsage()}`)
// ...
console.error(`task end mem ${formatMemoryUsage()}`)

Renders as rss=512MB heap=420/480MB ext=20MB. Because the parent worker already captures the child's stderr via buildErrorMessageFromStderr (and filters out worker: debug noise), these lines are surfaced into the run's error message on non-zero exits — but don't pollute successful runs (the parent only writes an error message on a thrown exception).

Caveat: task end mem only fires on graceful exits (including JS-level errors that return from run() with exitCode = 1). On a hard V8 crash — std::bad_alloc / Check failed: (result.ptr) != nullptr, exit 134 — or a kernel SIGKILL (exit 137), only task start mem survives. That's still the more useful data point: it shows the starting heap, which combined with the V8 fatal stderr and the exit-code hint below is enough to diagnose and tune the limit. Peak memory remains invisible without a heap snapshot; capturing one would require an opt-in env var (left out of this PR to keep on-prem operations friction-free).

2. exitCodeHint(code) for OOM-typical exit codes

// worker/src/utils/worker-operations.ts
exitCodeHint(134) // → "SIGABRT, typique d'une allocation V8 impossible. Vérifier NODE_OPTIONS=--max-old-space-size et la limite mémoire du conteneur (mem_limit / resources.limits.memory)."
exitCodeHint(137) // → "SIGKILL, typique d'un OOM-kill par le noyau / cgroup. Augmenter la limite mémoire du conteneur (mem_limit / resources.limits.memory)."
exitCodeHint(1)   // → "" (unrelated)

Appended to the run error message in worker.ts:iter. Existing user-facing messages in this file are already in French ('le traitement a été désactivé', 'le temps de traitement autorisé est épuisé'), so the new hint matches the convention. Both Docker Compose (mem_limit) and Kubernetes (resources.limits.memory) terminology is included so the hint is actionable on either platform.

3. Drop --optimize-for-size from the worker Dockerfile CMD

Introduced in cbadef8 ("chore: worker runs with --optimize-for-size option", no body). The flag has no effect on the heavy per-run work — every processing is executed in a child process spawn-ed without inheriting this flag (see worker/src/worker.ts line ~211). The orchestrator parent only schedules and polls Mongo, so --optimize-for-size only slightly slows its GC for no observable benefit.

Memory tuning should be done at the container level via NODE_OPTIONS=--max-old-space-size=... and mem_limit / resources.limits.memory, which do propagate to both the parent and the spawned child. The Dockerfile now includes a comment explaining this.

What doesn't change

  • No new env vars, no new config keys, no new dependencies.
  • No behavior change for happy-path runs — the new lines go to stderr (with the existing [spawned task stderr] prefix), so they show in worker logs but don't appear in the run document unless the run actually fails.
  • The error-message filtering in buildErrorMessageFromStderr is unchanged.

How to verify

# Unit tests for the new helpers
npx playwright test tests/features/worker-utils/worker-operations.unit.spec.ts

# End-to-end: simulate a heap overrun
docker run -e NODE_OPTIONS='--max-old-space-size=64' <built-image> \
  # trigger a processing that allocates > 64 MB
# The run document should now show:
#   task start mem rss=XXMB heap=.../...MB ext=...MB
#   <FATAL ERROR: ... JavaScript heap out of memory>
#   le processus enfant a abandonné (SIGABRT, code 134) — typique d'une allocation V8 impossible. Vérifier NODE_OPTIONS=--max-old-space-size et la limite mémoire du conteneur (mem_limit / resources.limits.memory).

Risk

Low. The new code paths are:

  • Two console.error calls in the child startup (executes once per run regardless of outcome).
  • A pure string-concat in the parent's catch block (no mutation).
  • One line removed from the Dockerfile.

The three changes are independently revertable. If the new console.error lines somehow break a downstream log parser, they can be reverted independently of the Dockerfile change.

…timize-for-size

When a processing's child task aborts during a heavy run, the only
diagnostic that previously made it to the run document was
`child process exited with code N`. This is not actionable for
operators tuning a deployment: they cannot tell whether the child
hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for
another reason.

Changes:

- Print a one-line memory snapshot to stderr at child start and end
  (`task start mem rss=...MB heap=.../...MB ext=...MB`). Because the
  parent worker already captures the child's stderr via
  `buildErrorMessageFromStderr`, these lines are surfaced into the
  run's error message on non-zero exits without polluting successful
  runs.
- Add `exitCodeHint(code)`: a small mapping from 134 (SIGABRT, the
  V8 `Check failed: (result.ptr) != nullptr` / `std::bad_alloc`
  signature) and 137 (SIGKILL, the cgroup OOM-kill signature) to a
  human-readable hint pointing at `NODE_OPTIONS=--max-old-space-size`
  and `mem_limit` respectively. The hint is appended to the run
  error message.
- Drop `--optimize-for-size` from the Dockerfile worker CMD
  (introduced in cbadef8). The flag has no effect on the heavy
  work — each run is executed in a child process spawned without
  this flag (see worker/src/worker.ts) — and only a small slowdown
  effect on the orchestrator's GC. Memory tuning should be done at
  the container level via `NODE_OPTIONS` and `mem_limit`, which
  propagate to both parent and child.
- Unit tests for `formatMemoryUsage` and `exitCodeHint`.

Discovered while investigating a recurring nightly OOM on a customer
deployment where the only visible error was "child process exited
with code 134" — a 30s-readable hint would have saved a week.
- Mention both `mem_limit` (Docker Compose) and `resources.limits.memory`
  (Kubernetes) so the hint is actionable on either platform.
- Replace `let errorMessage` + reassignment with a single `const` ternary
  in worker.ts:iter (no behavior change).

The existing unit tests still pass — they assert `.toContain('mem_limit')`,
which remains true.
@github-actions github-actions Bot added feature and removed feature labels May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant