feat(worker): surface child memory state on failures, drop no-op --optimize-for-size by aymericcousaert · Pull Request #89 · data-fair/processings

aymericcousaert · 2026-05-25T08:48:12Z

Context

While investigating a recurring nightly OOM on a customer deployment, the only diagnostic that made it back to the user was:

child process exited with code 134

That message is not actionable for an operator tuning a deployment — they can't tell whether the child hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for some other reason. It took several round-trips with the customer to collect host-level dmesg and docker inspect output before the actual cause (cgroup memory limit) was confirmed.

This PR adds three small, independent diagnostic improvements to the worker so the next operator hitting the same wall gets a self-explanatory error in the run document.

What changes

1. Memory snapshot on stderr at child start/end

// worker/src/task/index.ts
console.error(`task start mem ${formatMemoryUsage()}`)
// ...
console.error(`task end mem ${formatMemoryUsage()}`)

Renders as rss=512MB heap=420/480MB ext=20MB. Because the parent worker already captures the child's stderr via buildErrorMessageFromStderr (and filters out worker: debug noise), these lines are surfaced into the run's error message on non-zero exits — but don't pollute successful runs (the parent only writes an error message on a thrown exception).

Caveat: task end mem only fires on graceful exits (including JS-level errors that return from run() with exitCode = 1). On a hard V8 crash — std::bad_alloc / Check failed: (result.ptr) != nullptr, exit 134 — or a kernel SIGKILL (exit 137), only task start mem survives. That's still the more useful data point: it shows the starting heap, which combined with the V8 fatal stderr and the exit-code hint below is enough to diagnose and tune the limit. Peak memory remains invisible without a heap snapshot; capturing one would require an opt-in env var (left out of this PR to keep on-prem operations friction-free).

2. `exitCodeHint(code)` for OOM-typical exit codes

// worker/src/utils/worker-operations.ts
exitCodeHint(134) // → "SIGABRT, typique d'une allocation V8 impossible. Vérifier NODE_OPTIONS=--max-old-space-size et la limite mémoire du conteneur (mem_limit / resources.limits.memory)."
exitCodeHint(137) // → "SIGKILL, typique d'un OOM-kill par le noyau / cgroup. Augmenter la limite mémoire du conteneur (mem_limit / resources.limits.memory)."
exitCodeHint(1)   // → "" (unrelated)

Appended to the run error message in worker.ts:iter. Existing user-facing messages in this file are already in French ('le traitement a été désactivé', 'le temps de traitement autorisé est épuisé'), so the new hint matches the convention. Both Docker Compose (mem_limit) and Kubernetes (resources.limits.memory) terminology is included so the hint is actionable on either platform.

3. Drop `--optimize-for-size` from the worker Dockerfile CMD

Introduced in cbadef8 ("chore: worker runs with --optimize-for-size option", no body). The flag has no effect on the heavy per-run work — every processing is executed in a child process spawn-ed without inheriting this flag (see worker/src/worker.ts line ~211). The orchestrator parent only schedules and polls Mongo, so --optimize-for-size only slightly slows its GC for no observable benefit.

Memory tuning should be done at the container level via NODE_OPTIONS=--max-old-space-size=... and mem_limit / resources.limits.memory, which do propagate to both the parent and the spawned child. The Dockerfile now includes a comment explaining this.

What doesn't change

No new env vars, no new config keys, no new dependencies.
No behavior change for happy-path runs — the new lines go to stderr (with the existing [spawned task stderr] prefix), so they show in worker logs but don't appear in the run document unless the run actually fails.
The error-message filtering in buildErrorMessageFromStderr is unchanged.

How to verify

# Unit tests for the new helpers
npx playwright test tests/features/worker-utils/worker-operations.unit.spec.ts

# End-to-end: simulate a heap overrun
docker run -e NODE_OPTIONS='--max-old-space-size=64' <built-image> \
  # trigger a processing that allocates > 64 MB
# The run document should now show:
#   task start mem rss=XXMB heap=.../...MB ext=...MB
#   <FATAL ERROR: ... JavaScript heap out of memory>
#   le processus enfant a abandonné (SIGABRT, code 134) — typique d'une allocation V8 impossible. Vérifier NODE_OPTIONS=--max-old-space-size et la limite mémoire du conteneur (mem_limit / resources.limits.memory).

Risk

Low. The new code paths are:

Two console.error calls in the child startup (executes once per run regardless of outcome).
A pure string-concat in the parent's catch block (no mutation).
One line removed from the Dockerfile.

The three changes are independently revertable. If the new console.error lines somehow break a downstream log parser, they can be reverted independently of the Dockerfile change.

…timize-for-size When a processing's child task aborts during a heavy run, the only diagnostic that previously made it to the run document was `child process exited with code N`. This is not actionable for operators tuning a deployment: they cannot tell whether the child hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for another reason. Changes: - Print a one-line memory snapshot to stderr at child start and end (`task start mem rss=...MB heap=.../...MB ext=...MB`). Because the parent worker already captures the child's stderr via `buildErrorMessageFromStderr`, these lines are surfaced into the run's error message on non-zero exits without polluting successful runs. - Add `exitCodeHint(code)`: a small mapping from 134 (SIGABRT, the V8 `Check failed: (result.ptr) != nullptr` / `std::bad_alloc` signature) and 137 (SIGKILL, the cgroup OOM-kill signature) to a human-readable hint pointing at `NODE_OPTIONS=--max-old-space-size` and `mem_limit` respectively. The hint is appended to the run error message. - Drop `--optimize-for-size` from the Dockerfile worker CMD (introduced in cbadef8). The flag has no effect on the heavy work — each run is executed in a child process spawned without this flag (see worker/src/worker.ts) — and only a small slowdown effect on the orchestrator's GC. Memory tuning should be done at the container level via `NODE_OPTIONS` and `mem_limit`, which propagate to both parent and child. - Unit tests for `formatMemoryUsage` and `exitCodeHint`. Discovered while investigating a recurring nightly OOM on a customer deployment where the only visible error was "child process exited with code 134" — a 30s-readable hint would have saved a week.

- Mention both `mem_limit` (Docker Compose) and `resources.limits.memory` (Kubernetes) so the hint is actionable on either platform. - Replace `let errorMessage` + reassignment with a single `const` ternary in worker.ts:iter (no behavior change). The existing unit tests still pass — they assert `.toContain('mem_limit')`, which remains true.

github-actions Bot added the feature label May 25, 2026

github-actions Bot added feature and removed feature labels May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(worker): surface child memory state on failures, drop no-op --optimize-for-size#89

feat(worker): surface child memory state on failures, drop no-op --optimize-for-size#89
aymericcousaert wants to merge 2 commits into
data-fair:masterfrom
aymericcousaert:feat/worker-memory-diagnostics

aymericcousaert commented May 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aymericcousaert commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

What changes

1. Memory snapshot on stderr at child start/end

2. exitCodeHint(code) for OOM-typical exit codes

3. Drop --optimize-for-size from the worker Dockerfile CMD

What doesn't change

How to verify

Risk

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aymericcousaert commented May 25, 2026 •

edited

Loading

2. `exitCodeHint(code)` for OOM-typical exit codes

3. Drop `--optimize-for-size` from the worker Dockerfile CMD