feat(worker): surface child memory state on failures, drop no-op --optimize-for-size#89
Open
aymericcousaert wants to merge 2 commits into
Open
Conversation
…timize-for-size When a processing's child task aborts during a heavy run, the only diagnostic that previously made it to the run document was `child process exited with code N`. This is not actionable for operators tuning a deployment: they cannot tell whether the child hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for another reason. Changes: - Print a one-line memory snapshot to stderr at child start and end (`task start mem rss=...MB heap=.../...MB ext=...MB`). Because the parent worker already captures the child's stderr via `buildErrorMessageFromStderr`, these lines are surfaced into the run's error message on non-zero exits without polluting successful runs. - Add `exitCodeHint(code)`: a small mapping from 134 (SIGABRT, the V8 `Check failed: (result.ptr) != nullptr` / `std::bad_alloc` signature) and 137 (SIGKILL, the cgroup OOM-kill signature) to a human-readable hint pointing at `NODE_OPTIONS=--max-old-space-size` and `mem_limit` respectively. The hint is appended to the run error message. - Drop `--optimize-for-size` from the Dockerfile worker CMD (introduced in cbadef8). The flag has no effect on the heavy work — each run is executed in a child process spawned without this flag (see worker/src/worker.ts) — and only a small slowdown effect on the orchestrator's GC. Memory tuning should be done at the container level via `NODE_OPTIONS` and `mem_limit`, which propagate to both parent and child. - Unit tests for `formatMemoryUsage` and `exitCodeHint`. Discovered while investigating a recurring nightly OOM on a customer deployment where the only visible error was "child process exited with code 134" — a 30s-readable hint would have saved a week.
- Mention both `mem_limit` (Docker Compose) and `resources.limits.memory`
(Kubernetes) so the hint is actionable on either platform.
- Replace `let errorMessage` + reassignment with a single `const` ternary
in worker.ts:iter (no behavior change).
The existing unit tests still pass — they assert `.toContain('mem_limit')`,
which remains true.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
While investigating a recurring nightly OOM on a customer deployment, the only diagnostic that made it back to the user was:
That message is not actionable for an operator tuning a deployment — they can't tell whether the child hit a V8 heap ceiling, was OOM-killed by the cgroup, or died for some other reason. It took several round-trips with the customer to collect host-level
dmesganddocker inspectoutput before the actual cause (cgroup memory limit) was confirmed.This PR adds three small, independent diagnostic improvements to the worker so the next operator hitting the same wall gets a self-explanatory error in the run document.
What changes
1. Memory snapshot on stderr at child start/end
Renders as
rss=512MB heap=420/480MB ext=20MB. Because the parent worker already captures the child's stderr viabuildErrorMessageFromStderr(and filters outworker:debug noise), these lines are surfaced into the run's error message on non-zero exits — but don't pollute successful runs (the parent only writes an error message on a thrown exception).2.
exitCodeHint(code)for OOM-typical exit codesAppended to the run error message in
worker.ts:iter. Existing user-facing messages in this file are already in French ('le traitement a été désactivé','le temps de traitement autorisé est épuisé'), so the new hint matches the convention. Both Docker Compose (mem_limit) and Kubernetes (resources.limits.memory) terminology is included so the hint is actionable on either platform.3. Drop
--optimize-for-sizefrom the worker Dockerfile CMDIntroduced in cbadef8 ("chore: worker runs with --optimize-for-size option", no body). The flag has no effect on the heavy per-run work — every processing is executed in a child process
spawn-ed without inheriting this flag (seeworker/src/worker.tsline ~211). The orchestrator parent only schedules and polls Mongo, so--optimize-for-sizeonly slightly slows its GC for no observable benefit.Memory tuning should be done at the container level via
NODE_OPTIONS=--max-old-space-size=...andmem_limit/resources.limits.memory, which do propagate to both the parent and the spawned child. The Dockerfile now includes a comment explaining this.What doesn't change
[spawned task stderr]prefix), so they show in worker logs but don't appear in the run document unless the run actually fails.buildErrorMessageFromStderris unchanged.How to verify
Risk
Low. The new code paths are:
console.errorcalls in the child startup (executes once per run regardless of outcome).The three changes are independently revertable. If the new
console.errorlines somehow break a downstream log parser, they can be reverted independently of the Dockerfile change.