fix flaky tests boxel-cli: realm push delete #4728
Open
Conversation
mise's task supervisor reparents long-running task scripts (the `mise-tasks/services/*` scripts and the worker's `ts-node --transpileOnly worker` child) to init, so a PPID-based `kill_tree` alone can't reach them when the trap fires on Ctrl-C / EXIT. Result: services keep running after `mise run dev` (or `pnpm start:all`) exits, worker-manager keeps port 4210 bound, and the next dev run fails to start with EADDRINUSE. `dev-all` got this fix in #4704; `dev` was missed. This ports the matching `set -m` + `kill_tree` + `cleanup` trap (plus the regex-escaped `pkill -f` sweep scoped to `$REPO_ROOT`) into `dev`, adapted for the simpler structure (no separate HOST_PID — host runs inside SAT phase 1 here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… children on cleanup The `--transpileOnly worker` regex only catches worker / worker-manager invocations; the realm-server (`--transpileOnly main`), test-realms (also `main`), prerender (`prerender/prerender-server`), and prerender- manager (`prerender/manager-server`) ts-node children were missed. None of the wrappers `exec` ts-node, so killing the bash wrapper alone leaves the ts-node grandchild reparented to init with its port still bound — the same EADDRINUSE-on-next-run failure mode this cleanup is meant to prevent, just on 4201 / 4202 / 4221 / 4222 instead of 4210. Broaden the second pkill pattern to `--transpileOnly (worker|main| prerender)` so all five service entrypoints are signalled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers can come up before the realm-server has bound its port (by design in dev — the worker stack starts in parallel with realm-server and any from-scratch-index jobs queued from a previous run get picked up immediately on startup). Each ECONNREFUSED currently logs a full TypeError + undici stack + nested cause, drowning the boot logs in noise that's expected and self-resolving. Detect ECONNREFUSED on the err or its cause and log a single line naming the job, type, and unreachable target. All other errors keep their existing full-stack treatment, and Sentry still captures the exception in deployed environments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup infrastructure (REPO_ROOT computation, kill_tree, the absolute-path pkill sweep, and the explanatory comments around them) was duplicated verbatim across `dev` and `dev-all`. Move the parts that don't vary between tasks — `kill_tree()` and a new `sweep_orphaned_services()` helper — into the file both tasks already source. Each task is left with only the bits that genuinely differ (SAT_PID handling in `dev`; HOST_PID + readiness wait in `dev-all`). Also prepend the absolute repo and realm-server `node_modules/.bin` paths to PATH inside `dev-common.sh`. The previous relative `./node_modules/.bin` entry meant binaries resolved through PATH could carry relative argv[0] in spawned children, weakening the absolute-path anchor in the cleanup sweep. `dev-all` already had this prepend inline; centralizing it makes the assumption explicit and applies it to `dev` too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens boxel realm push --delete behavior in the Boxel CLI to reduce flaky failures when DELETE requests hang, while also adjusting dev-task process cleanup and queue-runner logging.
Changes:
- Add a timeout + post-timeout probe for remote deletes in
RealmSyncBase.deleteFile()to continue when the delete actually applied server-side. - Exclude realm-managed
index.jsonandrealm.jsonfrom remote-only delete candidates and make the delete plan deterministic. - Improve dev-task cleanup by centralizing process-sweeping helpers, and reduce queue-runner log noise for transient
ECONNREFUSEDerrors.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/postgres/pg-queue.ts | Adjusts job failure logging to reduce stack-trace noise for ECONNREFUSED. |
| packages/boxel-cli/src/lib/realm-sync-base.ts | Adds DELETE timeout handling, diagnostics, and a probe to determine whether a timed-out delete actually applied. |
| packages/boxel-cli/src/commands/realm/push.ts | Excludes realm-managed artifacts from --delete candidates and switches to a sorted, sequential delete plan. |
| mise-tasks/lib/dev-common.sh | Centralizes PATH bootstrapping and adds kill_tree + orphaned-service sweeping logic. |
| mise-tasks/dev-all | Sources shared dev-common helpers and uses the new sweep function during cleanup. |
| mise-tasks/dev | Switches to bash, adds job-control + cleanup that kills the tracked supervisor PID and sweeps orphans. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+595
to
+598
| let elapsedMs = Date.now() - startedAt; | ||
| console.error( | ||
| ` Delete request failed after ${elapsedMs}ms: ${relativePath}`, | ||
| ); |
Comment on lines
+744
to
+747
| let target = | ||
| cause?.address && cause?.port | ||
| ? `${cause.address}:${cause.port}` | ||
| : 'upstream'; |
Comment on lines
+748
to
+756
| console.warn( | ||
| `job ${jobToRun.id} (${jobToRun.job_type}) failed: ECONNREFUSED to ${target}`, | ||
| ); | ||
| } else { | ||
| console.error( | ||
| `Error running job ${jobToRun.id}: jobType=${ | ||
| jobToRun.job_type | ||
| } args=${JSON.stringify(jobToRun.args)}`, | ||
| err, |
Comment on lines
+23
to
+27
| # Sweep orphaned mise services scoped to this checkout. mise's task supervisor | ||
| # reparents long-running task scripts to init, so a tree-walk from a known | ||
| # pid can't reach them via PPID. SIGTERM first for anything that listens, | ||
| # brief grace, then SIGKILL — belt-and-suspenders for processes that don't | ||
| # respond to TERM. |
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
boxel realm push --deleteagainst a flaky delete-response hang in Boxel CLI integration testsindex.jsonandrealm.jsonwhen computing remote-only delete candidatesRoot Cause
The flaky
tests/integration/realm-push.test.tsfailure was not caused by deleting realm-managed artifacts. In the reproduced failure, deleting the real remote-only file (remove.gts) sometimes applied on the server but the DELETE response never completed before the client timed out. That left the CLI treating the push as failed even though the file was already gone.Validation
pnpm lint:jsinpackages/boxel-clipnpm exec vitest run tests/integration/realm-push.test.ts --pool=forks --poolOptions.forks.singleFork --testNamePattern "push with --delete removes remote-only files (delete alone does not checkpoint)"pnpm exec vitest run tests/integration/realm-push.test.ts --pool=forks --poolOptions.forks.singleFork