Skip to content

fix flaky tests boxel-cli: realm push delete #4728

Open
habdelra wants to merge 5 commits intomainfrom
fix-boxel-cli-delete-timeout-flake
Open

fix flaky tests boxel-cli: realm push delete #4728
habdelra wants to merge 5 commits intomainfrom
fix-boxel-cli-delete-timeout-flake

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 8, 2026

Summary

  • harden boxel realm push --delete against a flaky delete-response hang in Boxel CLI integration tests
  • skip realm-managed index.json and realm.json when computing remote-only delete candidates
  • add delete diagnostics and a post-timeout probe so follow-up investigation still captures which file timed out and whether the delete had already applied

Root Cause

The flaky tests/integration/realm-push.test.ts failure was not caused by deleting realm-managed artifacts. In the reproduced failure, deleting the real remote-only file (remove.gts) sometimes applied on the server but the DELETE response never completed before the client timed out. That left the CLI treating the push as failed even though the file was already gone.

Validation

  • pnpm lint:js in packages/boxel-cli
  • pnpm exec vitest run tests/integration/realm-push.test.ts --pool=forks --poolOptions.forks.singleFork --testNamePattern "push with --delete removes remote-only files (delete alone does not checkpoint)"
  • pnpm exec vitest run tests/integration/realm-push.test.ts --pool=forks --poolOptions.forks.singleFork

habdelra and others added 5 commits May 8, 2026 09:33
mise's task supervisor reparents long-running task scripts (the
`mise-tasks/services/*` scripts and the worker's `ts-node --transpileOnly
worker` child) to init, so a PPID-based `kill_tree` alone can't reach
them when the trap fires on Ctrl-C / EXIT. Result: services keep running
after `mise run dev` (or `pnpm start:all`) exits, worker-manager keeps
port 4210 bound, and the next dev run fails to start with EADDRINUSE.

`dev-all` got this fix in #4704; `dev` was missed. This ports the
matching `set -m` + `kill_tree` + `cleanup` trap (plus the regex-escaped
`pkill -f` sweep scoped to `$REPO_ROOT`) into `dev`, adapted for the
simpler structure (no separate HOST_PID — host runs inside SAT phase 1
here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… children on cleanup

The `--transpileOnly worker` regex only catches worker / worker-manager
invocations; the realm-server (`--transpileOnly main`), test-realms
(also `main`), prerender (`prerender/prerender-server`), and prerender-
manager (`prerender/manager-server`) ts-node children were missed.
None of the wrappers `exec` ts-node, so killing the bash wrapper alone
leaves the ts-node grandchild reparented to init with its port still
bound — the same EADDRINUSE-on-next-run failure mode this cleanup is
meant to prevent, just on 4201 / 4202 / 4221 / 4222 instead of 4210.

Broaden the second pkill pattern to `--transpileOnly (worker|main|
prerender)` so all five service entrypoints are signalled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers can come up before the realm-server has bound its port (by
design in dev — the worker stack starts in parallel with realm-server
and any from-scratch-index jobs queued from a previous run get picked
up immediately on startup). Each ECONNREFUSED currently logs a full
TypeError + undici stack + nested cause, drowning the boot logs in
noise that's expected and self-resolving.

Detect ECONNREFUSED on the err or its cause and log a single line
naming the job, type, and unreachable target. All other errors keep
their existing full-stack treatment, and Sentry still captures the
exception in deployed environments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup infrastructure (REPO_ROOT computation, kill_tree, the
absolute-path pkill sweep, and the explanatory comments around them)
was duplicated verbatim across `dev` and `dev-all`. Move the parts
that don't vary between tasks — `kill_tree()` and a new
`sweep_orphaned_services()` helper — into the file both tasks already
source. Each task is left with only the bits that genuinely differ
(SAT_PID handling in `dev`; HOST_PID + readiness wait in `dev-all`).

Also prepend the absolute repo and realm-server `node_modules/.bin`
paths to PATH inside `dev-common.sh`. The previous relative
`./node_modules/.bin` entry meant binaries resolved through PATH
could carry relative argv[0] in spawned children, weakening the
absolute-path anchor in the cleanup sweep. `dev-all` already had this
prepend inline; centralizing it makes the assumption explicit and
applies it to `dev` too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra marked this pull request as ready for review May 8, 2026 14:43
@habdelra habdelra requested a review from Copilot May 8, 2026 14:43
@habdelra habdelra changed the title [codex] boxel-cli: harden realm push delete flake fix flaky tests boxel-cli: realm push delete May 8, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens boxel realm push --delete behavior in the Boxel CLI to reduce flaky failures when DELETE requests hang, while also adjusting dev-task process cleanup and queue-runner logging.

Changes:

  • Add a timeout + post-timeout probe for remote deletes in RealmSyncBase.deleteFile() to continue when the delete actually applied server-side.
  • Exclude realm-managed index.json and realm.json from remote-only delete candidates and make the delete plan deterministic.
  • Improve dev-task cleanup by centralizing process-sweeping helpers, and reduce queue-runner log noise for transient ECONNREFUSED errors.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/postgres/pg-queue.ts Adjusts job failure logging to reduce stack-trace noise for ECONNREFUSED.
packages/boxel-cli/src/lib/realm-sync-base.ts Adds DELETE timeout handling, diagnostics, and a probe to determine whether a timed-out delete actually applied.
packages/boxel-cli/src/commands/realm/push.ts Excludes realm-managed artifacts from --delete candidates and switches to a sorted, sequential delete plan.
mise-tasks/lib/dev-common.sh Centralizes PATH bootstrapping and adds kill_tree + orphaned-service sweeping logic.
mise-tasks/dev-all Sources shared dev-common helpers and uses the new sweep function during cleanup.
mise-tasks/dev Switches to bash, adds job-control + cleanup that kills the tracked supervisor PID and sweeps orphans.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +595 to +598
let elapsedMs = Date.now() - startedAt;
console.error(
` Delete request failed after ${elapsedMs}ms: ${relativePath}`,
);
Comment on lines +744 to +747
let target =
cause?.address && cause?.port
? `${cause.address}:${cause.port}`
: 'upstream';
Comment on lines +748 to +756
console.warn(
`job ${jobToRun.id} (${jobToRun.job_type}) failed: ECONNREFUSED to ${target}`,
);
} else {
console.error(
`Error running job ${jobToRun.id}: jobType=${
jobToRun.job_type
} args=${JSON.stringify(jobToRun.args)}`,
err,
Comment on lines +23 to +27
# Sweep orphaned mise services scoped to this checkout. mise's task supervisor
# reparents long-running task scripts to init, so a tree-walk from a known
# pid can't reach them via PPID. SIGTERM first for anything that listens,
# brief grace, then SIGKILL — belt-and-suspenders for processes that don't
# respond to TERM.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Host Test Results

    1 files      1 suites   1h 48m 0s ⏱️
2 643 tests 2 628 ✅ 15 💤 0 ❌
2 662 runs  2 647 ✅ 15 💤 0 ❌

Results for commit 7688817.

Realm Server Test Results

    1 files      1 suites   16m 14s ⏱️
1 297 tests 1 297 ✅ 0 💤 0 ❌
1 376 runs  1 376 ✅ 0 💤 0 ❌

Results for commit 7688817.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants