test(infra): migrate integration tests to explicit CleanupGuard cleanup#515
Merged
Conversation
CleanupGuard::drop currently calls .join() with no timeout. When test cleanup hangs (DB pool contention, slow DELETE, lock wait, etc.), the test process holds until nextest's 120s slow-timeout kills it. Add a 30-second polling join: if cleanup completes within 30s, join normally. Otherwise, log to stderr and return — the cleanup thread continues detached but the test process is no longer blocked. This is a stopgap. The proper fix (explicit .cleanup().await) is in follow-up commits. Refs #509 (same anti-pattern, different test set)
Tests should call .cleanup().await at the end of the body instead of relying on Drop. The new method consumes self, runs all cleanup actions on the caller's tokio runtime, and leaves Drop as a no-op. The Drop fallback now also prints a 'dropped with N pending actions' warning so the migration progress is visible. After all tests are migrated and CI is green for ~2 weeks, the warning will be replaced with a panic in a follow-up commit. Refs spec docs/superpowers/specs/2026-04-11-security-audit-followups-design.md
10 test sites migrated. Includes the 2 known-flaky tests that triggered the CleanupGuard CI fix: - test_globally_banned_user_cannot_join_via_discovery - test_channel_limit
6 test sites migrated. Includes the known-flaky test: - test_custom_status_with_expiry_persists - test_custom_status_clear_in_database
24 test sites migrated.
17 test sites migrated.
12 test sites migrated.
10 test sites migrated.
10 test sites migrated.
…p().await 8 test sites migrated.
6 test sites migrated.
….await 6 test sites migrated.
…().await 6 test sites migrated.
6 test sites migrated.
5 test sites migrated.
…await 5 test sites migrated.
…wait 3 test sites migrated.
…eanup().await 2 test sites migrated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Eliminates the recurring CI flake where Rust integration tests intermittently time out at nextest's 120s
slow-timeout. Three different tests (later 4) hit this on PRs #512, #513, and #514 — always reported as test #900/900 because nextest reports completion in finish-time order.Root cause:
CleanupGuard::drop(server/tests/integration/helpers/mod.rs) spawned a thread + new tokio runtime +.join().expect()with no timeout. When cleanup hangs (DB pool contention, slow DELETE, lock wait), the test process holds for 120s and gets killed.Issue #509 documented the same anti-pattern in the setup tests with the same root cause analysis.
Layers
Layer 1 — Stopgap (1 commit, helpers/mod.rs): Bounded join in
CleanupGuard::drop. 30-second polling join, then detach. Immediately removes the failure mode.Layer 2 — Proper fix (1 commit + 16 migration commits, ~136 sites across 16 files): Added
pub async fn cleanup(mut self)that consumes the guard and runs cleanup actions on the caller's existing tokio runtime. Migrated all 136 test sites to callguard.cleanup().await;before returning. The Drop fallback stays as a permanent safety net for tests that panic before reaching cleanup (RAII semantics) and prints a "dropped with N pending actions" warning so unmigrated tests are visible.Layer 3 — Deferred: After ~2 weeks of clean main-branch CI with zero "dropped with N pending actions" warnings, a follow-up commit will replace the Drop fallback's eprintln with
panic!()to permanently enforce the contract. Tracked in: #516.Migration coverage
Test plan
cargo build --tests -p vc-servercleancargo clippy -p vc-server --lib -- -D warningscleancargo nextest run -p vc-server --lib→ 430/430 passingcargo fmt --checkcleangrep -c "let mut guard"equalsgrep -c "guard.cleanup().await"across all 16 files (136/136)Refs
🤖 Generated with Claude Code