PCv2: skip log collection in signal-triggered cleanup by sisuresh · Pull Request #388 · stellar/supercluster

sisuresh · 2026-05-15T21:19:29Z

Summary

I'm not sure if this is a foolproof solution, but it should be better than what we currently have.

Split cleanup into a signal-triggered path (e.g. Jenkins abort) and a normal path. Under signal-triggered cleanup we skip log collection and go straight to helm uninstall so we don't get SIGKILLed mid-collection and leak worker pods.
Observed in practice on a 1024-worker run aborted from Jenkins: Jenkins' default ~5s SoftKillWaitSeconds (and per-pod terminationGracePeriodSeconds) makes parallel log collection unwinnable on the abort path. Logs already captured inline by the failure handler remain on disk.

Copilot

Pull request overview

Splits the cleanup path in the PCv2 mission into a signal-triggered variant (skips log collection, goes straight to helm uninstall) and a normal variant (collects logs first), to avoid leaking worker pods when Jenkins aborts a run and SIGKILLs the process before log collection finishes.

Changes:

Add signalTriggered parameter to cleanup to distinguish abort vs normal cleanup.
Skip log collection on abort path; preserve existing log-collection behavior on normal failure/exit.
Update ProcessExit/CancelKeyPress handlers to pass true, and the in-mission failure/normal exit sites to pass false.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The PCv2 cleanup function was collecting logs from every worker pod before calling `helm uninstall`. With up to 1024 workers and serial per-pod log collection, the cleanup would routinely take minutes — far exceeding the ~5s Jenkins gives between SIGTERM and SIGKILL. Result: aborted PCv2 runs were leaving entire worker fleets running after the Jenkins job died, because `helm uninstall` never got a chance to run. Add a `signalTriggered` parameter to `cleanup`. Signal handlers (ProcessExit, CancelKeyPress) now pass `true`, which skips log collection and goes straight to `helm uninstall`. Normal / legitimate-failure callers pass `false`, preserving the existing logs-first behavior so debugging output isn't lost when pods would still be alive to read from anyway.

sisuresh requested review from Copilot and jayz22 May 15, 2026 21:19

Copilot started reviewing on behalf of sisuresh May 15, 2026 21:20 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

sisuresh force-pushed the pcv2-cleanup branch from 520e275 to 13100a0 Compare May 15, 2026 21:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PCv2: skip log collection in signal-triggered cleanup#388

PCv2: skip log collection in signal-triggered cleanup#388
sisuresh wants to merge 1 commit into
stellar:mainfrom
sisuresh:pcv2-cleanup

sisuresh commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sisuresh commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sisuresh commented May 15, 2026 •

edited

Loading