PCv2: skip log collection in signal-triggered cleanup#388
Open
sisuresh wants to merge 1 commit into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Splits the cleanup path in the PCv2 mission into a signal-triggered variant (skips log collection, goes straight to helm uninstall) and a normal variant (collects logs first), to avoid leaking worker pods when Jenkins aborts a run and SIGKILLs the process before log collection finishes.
Changes:
- Add
signalTriggeredparameter tocleanupto distinguish abort vs normal cleanup. - Skip log collection on abort path; preserve existing log-collection behavior on normal failure/exit.
- Update
ProcessExit/CancelKeyPresshandlers to passtrue, and the in-mission failure/normal exit sites to passfalse.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The PCv2 cleanup function was collecting logs from every worker pod before calling `helm uninstall`. With up to 1024 workers and serial per-pod log collection, the cleanup would routinely take minutes — far exceeding the ~5s Jenkins gives between SIGTERM and SIGKILL. Result: aborted PCv2 runs were leaving entire worker fleets running after the Jenkins job died, because `helm uninstall` never got a chance to run. Add a `signalTriggered` parameter to `cleanup`. Signal handlers (ProcessExit, CancelKeyPress) now pass `true`, which skips log collection and goes straight to `helm uninstall`. Normal / legitimate-failure callers pass `false`, preserving the existing logs-first behavior so debugging output isn't lost when pods would still be alive to read from anyway.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
I'm not sure if this is a foolproof solution, but it should be better than what we currently have.
cleanupinto a signal-triggered path (e.g. Jenkins abort) and a normal path. Under signal-triggered cleanup we skip log collection and go straight tohelm uninstallso we don't get SIGKILLed mid-collection and leak worker pods.SoftKillWaitSeconds(and per-podterminationGracePeriodSeconds) makes parallel log collection unwinnable on the abort path. Logs already captured inline by the failure handler remain on disk.