Skip to content

PCv2: skip log collection in signal-triggered cleanup#388

Open
sisuresh wants to merge 1 commit into
stellar:mainfrom
sisuresh:pcv2-cleanup
Open

PCv2: skip log collection in signal-triggered cleanup#388
sisuresh wants to merge 1 commit into
stellar:mainfrom
sisuresh:pcv2-cleanup

Conversation

@sisuresh
Copy link
Copy Markdown
Contributor

@sisuresh sisuresh commented May 15, 2026

Summary

I'm not sure if this is a foolproof solution, but it should be better than what we currently have.

  • Split cleanup into a signal-triggered path (e.g. Jenkins abort) and a normal path. Under signal-triggered cleanup we skip log collection and go straight to helm uninstall so we don't get SIGKILLed mid-collection and leak worker pods.
  • Observed in practice on a 1024-worker run aborted from Jenkins: Jenkins' default ~5s SoftKillWaitSeconds (and per-pod terminationGracePeriodSeconds) makes parallel log collection unwinnable on the abort path. Logs already captured inline by the failure handler remain on disk.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Splits the cleanup path in the PCv2 mission into a signal-triggered variant (skips log collection, goes straight to helm uninstall) and a normal variant (collects logs first), to avoid leaking worker pods when Jenkins aborts a run and SIGKILLs the process before log collection finishes.

Changes:

  • Add signalTriggered parameter to cleanup to distinguish abort vs normal cleanup.
  • Skip log collection on abort path; preserve existing log-collection behavior on normal failure/exit.
  • Update ProcessExit/CancelKeyPress handlers to pass true, and the in-mission failure/normal exit sites to pass false.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The PCv2 cleanup function was collecting logs from every worker pod before
calling `helm uninstall`. With up to 1024 workers and serial per-pod log
collection, the cleanup would routinely take minutes — far exceeding the ~5s
Jenkins gives between SIGTERM and SIGKILL. Result: aborted PCv2 runs were
leaving entire worker fleets running after the Jenkins job died, because
`helm uninstall` never got a chance to run.

Add a `signalTriggered` parameter to `cleanup`. Signal handlers
(ProcessExit, CancelKeyPress) now pass `true`, which skips log collection
and goes straight to `helm uninstall`. Normal / legitimate-failure callers
pass `false`, preserving the existing logs-first behavior so debugging
output isn't lost when pods would still be alive to read from anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants