TICKET-597: Fix worker orphan processes and add max test timeout cap#710
Conversation
98b4235 to
e17503a
Compare
e2fbad0 to
e17503a
Compare
e17503a to
bd5f44c
Compare
donny-wong
left a comment
There was a problem hiding this comment.
Thanks @Naragod. There are occasions where err could be empty because the OS kills the process immediately (like SIGKILL or OOM kills), and the script won't have time to print the traceback to stderr, and so the err will be an empty string. Please take into account of proc.returncode which ensures we detect these silent crashes reliably. Thank you.
Good catch. I have added an extra check. Please take another look. |
david-yz-liu
left a comment
There was a problem hiding this comment.
@Naragod nice work. In addition to the inline comments I left, please also make sure to update the Changelog.
85f20f3 to
3a8adff
Compare
15513f1 to
1596f78
Compare
for more information, see https://pre-commit.ci
Description
Worker processes running student tests could become orphaned when the
RQworker crashed or restarted before cleanup ran, becausestart_new_session=Truefully detached test processes into their own session. This made them invisible to process supervision and allowed them to run indefinitely, blocking the dedicated test user slot.Implementation
start_new_session=Truefrom the subprocess call in order to reap test processes when the worker dies.max_test_timeoutserver-side configuration (default 600s) that caps per-test timeouts, preventing instructor-configured or missing timeouts from letting tests hang forever during normal operation.start_new_sessionmeansos.killpgcan no longer be used in the same-user (dev) path without killing the worker itself, so it is replaced with proc.kill() + proc.wait() to target the child process directly.