When a subprocess exits, the operator gets no generic notification. That is not a bad design in itself, but most of our subprocesses are meant to run indefinitely, so an unexpected exit becomes a trap: it can take a long time to notice that the subprocess we care about is no longer behaving as designed.
Typical design:
Things that could be done: either make the restart more automatic (a) or make the crash more obvious (b).
(a) Probably easiest done by modifying the subprocess mixin's basic_control_target (or adding a continuous-control version) so that, when the is_alive check fails, it restarts the worker.
(b) The strongest method would be to override the ping functionality so that it checks whether the worker is alive (this only works for a single level of worker). The cleanup method could also be made to emit far more errors.
- Either way, we could review the cleanup behavior of the inheriting classes to make sure they put out sufficient exit information, and harden against those failure modes.
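A minimal sketch of how (a) and (b) might fit together, not the actual mixin code. `SupervisedWorker`, `check`, `max_restarts`, and the shape of the `ping` reply are all hypothetical names chosen for illustration; the only assumption about the worker is that it exposes `start()` and `is_alive()` (and optionally `exitcode`), as `multiprocessing.Process` does.

```python
import logging
from typing import Callable, Protocol

log = logging.getLogger("worker_supervisor")


class Worker(Protocol):
    """Anything with the start/is_alive surface of multiprocessing.Process."""

    def start(self) -> None: ...
    def is_alive(self) -> bool: ...


class SupervisedWorker:
    """Hypothetical wrapper: restarts a dead worker (a) and exposes liveness via ping (b)."""

    def __init__(self, make_worker: Callable[[], Worker], max_restarts: int = 3):
        self._make_worker = make_worker      # factory, so a fresh worker can be built on restart
        self._max_restarts = max_restarts    # assumed cap to avoid a tight crash loop
        self.restarts = 0
        self._worker = make_worker()
        self._worker.start()

    def ping(self) -> dict:
        # Option (b): the ping reply reports on the worker, not only on the parent.
        return {"worker_alive": self._worker.is_alive(), "restarts": self.restarts}

    def check(self) -> bool:
        """Run one is_alive check and restart on failure.

        Returns True if a worker is running once the check completes.
        """
        if self._worker.is_alive():
            return True

        # Make the crash obvious: log loudly, including the exit code when available.
        exitcode = getattr(self._worker, "exitcode", None)
        log.error("worker exited unexpectedly (exitcode=%s)", exitcode)

        if self.restarts >= self._max_restarts:
            log.critical("restart limit of %d reached; worker stays down", self._max_restarts)
            return False

        # Option (a): build and start a replacement worker.
        self.restarts += 1
        self._worker = self._make_worker()
        self._worker.start()
        return self._worker.is_alive()
```

In this sketch the control loop would call `check()` on each iteration, with the factory being something like `lambda: multiprocessing.Process(target=run_worker)`, where `run_worker` stands in for the real worker entry point. As noted in (b), the liveness reported by `ping()` only covers the direct child, not workers that the child spawns itself.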