#5431 changed Scheduler.decide_worker so that it no longer assigns new tasks to workers with paused or closing_gracefully status.
However, those workers still steal tasks, which effectively negates the benefit of that PR.
This shows up as flakiness in test_avoid_paused_workers; the test frequently hangs in CI on these lines:
    while (len(w1.tasks), len(w2.tasks), len(w3.tasks)) != (4, 0, 4):
        await asyncio.sleep(0.01)
In the snippet above, w2 is paused, yet the tuple ends up as (3, 1, 4) instead of (4, 0, 4). If you add await wait(futures), the test hangs deterministically: a task is always stolen from one of the running workers by the paused one, and it sits there forever because nothing steals it back.
Adding config={"distributed.scheduler.work-stealing": False} as an extra argument to the gen_cluster decorator makes the issue disappear.
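To make the failure mode concrete, here is a minimal toy model of it. It is not distributed's actual work-stealing code: ToyWorker, Status, steal_once, simulate and the skip_non_running flag are all hypothetical names invented for this sketch, and the "busiest worker gives one task to an idle worker" policy is a simplifying assumption. It only illustrates why an idle paused worker ends up with a task unless the stealing step also checks worker status.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical stand-in for worker status; not distributed's own enum.
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"


@dataclass
class ToyWorker:
    name: str
    status: Status = Status.running
    tasks: list = field(default_factory=list)


def steal_once(workers, skip_non_running):
    """Move one task from the busiest worker to an idle one, if possible.

    With skip_non_running=False, any idle worker may act as a thief,
    including a paused one (the behaviour described in this issue).
    With skip_non_running=True, only running workers may steal.
    """
    thieves = [
        w
        for w in workers
        if not w.tasks and (w.status is Status.running or not skip_non_running)
    ]
    if not thieves:
        return False
    victim = max(workers, key=lambda w: len(w.tasks))
    if len(victim.tasks) < 2:
        return False
    thieves[0].tasks.append(victim.tasks.pop())
    return True


def simulate(skip_non_running):
    # w2 is paused, so the initial assignment gives it nothing,
    # mirroring what decide_worker does after #5431.
    w1 = ToyWorker("w1")
    w2 = ToyWorker("w2", Status.paused)
    w3 = ToyWorker("w3")
    for i in range(8):
        (w1 if i % 2 == 0 else w3).tasks.append(f"task-{i}")
    steal_once([w1, w2, w3], skip_non_running)
    return len(w1.tasks), len(w2.tasks), len(w3.tasks)
```

In this toy model, simulate(False) returns (3, 1, 4), the same shape as the tuple seen in the hanging test, while simulate(True) returns (4, 0, 4). A real fix would need the equivalent status check wherever the scheduler picks a thief.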