Paused workers shouldn't steal tasks (flaky test_avoid_paused_workers) #5664

@crusaderky

#5431 changed Scheduler.decide_worker so that it no longer assigns new tasks to workers with paused or closing_gracefully status.
However, those workers still steal tasks, which effectively negates the benefits of that PR.

This is reflected in the flakiness of test_avoid_paused_workers; the test frequently hangs in CI at these lines:

    while (len(w1.tasks), len(w2.tasks), len(w3.tasks)) != (4, 0, 4):
        await asyncio.sleep(0.01)

Above, w2 is paused, so it should hold no tasks. Instead, the tuple ends up as (3, 1, 4). If you add await wait(futures) to the test, it hangs deterministically: a task is always stolen from one of the running workers by the paused one, where it sits forever since nothing steals it back.

Adding config={"distributed.scheduler.work-stealing": False} to the gen_cluster decorator makes the issue disappear.
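The mechanism can be illustrated with a toy model in plain Python. This is only a sketch and does not use or reproduce the actual distributed scheduler or its stealing extension: ToyWorker, decide_worker, steal_once, simulate, and the skip_paused flag are all hypothetical names invented for illustration, and the stealing rule (move one task from the busiest worker to the idlest candidate) is an assumed simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    # Hypothetical stand-in for a worker; not the distributed Worker class.
    name: str
    status: str = "running"  # "running" or "paused"
    tasks: list = field(default_factory=list)


def decide_worker(workers):
    """Pick the least-loaded *running* worker (the intent of #5431)."""
    running = [w for w in workers if w.status == "running"]
    return min(running, key=lambda w: len(w.tasks))


def steal_once(workers, skip_paused):
    """Move one task from the busiest worker to the idlest thief candidate.

    With skip_paused=False, paused workers are allowed to act as thieves,
    which is the behaviour this issue complains about.
    """
    thieves = [w for w in workers if not skip_paused or w.status == "running"]
    victim = max(workers, key=lambda w: len(w.tasks))
    thief = min(thieves, key=lambda w: len(w.tasks))
    if len(victim.tasks) - len(thief.tasks) < 2:
        return False  # nothing worth stealing
    thief.tasks.append(victim.tasks.pop())
    return True


def simulate(skip_paused, stealing=True):
    """Assign 8 tasks across w1, w2 (paused), w3, then run one stealing round."""
    workers = [ToyWorker("w1"), ToyWorker("w2", status="paused"), ToyWorker("w3")]
    for i in range(8):
        decide_worker(workers).tasks.append(f"task-{i}")
    if stealing:
        steal_once(workers, skip_paused)
    return tuple(len(w.tasks) for w in workers)


if __name__ == "__main__":
    print(simulate(skip_paused=False))                  # paused worker may steal
    print(simulate(skip_paused=True))                   # paused worker excluded
    print(simulate(skip_paused=False, stealing=False))  # stealing disabled
```

In this toy model, letting the paused worker steal yields (3, 1, 4), while either excluding paused workers from the thief candidates or disabling stealing keeps the distribution at (4, 0, 4). Whether the real fix should filter thieves this way is an open question for the actual stealing code.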

Labels

flaky test: Intermittent failures on CI.
