Skip to content

fix: KeyError on concurrent same-key job attempts in job_task_contexts#293

Merged
tobymao merged 1 commit into
tobymao:mainfrom
drizzt:fix/job-task-contexts-race
May 13, 2026
Merged

fix: KeyError on concurrent same-key job attempts in job_task_contexts#293
tobymao merged 1 commit into
tobymao:mainfrom
drizzt:fix/job-task-contexts-race

Conversation

@drizzt

@drizzt drizzt commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

What and why

Job.__hash__ is derived from .key, so two in-flight attempts of the same-key job collapse onto a single slot in Worker.job_task_contexts. This happens in practice when a sweep or rolling-restart retries a job whose previous attempt is still winding down. The first attempt's finally block pops the second attempt's entry, and the second attempt then crashes at the completion check:

File ".../saq/worker.py", line 374, in process
    if self.job_task_contexts[job]["aborted"] is None:
KeyError: Job<...>

Fix

Key job_task_contexts by id(job) (unique per Python Job object) instead of the Job itself, and carry the Job inside JobTaskContext so abort() can still correlate dict entries with refreshed queue state. Each attempt now occupies its own slot and the finally pop is scoped to that attempt.

abort() fetches fresh queue state once per key and applies it to every local attempt tracking that key, preserving the existing "one refresh per key" network pattern.

Tests

Adds tests/test_worker.py::test_job_task_contexts_same_key_isolation covering the two-attempts-same-key case.

@tobymao

tobymao commented Apr 21, 2026

Copy link
Copy Markdown
Owner

i don't really understand the purpose of this, there must be a simpler way, job.key is globally unique, id() is not stable

this fix doesn't seem correct, i don't see why job context can't be keyed on job key

Comment thread tests/test_worker.py Outdated

async def test_job_task_contexts_same_key_isolation(self) -> None:
"""Two concurrent attempts of a same-``key`` job keep isolated contexts."""
from saq.types import JobTaskContext

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test seems contrived, i don't really get how it happens in real life

@drizzt drizzt force-pushed the fix/job-task-contexts-race branch 2 times, most recently from ee41d15 to 139dc0b Compare April 21, 2026 17:35
@tobymao

tobymao commented Apr 21, 2026

Copy link
Copy Markdown
Owner

please fix the build

@drizzt drizzt marked this pull request as draft April 21, 2026 18:11
@drizzt

drizzt commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

Sorry, I'm working on the new implementation as you asked...

By the way, it is a real bug. I had ~30 times in 2 days, before I introduced my fix, in production and this was the reason of my PR:

Traceback (most recent call last):
  File "/opt/ubot/.venv/lib64/python3.12/site-packages/saq/worker.py", line 374, in process
    if self.job_task_contexts[job]["aborted"] is None:
       ~~~~~~~~~~~~~~~~~~~~~~^^^^^
KeyError: Job<function: send, kwargs: {'user_id': 1, 'folder_id': '1', 'job_id': '7ffae5c3-8d1b-4bc1-96dd-5d3fa7f312d6', 'is_scheduled': True}, queue: ubot, key: chat:1, timeout: 600, heartbeat: 120, ttl: 60, scheduled: 1776728001, attempts: 2, queued: 1776727800983, started: 1776728145390, touched: 1776728465491, error: cancelled, status: active, worker_id: a727541e-3d11-11f1-9ffb-00001702cc98>

Note attempts: 2 and error: cancelled — sweep had already cancelled-and-retried the original, and the retry crashed at the completion check.

The race:

  1. Attempt A registers job_task_contexts[job] = ctxA, enters wait_for.
  2. Sweep aborts the stuck attempt and re-enqueues.
  3. Attempt B dequeues the retry (new Job instance, same .key → same hash) and stores ctxB, overwriting A's slot.
  4. A's finally pops the slot — which now holds ctxB.
  5. B's completion check hits self.job_task_contexts[job]["aborted"]KeyError.

Keying on job.key alone doesn't fix it — same-key attempts still share one slot; the problem is the finally pop is unconditional. So in the revised fix job_task_contexts stays keyed by Job (which hashes by .key).
Inside process():

  • hold a local task_ctx reference and read "aborted" through it (no dict lookup on the completion path);
  • in the finally, only delete the slot if it still points at our own ctx.

No id(), no new fields, abort() unchanged.

@drizzt drizzt force-pushed the fix/job-task-contexts-race branch from 139dc0b to c869b18 Compare April 21, 2026 18:23
Job.__hash__ is derived from .key, so two in-flight attempts of the
same-key job (e.g. a sweep-driven retry of a job whose previous attempt
is still winding down) share one slot in Worker.job_task_contexts.
Whichever attempt's finally pops the slot first leaves the other to
raise KeyError at the completion check:

    File ".../saq/worker.py", line 374, in process
        if self.job_task_contexts[job]["aborted"] is None:
    KeyError: Job<... key: chat:1, attempts: 2, error: cancelled, status: active>

Keep the dict keyed by Job (i.e. by .key), but hold a local reference to
the JobTaskContext inside process() and read "aborted" through it. In
the finally, only remove the slot if it still holds our ctx -- a
successor attempt's slot is preserved. abort() is unchanged.

Adds a regression test that reproduces the race by running two
process() coroutines with same-key jobs and asserting both complete.
@drizzt drizzt force-pushed the fix/job-task-contexts-race branch from c869b18 to de0cdd4 Compare April 21, 2026 18:49
@drizzt drizzt marked this pull request as ready for review April 21, 2026 19:09
@drizzt

drizzt commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

Now it pass any test again: https://github.com/drizzt/saq/actions/runs/24740433208

@roberttabellyn-sketch roberttabellyn-sketch left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ótimo

@tobymao tobymao merged commit f832999 into tobymao:main May 13, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants