fix: KeyError on concurrent same-key job attempts in job_task_contexts by drizzt · Pull Request #293 · tobymao/saq

drizzt · 2026-04-21T00:38:35Z

What and why

Job.__hash__ is derived from .key, so two in-flight attempts of the same-key job collapse onto a single slot in Worker.job_task_contexts. This happens in practice when a sweep or rolling-restart retries a job whose previous attempt is still winding down. The first attempt's finally block pops the second attempt's entry, and the second attempt then crashes at the completion check:

File ".../saq/worker.py", line 374, in process
    if self.job_task_contexts[job]["aborted"] is None:
KeyError: Job<...>

Fix

Key job_task_contexts by id(job) (unique per Python Job object) instead of the Job itself, and carry the Job inside JobTaskContext so abort() can still correlate dict entries with refreshed queue state. Each attempt now occupies its own slot and the finally pop is scoped to that attempt.

abort() fetches fresh queue state once per key and applies it to every local attempt tracking that key, preserving the existing "one refresh per key" network pattern.

Tests

Adds tests/test_worker.py::test_job_task_contexts_same_key_isolation covering the two-attempts-same-key case.

tobymao · 2026-04-21T01:16:21Z

i don't really understand the purpose of this, there must be a simpler way, job.key is globally unique, id() is not stable

this fix doesn't seem correct, i don't see why job context can't be keyed on job key

tobymao · 2026-04-21T01:17:07Z


+    async def test_job_task_contexts_same_key_isolation(self) -> None:
+        """Two concurrent attempts of a same-``key`` job keep isolated contexts."""
+        from saq.types import JobTaskContext


this test seems contrived, i don't really get how it happens in real life

tobymao · 2026-04-21T17:50:03Z

please fix the build

drizzt · 2026-04-21T18:15:08Z

Sorry, I'm working on the new implementation as you asked...

By the way, it is a real bug. I had ~30 times in 2 days, before I introduced my fix, in production and this was the reason of my PR:

Traceback (most recent call last):
  File "/opt/ubot/.venv/lib64/python3.12/site-packages/saq/worker.py", line 374, in process
    if self.job_task_contexts[job]["aborted"] is None:
       ~~~~~~~~~~~~~~~~~~~~~~^^^^^
KeyError: Job<function: send, kwargs: {'user_id': 1, 'folder_id': '1', 'job_id': '7ffae5c3-8d1b-4bc1-96dd-5d3fa7f312d6', 'is_scheduled': True}, queue: ubot, key: chat:1, timeout: 600, heartbeat: 120, ttl: 60, scheduled: 1776728001, attempts: 2, queued: 1776727800983, started: 1776728145390, touched: 1776728465491, error: cancelled, status: active, worker_id: a727541e-3d11-11f1-9ffb-00001702cc98>

Note attempts: 2 and error: cancelled — sweep had already cancelled-and-retried the original, and the retry crashed at the completion check.

The race:

Attempt A registers job_task_contexts[job] = ctxA, enters wait_for.
Sweep aborts the stuck attempt and re-enqueues.
Attempt B dequeues the retry (new Job instance, same .key → same hash) and stores ctxB, overwriting A's slot.
A's finally pops the slot — which now holds ctxB.
B's completion check hits self.job_task_contexts[job]["aborted"] → KeyError.

Keying on job.key alone doesn't fix it — same-key attempts still share one slot; the problem is the finally pop is unconditional. So in the revised fix job_task_contexts stays keyed by Job (which hashes by .key).
Inside process():

hold a local task_ctx reference and read "aborted" through it (no dict lookup on the completion path);
in the finally, only delete the slot if it still points at our own ctx.

No id(), no new fields, abort() unchanged.

Job.__hash__ is derived from .key, so two in-flight attempts of the same-key job (e.g. a sweep-driven retry of a job whose previous attempt is still winding down) share one slot in Worker.job_task_contexts. Whichever attempt's finally pops the slot first leaves the other to raise KeyError at the completion check: File ".../saq/worker.py", line 374, in process if self.job_task_contexts[job]["aborted"] is None: KeyError: Job<... key: chat:1, attempts: 2, error: cancelled, status: active> Keep the dict keyed by Job (i.e. by .key), but hold a local reference to the JobTaskContext inside process() and read "aborted" through it. In the finally, only remove the slot if it still holds our ctx -- a successor attempt's slot is preserved. abort() is unchanged. Adds a regression test that reproduces the race by running two process() coroutines with same-key jobs and asserting both complete.

drizzt · 2026-04-21T19:09:11Z

Now it pass any test again: https://github.com/drizzt/saq/actions/runs/24740433208

roberttabellyn-sketch

Ótimo

tobymao reviewed Apr 21, 2026

View reviewed changes

drizzt force-pushed the fix/job-task-contexts-race branch 2 times, most recently from ee41d15 to 139dc0b Compare April 21, 2026 17:35

drizzt marked this pull request as draft April 21, 2026 18:11

drizzt force-pushed the fix/job-task-contexts-race branch from 139dc0b to c869b18 Compare April 21, 2026 18:23

drizzt force-pushed the fix/job-task-contexts-race branch from c869b18 to de0cdd4 Compare April 21, 2026 18:49

drizzt marked this pull request as ready for review April 21, 2026 19:09

roberttabellyn-sketch reviewed May 13, 2026

View reviewed changes

tobymao merged commit f832999 into tobymao:main May 13, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: KeyError on concurrent same-key job attempts in job_task_contexts#293

fix: KeyError on concurrent same-key job attempts in job_task_contexts#293
tobymao merged 1 commit into
tobymao:mainfrom
drizzt:fix/job-task-contexts-race

drizzt commented Apr 21, 2026

Uh oh!

tobymao commented Apr 21, 2026 •

edited

Loading

Uh oh!

tobymao Apr 21, 2026

Uh oh!

tobymao commented Apr 21, 2026

Uh oh!

drizzt commented Apr 21, 2026

Uh oh!

drizzt commented Apr 21, 2026

Uh oh!

roberttabellyn-sketch left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

drizzt commented Apr 21, 2026

What and why

Fix

Tests

Uh oh!

tobymao commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tobymao Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

tobymao commented Apr 21, 2026

Uh oh!

drizzt commented Apr 21, 2026

Uh oh!

drizzt commented Apr 21, 2026

Uh oh!

roberttabellyn-sketch left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tobymao commented Apr 21, 2026 •

edited

Loading