Summary
config/settings/base.py configures CELERY_BROKER_TRANSPORT_OPTIONS with a full
socket_keepalive: True + socket_settings: {TCP_KEEPIDLE: 60, TCP_KEEPINTVL: 10, TCP_KEEPCNT: 9}
block (lines 402–413). This protects the Celery→broker connection from the
1-hour idle-connection drop that stateful cloud firewalls impose.
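For reference, that broker-side block looks roughly like this (a reconstruction from the description above, not a verbatim copy; the authoritative version is at config/settings/base.py lines 402–413):

import socket

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_settings": {
        socket.TCP_KEEPIDLE: 60,   # seconds of idle before the first probe
        socket.TCP_KEEPINTVL: 10,  # seconds between probes
        socket.TCP_KEEPCNT: 9,     # failed probes before the peer is declared dead
    },
}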
The CACHES block on lines 253–264 does not set any keepalive options:
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # no socket_keepalive, no socket_keepalive_options
        },
    }
}
redis-py does not set SO_KEEPALIVE by default. Without SO_KEEPALIVE, the
kernel never sends TCP keepalive probes on the socket — regardless of what
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
are set to at the host level. The host-level sysctls only apply to sockets that
have explicitly enabled SO_KEEPALIVE.
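redis-py does expose the opt-in; it is just off by default. A minimal standalone sketch of a client that enables it (host, port, and values here are illustrative):

import socket

import redis

# socket_keepalive / socket_keepalive_options are standard redis-py
# Connection kwargs; every socket the pool opens gets SO_KEEPALIVE plus
# the per-socket TCP_KEEP* overrides below.
client = redis.Redis(
    host="redis",
    port=6379,
    socket_keepalive=True,
    socket_keepalive_options={
        socket.TCP_KEEPIDLE: 60,   # idle seconds before the first probe
        socket.TCP_KEEPINTVL: 10,  # seconds between probes
        socket.TCP_KEEPCNT: 9,     # probes before the kernel gives up
    },
)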
Result: Redis cache connections pooled by django-redis can sit idle for more
than the cloud firewall's connection-tracking timeout (typically 3600s on
OpenStack Neutron-based platforms), get silently dropped, and then fail with
ECONNRESET the next time the pool hands them out.
Direct observation
On a production worker, we inspected the established TCP connections from the
celery worker container to redis:6379:
$ ss -ton state established dport = :6379
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 172.x.x.x:35296 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:37462 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:47014 192.168.x.x:6379 timer:(keepalive,99min,0)
0 0 172.x.x.x:47004 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:37482 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:35320 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:37478 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:37456 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:60822 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:60842 192.168.x.x:6379 timer:(keepalive,99min,0)
5 of the 10 connections have no keepalive timer. Those are django-redis
cache connections. The other 5 are Celery result-backend connections which
inherit keepalive from Celery's transport options. The split is ~50/50 at
steady state.
Any of the unprotected 5 can be silently dropped by the cloud firewall after
it sits in the connection pool past the firewall's idle timeout.
Concrete incident
A production async_api job was processing successfully — 507 / 840 images
completed, 0 failures, thousands of detections and classifications already
saved to the database, celery worker logs showing process_nats_pipeline_result
tasks succeeding at sub-second intervals.
Then one task hit a Redis connection reset:
20:39:51,279 ERROR Redis error updating job <id> state:
Error while reading from redis:6379 : (104, 'Connection reset by peer')
20:39:51,432 ERROR Job <id> marked as FAILURE: Redis state missing for job
Only 153 ms elapsed between the transient error and _fail_job being called. Other tasks on
the same celery worker had hit Redis successfully milliseconds before and
milliseconds after this event — it was a single connection, not a Redis
outage.
The failure path (_fail_job in ami/jobs/tasks.py) then:
- Marked the job FAILURE
- Wiped the Redis state
- Deleted the NATS stream and consumer
…rendering the job irrecoverable. The hundreds of already-processed images are
still in the database, but the job record says FAILURE with 0 failed images,
which is confusing at best.
The code-path brittleness (update_state mapping RedisError to return None,
which the caller treats as permanent) is tracked separately — see the companion
ticket on conflation of transient and terminal errors in
AsyncJobStateManager.update_state.
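For context only, the brittle shape looks roughly like this. This is an illustrative reconstruction from the behavior described above, not the actual code in ami/jobs/tasks.py:

from redis.exceptions import RedisError

class AsyncJobStateManager:
    def _read_state(self, job_id: str) -> dict:
        # Hypothetical stand-in for the real Redis read.
        raise NotImplementedError

    def update_state(self, job_id: str) -> dict | None:
        try:
            return self._read_state(job_id)
        except RedisError:
            # A transient I/O error is collapsed into the same signal as
            # "state genuinely missing"; the caller cannot tell them apart.
            return None

# Caller (simplified): a None return triggers the permanent failure path,
# so one reset connection becomes a terminal job FAILURE.
#
#     state = manager.update_state(job_id)
#     if state is None:
#         _fail_job(job_id)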
This ticket is about the config gap that makes the transient happen in the
first place.
Why the host-level keepalive fix alone isn't sufficient
This deployment already has sysctl-level keepalive tuning applied at the host:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
That's the right fix for the cloud firewall issue, but it only takes effect on
sockets where the application has called setsockopt(SO_KEEPALIVE, 1). The
Celery broker path does this (via socket_keepalive in
CELERY_BROKER_TRANSPORT_OPTIONS). The django-redis cache path does not.
A host-level fix cannot compensate for an application that never opts its
sockets in.
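To see the opt-in requirement directly: a freshly created socket reports SO_KEEPALIVE as 0 no matter what the sysctls say. A quick sketch (Linux-specific behavior):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Prints 0: the kernel will never send probes on this socket, even though
# net.ipv4.tcp_keepalive_* are tuned at the host level.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))

# The application has to opt in explicitly:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # now 1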
Proposed fix
Add TCP keepalive to the CACHES default options, mirroring what the Celery
broker already does. Note that django-redis has no top-level SOCKET_KEEPALIVE
option; the keepalive kwargs go through CONNECTION_POOL_KWARGS, which
django-redis forwards to the redis-py connection pool:

import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # TCP keepalive — prevents cloud firewalls from silently
            # dropping pooled Redis connections after their idle timeout.
            # Matches the keepalive values in CELERY_BROKER_TRANSPORT_OPTIONS.
            "CONNECTION_POOL_KWARGS": {
                "socket_keepalive": True,
                "socket_keepalive_options": {
                    socket.TCP_KEEPIDLE: 60,
                    socket.TCP_KEEPINTVL: 10,
                    socket.TCP_KEEPCNT: 9,
                },
            },
            # Optional but recommended: also set a socket connect timeout
            # so a hung DNS lookup / TCP handshake doesn't block the worker
            # forever. SOCKET_CONNECT_TIMEOUT is a documented django-redis
            # option that maps to redis-py's socket_connect_timeout.
            "SOCKET_CONNECT_TIMEOUT": 5,
        },
    }
}
django-redis merges CONNECTION_POOL_KWARGS into the keyword arguments of the
redis-py ConnectionPool it builds. Every redis.Connection created from that
pool then calls setsockopt(SO_KEEPALIVE, 1) at connect time and applies each
entry in socket_keepalive_options via setsockopt(SOL_TCP, opt, val).
Reference: https://github.com/jazzband/django-redis/blob/master/django_redis/pool.py
and https://redis-py.readthedocs.io/en/stable/connections.html (look for
socket_keepalive and socket_keepalive_options).
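Conceptually, the connect-time handling is equivalent to the following sketch (a paraphrase of redis-py's keepalive handling, not a verbatim excerpt):

import socket

def apply_keepalive(sock: socket.socket, options: dict[int, int]) -> None:
    # Equivalent of redis.Connection's keepalive handling: enable
    # SO_KEEPALIVE, then apply each TCP_KEEP* override per socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for opt, val in options.items():
        sock.setsockopt(socket.SOL_TCP, opt, val)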
Verification after the fix
After deploying, run inside the celery worker container:
ss -ton state established dport = :6379
Every established Redis connection should now show a timer:(keepalive,...)
entry. No more bare connections.
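As a secondary check from inside Django, confirm the kwargs actually reached the pool. This leans on django-redis/redis-py internals (cache.client.get_client() and ConnectionPool.connection_kwargs), so treat it as a debugging aid, not a stable API:

# Run in `python manage.py shell` inside the worker container.
from django.core.cache import cache

client = cache.client.get_client(write=True)  # underlying redis.Redis
kwargs = client.connection_pool.connection_kwargs
print(kwargs.get("socket_keepalive"))          # expect: True
print(kwargs.get("socket_keepalive_options"))  # expect: {TCP_KEEPIDLE: 60, ...}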
Scope clarification
This is one of two bugs that together caused a production job to fail:
1. (this ticket) Cloud firewall drops idle Redis cache connections because
   django-redis sockets aren't SO_KEEPALIVE-enabled.
2. (companion ticket) AsyncJobStateManager.update_state() treats a
   RedisError the same as "state genuinely missing", so a single transient
   kills the job with no retry.
Fixing (1) makes the transient much rarer. Fixing (2) makes the remaining
transients survivable. Fix both.
Related
- setup/apply_keepalive_fix.sh in the infra repo (internal) — already applied
  the sysctls at the host level but doesn't (and can't) fix the missing
  application-level SO_KEEPALIVE opt-in.