Django Redis cache connections are missing socket_keepalive, so pooled connections get silently dropped by the cloud firewall after ~1h idle and the next job to grab one from the pool dies with ECONNRESET #1218

@mihow

Description

Summary

config/settings/base.py configures CELERY_BROKER_TRANSPORT_OPTIONS with a full
socket_keepalive: True + socket_settings: {TCP_KEEPIDLE: 60, TCP_KEEPINTVL: 10, TCP_KEEPCNT: 9}
block (lines 402–413). This protects the Celery→broker connection from the
1-hour idle-connection drop that stateful cloud firewalls impose.

The CACHES block on lines 253–264 does not set any keepalive options:

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # no socket_keepalive, no socket_keepalive_options
        },
    }
}

redis-py does not set SO_KEEPALIVE by default. Without SO_KEEPALIVE, the
kernel never sends TCP keepalive probes on the socket — regardless of what
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
are set to at the host level. The host-level sysctls only apply to sockets that
have explicitly enabled SO_KEEPALIVE.
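
For comparison, this is what the opt-in looks like on a bare redis-py client; socket_keepalive and socket_keepalive_options are standard redis-py connection kwargs, and the host/port here are placeholders:

import socket
import redis

r = redis.Redis(
    host="redis",                    # placeholder; use the real cache host
    port=6379,
    socket_keepalive=True,           # setsockopt(SOL_SOCKET, SO_KEEPALIVE, 1) at connect time
    socket_keepalive_options={
        socket.TCP_KEEPIDLE: 60,     # start probing after 60s idle
        socket.TCP_KEEPINTVL: 10,    # probe every 10s
        socket.TCP_KEEPCNT: 9,       # declare the peer dead after 9 failed probes
    },
)
r.ping()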

Result: Redis cache connections pooled by django-redis can sit idle for more
than the cloud firewall's connection-tracking timeout (typically 3600s on
OpenStack Neutron-based platforms), get silently dropped, and then fail with
ECONNRESET the next time the pool hands them out.

Direct observation

On a production worker, we compared active TCP connections from the celery
worker container to redis:6379:

$ ss -ton state established dport = :6379
Recv-Q Send-Q Local Address:Port   Peer Address:Port     Process
0      0      172.x.x.x:35296      192.168.x.x:6379      ← no keepalive timer
0      0      172.x.x.x:37462      192.168.x.x:6379      ← no keepalive timer
0      0      172.x.x.x:47014      192.168.x.x:6379      timer:(keepalive,99min,0)
0      0      172.x.x.x:47004      192.168.x.x:6379      ← no keepalive timer
0      0      172.x.x.x:37482      192.168.x.x:6379      timer:(keepalive,100min,0)
0      0      172.x.x.x:35320      192.168.x.x:6379      timer:(keepalive,100min,0)
0      0      172.x.x.x:37478      192.168.x.x:6379      timer:(keepalive,100min,0)
0      0      172.x.x.x:37456      192.168.x.x:6379      ← no keepalive timer
0      0      172.x.x.x:60822      192.168.x.x:6379      ← no keepalive timer
0      0      172.x.x.x:60842      192.168.x.x:6379      timer:(keepalive,99min,0)

5 of the 10 connections have no keepalive timer. Those are django-redis
cache connections. The other 5 are Celery result-backend connections which
inherit keepalive from Celery's transport options. The split is ~50/50 at
steady state.

Any of the five unprotected connections can sit idle in the pool past the
cloud firewall's timeout and be silently dropped.

Concrete incident

A production async_api job was processing successfully — 507 / 840 images
completed, 0 failures, thousands of detections and classifications already
saved to the database, celery worker logs showing process_nats_pipeline_result
tasks succeeding at sub-second intervals.

Then one task hit a Redis connection reset:

20:39:51,279 ERROR  Redis error updating job <id> state:
    Error while reading from redis:6379 : (104, 'Connection reset by peer')
20:39:51,432 ERROR  Job <id> marked as FAILURE: Redis state missing for job

Only 153 ms elapsed between the transient error and _fail_job being called. Other tasks on
the same celery worker had hit Redis successfully milliseconds before and
milliseconds after this event — it was a single connection, not a Redis
outage.

The failure path (_fail_job in ami/jobs/tasks.py) then:

  • Marked the job FAILURE
  • Wiped the Redis state
  • Deleted the NATS stream and consumer

…rendering the job irrecoverable. The hundreds of already-processed images are
still in the database, but the job record says FAILURE with 0 failed images,
which is confusing at best.

The code-path brittleness (update_state mapping RedisError to return None,
which the caller treats as permanent) is tracked separately — see the companion
ticket on conflation of transient and terminal errors in
AsyncJobStateManager.update_state.

This ticket is about the configuration gap that makes the transient failure
happen in the first place.

Why the host-level keepalive fix alone isn't sufficient

This deployment already has sysctl-level keepalive tuning applied at the host:

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9

That's the right fix for the cloud firewall issue, but it only takes effect on
sockets where the application has called setsockopt(SO_KEEPALIVE, 1). The
Celery broker path does this (via socket_keepalive in
CELERY_BROKER_TRANSPORT_OPTIONS). The django-redis cache path does not.

A host-level fix cannot compensate for an application that never opts its
sockets in.
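
A quick way to see that opt-in requirement in isolation (plain Python, no redis involved): a freshly created socket reports SO_KEEPALIVE as 0 regardless of the sysctls above, and the kernel only starts honoring them after the application flips it to 1:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # 0: kernel will never probe this socket
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)      # the opt-in the cache path never performs
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # 1: host keepalive sysctls now apply
s.close()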

Proposed fix

Add socket_keepalive and socket_keepalive_options to the CACHES default
options (via CONNECTION_POOL_KWARGS, which django-redis hands straight to
redis-py's connection pool), mirroring what the Celery broker already does:

import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # Keepalive TCP options: prevents cloud firewalls from silently
            # dropping pooled Redis connections after their idle timeout.
            # Matches CELERY_BROKER_TRANSPORT_OPTIONS.socket_settings.
            # Passed via CONNECTION_POOL_KWARGS because django-redis forwards
            # that dict verbatim to redis-py's ConnectionPool.
            "CONNECTION_POOL_KWARGS": {
                "socket_keepalive": True,
                "socket_keepalive_options": {
                    socket.TCP_KEEPIDLE: 60,
                    socket.TCP_KEEPINTVL: 10,
                    socket.TCP_KEEPCNT: 9,
                },
            },
            # Optional but recommended: also set a socket connect timeout
            # so a hung DNS lookup or TCP handshake doesn't block the worker forever.
            "SOCKET_CONNECT_TIMEOUT": 5,
        },
    }
}

django-redis forwards CONNECTION_POOL_KWARGS verbatim to the underlying
redis.ConnectionPool, so socket_keepalive and socket_keepalive_options end up
on every redis.Connection the pool creates. At connect time the connection
enables SO_KEEPALIVE on its socket and then applies each entry in
socket_keepalive_options via setsockopt(IPPROTO_TCP, opt, val).
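
The keepalive handling inside redis-py amounts to roughly the following (a simplified paraphrase of what Connection does at connect time, not a verbatim copy of redis/connection.py):

import socket

def enable_keepalive(sock: socket.socket, keepalive_options: dict) -> None:
    """Paraphrase of redis-py's connect-time keepalive setup."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for opt, val in keepalive_options.items():
        sock.setsockopt(socket.IPPROTO_TCP, opt, val)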

Reference: https://github.com/jazzband/django-redis/blob/master/django_redis/pool.py
and https://redis-py.readthedocs.io/en/stable/connections.html (look for
socket_keepalive and socket_keepalive_options).

Verification after the fix

After deploying, run inside the celery worker container:

ss -ton state established dport = :6379

Every established Redis connection should now show a timer:(keepalive,...)
entry. No more bare connections.
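
Optionally, a spot check from a Django shell; note this reaches into a private redis-py attribute (_sock), so it is a debugging aid rather than a test:

import socket
from django_redis import get_redis_connection

client = get_redis_connection("default")               # raw redis-py client behind the "default" cache
conn = client.connection_pool.get_connection("PING")   # checks out (and connects) a pooled connection
try:
    keepalive = conn._sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    print("SO_KEEPALIVE:", keepalive)                  # expect 1 after the fix, 0 before
finally:
    client.connection_pool.release(conn)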

Scope clarification

This is one of two bugs that together caused a production job to fail:

  1. (this ticket) Cloud firewall drops idle Redis cache connections because
    django-redis sockets aren't SO_KEEPALIVE-enabled.
  2. (companion ticket) AsyncJobStateManager.update_state() treats a
    RedisError the same as "state genuinely missing", so a single transient
    kills the job with no retry.

Fixing (1) makes the transient much rarer. Fixing (2) makes the remaining
transients survivable. Fix both.
