Summary
config/settings/base.py configures CELERY_BROKER_TRANSPORT_OPTIONS with a full
socket_keepalive: True + socket_settings: {TCP_KEEPIDLE: 60, TCP_KEEPINTVL: 10, TCP_KEEPCNT: 9}
block (lines 402–413). This protects the Celery→broker connection from the
1-hour idle-connection drop that stateful cloud firewalls impose.
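For reference, that broker-side block looks roughly like this (a reconstruction from the description above, not a verbatim copy; the authoritative version is at config/settings/base.py lines 402–413):

import socket

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_settings": {
        socket.TCP_KEEPIDLE: 60,   # seconds of idle before the first probe
        socket.TCP_KEEPINTVL: 10,  # seconds between probes
        socket.TCP_KEEPCNT: 9,     # failed probes before the peer is declared dead
    },
}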
The CACHES block on lines 253–264 does not set any keepalive options:
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # no socket_keepalive, no socket_keepalive_options
        },
    }
}
redis-py does not set SO_KEEPALIVE by default. Without SO_KEEPALIVE, the
kernel never sends TCP keepalive probes on the socket — regardless of what
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
are set to at the host level. The host-level sysctls only apply to sockets that
have explicitly enabled SO_KEEPALIVE.
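redis-py does expose the opt-in; it is just off by default. A minimal standalone sketch of a client that enables it (host, port, and values here are illustrative):

import socket

import redis

# socket_keepalive / socket_keepalive_options are standard redis-py
# Connection kwargs; every socket the pool opens gets SO_KEEPALIVE plus
# the per-socket TCP_KEEP* overrides below.
client = redis.Redis(
    host="redis",
    port=6379,
    socket_keepalive=True,
    socket_keepalive_options={
        socket.TCP_KEEPIDLE: 60,   # idle seconds before the first probe
        socket.TCP_KEEPINTVL: 10,  # seconds between probes
        socket.TCP_KEEPCNT: 9,     # probes before the kernel gives up
    },
)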
Result: Redis cache connections pooled by django-redis can sit idle for more
than the cloud firewall's connection-tracking timeout (typically 3600s on
OpenStack Neutron-based platforms), get silently dropped, and then fail with
ECONNRESET the next time the pool hands them out.
Direct observation
On a production worker, we inspected the established TCP connections from the
celery worker container to redis:6379:
$ ss -ton state established dport = :6379
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 172.x.x.x:35296 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:37462 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:47014 192.168.x.x:6379 timer:(keepalive,99min,0)
0 0 172.x.x.x:47004 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:37482 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:35320 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:37478 192.168.x.x:6379 timer:(keepalive,100min,0)
0 0 172.x.x.x:37456 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:60822 192.168.x.x:6379 ← no keepalive timer
0 0 172.x.x.x:60842 192.168.x.x:6379 timer:(keepalive,99min,0)
5 of the 10 connections have no keepalive timer. Those are django-redis
cache connections. The other 5 are Celery result-backend connections which
inherit keepalive from Celery's transport options. The split is ~50/50 at
steady state.
Any of the unprotected 5 can be silently dropped by the cloud firewall after
it sits in the connection pool past the firewall's idle timeout.
Concrete incident
A production async_api job was processing successfully — 507 / 840 images
completed, 0 failures, thousands of detections and classifications already
saved to the database, celery worker logs showing process_nats_pipeline_result
tasks succeeding at sub-second intervals.
Then one task hit a Redis connection reset:
20:39:51,279 ERROR Redis error updating job <id> state:
Error while reading from redis:6379 : (104, 'Connection reset by peer')
20:39:51,432 ERROR Job <id> marked as FAILURE: Redis state missing for job
Only 153 ms elapsed between the transient error and _fail_job being called. Other tasks on
the same celery worker had hit Redis successfully milliseconds before and
milliseconds after this event — it was a single connection, not a Redis
outage.
The failure path (_fail_job in ami/jobs/tasks.py) then:
- Marked the job FAILURE
- Wiped the Redis state
- Deleted the NATS stream and consumer
…rendering the job irrecoverable. The hundreds of already-processed images are
still in the database, but the job record says FAILURE with 0 failed images,
which is confusing at best.
The code-path brittleness (update_state mapping RedisError to return None,
which the caller treats as permanent) is tracked separately — see the companion
ticket on conflation of transient and terminal errors in
AsyncJobStateManager.update_state.
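For context only, the brittle shape looks roughly like this. This is an illustrative reconstruction from the behavior described above, not the actual code in ami/jobs/tasks.py:

from redis.exceptions import RedisError

class AsyncJobStateManager:
    def _read_state(self, job_id: str) -> dict:
        # Hypothetical stand-in for the real Redis read.
        raise NotImplementedError

    def update_state(self, job_id: str) -> dict | None:
        try:
            return self._read_state(job_id)
        except RedisError:
            # A transient I/O error is collapsed into the same signal as
            # "state genuinely missing"; the caller cannot tell them apart.
            return None

# Caller (simplified): a None return triggers the permanent failure path,
# so one reset connection becomes a terminal job FAILURE.
#
#     state = manager.update_state(job_id)
#     if state is None:
#         _fail_job(job_id)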
This ticket is about the config gap that makes the transient happen in the
first place.
Why the host-level keepalive fix alone isn't sufficient
This deployment already has sysctl-level keepalive tuning applied at the host:
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
That's the right fix for the cloud firewall issue, but it only takes effect on
sockets where the application has called setsockopt(SO_KEEPALIVE, 1). The
Celery broker path does this (via socket_keepalive in
CELERY_BROKER_TRANSPORT_OPTIONS). The django-redis cache path does not.
A host-level fix cannot compensate for an application that never opts its
sockets in.
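To see the opt-in requirement directly: a freshly created socket reports SO_KEEPALIVE as 0 no matter what the sysctls say. A quick sketch (Linux-specific behavior):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Prints 0: the kernel will never send probes on this socket, even though
# net.ipv4.tcp_keepalive_* are tuned at the host level.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))

# The application has to opt in explicitly:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # now 1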
Proposed fix
Add TCP keepalive to the CACHES default options, mirroring what the Celery
broker already does. Note that django-redis has no top-level SOCKET_KEEPALIVE
option; the keepalive kwargs go through CONNECTION_POOL_KWARGS, which
django-redis forwards to the redis-py connection pool:

import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": env("REDIS_URL", default=None),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            "IGNORE_EXCEPTIONS": True,
            # TCP keepalive — prevents cloud firewalls from silently
            # dropping pooled Redis connections after their idle timeout.
            # Matches the keepalive values in CELERY_BROKER_TRANSPORT_OPTIONS.
            "CONNECTION_POOL_KWARGS": {
                "socket_keepalive": True,
                "socket_keepalive_options": {
                    socket.TCP_KEEPIDLE: 60,
                    socket.TCP_KEEPINTVL: 10,
                    socket.TCP_KEEPCNT: 9,
                },
            },
            # Optional but recommended: also set a socket connect timeout
            # so a hung DNS lookup / TCP handshake doesn't block the worker
            # forever. SOCKET_CONNECT_TIMEOUT is a documented django-redis
            # option that maps to redis-py's socket_connect_timeout.
            "SOCKET_CONNECT_TIMEOUT": 5,
        },
    }
}
django-redis merges CONNECTION_POOL_KWARGS into the keyword arguments of the
redis-py ConnectionPool it builds. Every redis.Connection created from that
pool then calls setsockopt(SO_KEEPALIVE, 1) at connect time and applies each
entry in socket_keepalive_options via setsockopt(SOL_TCP, opt, val).
Reference: https://github.com/jazzband/django-redis/blob/master/django_redis/pool.py
and https://redis-py.readthedocs.io/en/stable/connections.html (look for
socket_keepalive and socket_keepalive_options).
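Conceptually, the connect-time handling is equivalent to the following sketch (a paraphrase of redis-py's keepalive handling, not a verbatim excerpt):

import socket

def apply_keepalive(sock: socket.socket, options: dict[int, int]) -> None:
    # Equivalent of redis.Connection's keepalive handling: enable
    # SO_KEEPALIVE, then apply each TCP_KEEP* override per socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for opt, val in options.items():
        sock.setsockopt(socket.SOL_TCP, opt, val)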
Verification after the fix
After deploying, run inside the celery worker container:
ss -ton state established dport = :6379
Every established Redis connection should now show a timer:(keepalive,...)
entry. No more bare connections.
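As a secondary check from inside Django, confirm the kwargs actually reached the pool. This leans on django-redis/redis-py internals (cache.client.get_client() and ConnectionPool.connection_kwargs), so treat it as a debugging aid, not a stable API:

# Run in `python manage.py shell` inside the worker container.
from django.core.cache import cache

client = cache.client.get_client(write=True)  # underlying redis.Redis
kwargs = client.connection_pool.connection_kwargs
print(kwargs.get("socket_keepalive"))          # expect: True
print(kwargs.get("socket_keepalive_options"))  # expect: {TCP_KEEPIDLE: 60, ...}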
Scope clarification
This is one of two bugs that together caused a production job to fail:
1. (this ticket) Cloud firewall drops idle Redis cache connections because
   django-redis sockets aren't SO_KEEPALIVE-enabled.
2. (companion ticket) AsyncJobStateManager.update_state() treats a
   RedisError the same as "state genuinely missing", so a single transient
   kills the job with no retry.
Fixing (1) makes the transient much rarer. Fixing (2) makes the remaining
transients survivable. Fix both.
Related
- setup/apply_keepalive_fix.sh in the infra repo (internal) — already applied
  the sysctls at the host level but doesn't (and can't) fix the missing
  application-level SO_KEEPALIVE opt-in.