
Fix on_up() destroying healthy pool after replace-with-same-IP #879

Draft
bitpathfinder wants to merge 1 commit into scylladb:master from bitpathfinder:fix/scylladb-833-on-up-stale-host


@bitpathfinder bitpathfinder commented May 7, 2026

Summary

When a node is replaced with the same IP, the driver receives both TOPOLOGY_CHANGE NEW_NODE and STATUS_CHANGE UP events. The NEW_NODE handler runs first, replacing the old host and establishing a new pool via on_add. The STATUS_CHANGE UP handler fires later (after a random 0-2s delay) with a stale reference to the old host object. Because Host.__eq__/__hash__ are endpoint-based, the stale on_up() tears down the new host's pool, causing a brief window where queries fail with NoHostAvailable.
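
The endpoint-based equality is what lets a stale object reach the new host's pool. The following is a minimal sketch of that effect, not the driver's code: `FakeHost` is a hypothetical stand-in for `Host`, and a plain dict stands in for the session's pool map.

```python
class FakeHost:
    """Hypothetical stand-in for Host: equality and hash use only the endpoint."""

    def __init__(self, endpoint, host_id):
        self.endpoint = endpoint
        self.host_id = host_id

    def __eq__(self, other):
        return isinstance(other, FakeHost) and self.endpoint == other.endpoint

    def __hash__(self):
        return hash(self.endpoint)


old_host = FakeHost("10.0.0.5", host_id="old-uuid")
new_host = FakeHost("10.0.0.5", host_id="new-uuid")  # replacement node, same IP

# Pool map keyed by host, as it would look after the NEW_NODE / on_add path ran.
pools = {new_host: "healthy pool for new-uuid"}

# The delayed STATUS_CHANGE UP handler still holds old_host, yet the lookup
# and the removal both hit the entry that belongs to new_host.
stale_lookup = pools.get(old_host)
removed = pools.pop(old_host, None)
```

Because the two objects compare equal, any teardown keyed on the stale object removes the pool that was just created for the replacement node.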

Fix

Two early-return guards added at the top of Cluster.on_up():

  1. Stale host check: If the host object has been replaced in metadata (different identity, same endpoint) and the new host is already up, skip processing.
  2. Healthy pool check: If a non-shutdown pool already exists for this host (established by on_add), mark the host as up and skip the teardown/rebuild cycle.
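
The two guards can be sketched as a single decision function. This is an illustration under stated assumptions rather than the patch itself: `FakeHost`, `FakePool`, `should_skip_on_up`, and the two dicts keyed by endpoint are hypothetical names that stand in for the driver's metadata and pool lookups.

```python
class FakeHost:
    """Hypothetical host with only the fields the guards need."""

    def __init__(self, endpoint, is_up=False):
        self.endpoint = endpoint
        self.is_up = is_up

    def set_up(self):
        self.is_up = True


class FakePool:
    """Hypothetical pool exposing only a shutdown flag."""

    def __init__(self, is_shutdown=False):
        self.is_shutdown = is_shutdown


def should_skip_on_up(host, hosts_by_endpoint, pools_by_endpoint):
    """Return the reason on_up should return early, or None to proceed."""
    current = hosts_by_endpoint.get(host.endpoint)

    # Guard 1: metadata holds a different object for this endpoint and that
    # object is already up, so the caller's reference is stale.
    if current is not None and current is not host and current.is_up:
        return "stale-host"

    # Guard 2: a non-shutdown pool already exists (for example from on_add),
    # so mark the host up and avoid the teardown/rebuild cycle.
    pool = pools_by_endpoint.get(host.endpoint)
    if pool is not None and not pool.is_shutdown:
        host.set_up()
        return "healthy-pool"

    return None


# Replace-with-same-IP: the stale object is rejected by guard 1.
old = FakeHost("10.0.0.5")
new = FakeHost("10.0.0.5", is_up=True)
stale_result = should_skip_on_up(old, {"10.0.0.5": new}, {"10.0.0.5": FakePool()})

# Current host with a live pool: guard 2 marks it up and skips the rebuild.
fresh = FakeHost("10.0.0.6")
healthy_result = should_skip_on_up(fresh, {"10.0.0.6": fresh}, {"10.0.0.6": FakePool()})

# Current host whose pool was shut down: neither guard applies.
down = FakeHost("10.0.0.7")
proceed_result = should_skip_on_up(down, {"10.0.0.7": down}, {"10.0.0.7": FakePool(is_shutdown=True)})
```

Guard 1 compares by identity (`is not`) on purpose, since equality alone cannot tell the stale object from its replacement.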

Fixes: SCYLLADB-833

@bitpathfinder bitpathfinder requested review from Copilot and dkropachev and removed request for dkropachev May 7, 2026 16:56
@bitpathfinder bitpathfinder changed the title Fix on_up() destroying healthy pool after replace-with-same-IP (SCYLLADB-833) Fix on_up() destroying healthy pool after replace-with-same-IP May 7, 2026
@bitpathfinder bitpathfinder force-pushed the fix/scylladb-833-on-up-stale-host branch from c1f7d6b to 582e86d Compare May 7, 2026 17:00

Copilot AI left a comment


Pull request overview

Addresses SCYLLADB-833 where Cluster.on_up() can be invoked with a stale Host instance after a replace-with-same-IP, causing the driver to tear down a healthy pool and briefly fail queries.

Changes:

  • Add early-return guards in Cluster.on_up() to skip handling when the Host reference is stale or when a healthy pool already exists.
  • Add unit tests covering stale-host and healthy-pool scenarios to prevent regressions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cassandra/cluster.py | Adds on_up() guard clauses to avoid tearing down an already-healthy pool when handling stale host references. |
| tests/unit/test_cluster.py | Adds unit tests validating the new on_up() early-return behavior for stale hosts and pre-existing healthy pools. |


When a node is replaced with the same IP, the driver receives both
TOPOLOGY_CHANGE NEW_NODE and STATUS_CHANGE UP events. The NEW_NODE
handler runs first, replacing the old host and establishing a new pool.
The STATUS_CHANGE UP handler fires later with a stale reference to the
old host object. Because Host.__eq__/__hash__ are endpoint-based, the
stale on_up() tears down the new host's pool, causing a brief window
where queries fail with NoHostAvailable.

Add two guards at the top of on_up():
1. If the host has been replaced in metadata (different object, same
   endpoint, new host already up), skip processing.
2. If a healthy (non-shutdown) pool already exists for this host,
   call set_up() and skip the teardown/rebuild cycle.

Both guards reset _currently_handling_node_up under host.lock,
consistent with the existing cleanup paths.
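
Why the flag reset matters can be shown with a small sketch. It assumes, as the message above implies, that on_up() sets `_currently_handling_node_up` under `host.lock` on entry and ignores calls while it is set; `FakeHost` and `on_up_sketch` are hypothetical names, not the driver's code.

```python
import threading


class FakeHost:
    """Hypothetical host carrying only the lock and the in-progress flag."""

    def __init__(self):
        self.lock = threading.RLock()
        self._currently_handling_node_up = False


def on_up_sketch(host, guard_hit):
    """Model the entry check and an early-return guard of on_up()."""
    with host.lock:
        if host._currently_handling_node_up:
            return "already-handling"
        host._currently_handling_node_up = True

    if guard_hit:
        # Early return: clear the flag under the lock so that a later,
        # legitimate UP event is not mistaken for one already in progress.
        with host.lock:
            host._currently_handling_node_up = False
        return "skipped"

    # Full path: in this sketch the flag stays set, standing in for the
    # cleanup that would run once pool creation completes.
    return "proceed"


host = FakeHost()
first = on_up_sketch(host, guard_hit=True)
second = on_up_sketch(host, guard_hit=True)    # handled again, flag was reset
third = on_up_sketch(host, guard_hit=False)    # takes the full path
fourth = on_up_sketch(host, guard_hit=False)   # blocked while flag is set
```

Without the reset in the early-return branch, the second call would return "already-handling" and every later UP event for that host would be dropped.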

Refs: SCYLLADB-833
@bitpathfinder bitpathfinder force-pushed the fix/scylladb-833-on-up-stale-host branch from 582e86d to c9b8e5e Compare May 7, 2026 17:05
@bitpathfinder bitpathfinder self-assigned this May 7, 2026
@bitpathfinder bitpathfinder marked this pull request as ready for review May 7, 2026 17:16
@bitpathfinder bitpathfinder marked this pull request as draft May 7, 2026 17:54