Fix overlay neighbor replay for inactive endpoints#322
Conversation
There was a problem hiding this comment.
Pull request overview
This PR prevents overlay ARP/FDB neighbor state from being (re)applied for endpoints marked Active=false, aligning overlay neighbor replay behavior with existing traffic selection rules (which already ignore inactive endpoints). This mitigates incidents where a leaked inactive endpoint sharing the same target IP as an active endpoint can disrupt connectivity during reconciliation or watch-driven replay.
Changes:
- Skip neighbor ensure/replay for inactive endpoints during network reconciliation (
EnsureEndpointsNeigh) and in the endpoint watch listener (PUT events). - Add a shared helper (
shouldReplayEndpointNeigh) to centralize the “eligible for neighbor replay” rule. - Add regression tests covering both reconciliation and watch behavior when inactive endpoints are present (including the duplicate target IP scenario).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| network/overlay/neigh.go | Filters inactive endpoints out of neighbor ensure/add paths via a shared helper. |
| network/overlay/neigh_test.go | Adds coverage ensuring reconciliation ignores inactive endpoints (including same target IP as an active one). |
| network/overlay/listener.go | Ignores PUT events for inactive endpoints so they aren’t replayed into overlay neighbor state. |
| network/overlay/listener_test.go | Extends listener tests to cover active vs inactive PUT handling and updates context usage/imports accordingly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
aa973e4 to
4970781
Compare
4970781 to
2312f3c
Compare
EtienneM
left a comment
There was a problem hiding this comment.
question: do we agree that it this fix ensures that a SAND passive endpoint does not interfere with the good working of database core? But the fact that we leak these endpoints is not addressed here?
| func shouldReplayEndpointNeigh(endpoint types.Endpoint) bool { | ||
| return endpoint.Active | ||
| } |
There was a problem hiding this comment.
suggestion: do we really need a function for that? It feels like a useless indirection. I would remove it for more clarity
There was a problem hiding this comment.
I did think about this, and hesitated. In the end I decided to keep it. I'll remove it
| if !shouldReplayEndpointNeigh(endpoint) { | ||
| log.Info("skip inactive endpoint ARP/FDB replay") | ||
| continue | ||
| } |
There was a problem hiding this comment.
suggestion: This is already checked in the call to AddEndpointNeigh, we don't really need to check again here
| if !shouldReplayEndpointNeigh(endpoint) { | |
| log.Info("skip inactive endpoint ARP/FDB replay") | |
| continue | |
| } |
| @@ -32,7 +35,12 @@ func (m manager) EnsureEndpointsNeigh(ctx context.Context, network types.Network | |||
| } | |||
|
|
|||
| func (m manager) AddEndpointNeigh(ctx context.Context, network types.Network, endpoint types.Endpoint) error { | |||
There was a problem hiding this comment.
suggestion: while you are here, it seems that AddEndpointNeigh can be private
There was a problem hiding this comment.
We can't make this private because overlay/manager implements the netmanager.NetManager interface which defines the function
sand/network/netmanager/netmanager.go
Line 19 in 35ebea4
| ctx = logger.ToCtx(ctx, logger.Get(ctx).WithField("neighbor_action", "add")) | ||
| ctx, log := logger.WithFieldToCtx(ctx, "neighbor_action", "add") | ||
| if !shouldReplayEndpointNeigh(endpoint) { | ||
| log.Info("skip inactive endpoint ARP/FDB replay") |
There was a problem hiding this comment.
question: I am not familiar with SAND wording, but it feels odd that the log message mentions "ARP/FDB replay" in a function named AddEndpointNeigh. Is it really "ARP/FDB replay" that we skip? Isn't it something else?
| log.Info("skip inactive endpoint ARP/FDB replay") | |
| log.Info("Skip inactive endpoint ARP/FDB replay") |
There was a problem hiding this comment.
Log message updated in ae33747
Regarding the "ARP/FDB replay" as part of the message. both calling functions mention "ARP/FDB" in the case of error. It makes sense to me to keep the context in the log message if the inactive endpoint is skipped during these loops.
|
Changes in response to review pushed in ae33747 On the scope question:
yes, this fix is meant to prevent passive/inactive endpoints from being replayed into overlay neighbor state and interfering with active traffic. It does not address the endpoint leak itself. |
Summary
Closes #321