Skip to content

Fix overlay neighbor replay for inactive endpoints#322

Open
sc-zenokerr wants to merge 3 commits into
masterfrom
fix/321/ignore-inactive-overlay-replay
Open

Fix overlay neighbor replay for inactive endpoints#322
sc-zenokerr wants to merge 3 commits into
masterfrom
fix/321/ignore-inactive-overlay-replay

Conversation

@sc-zenokerr
Copy link
Copy Markdown
Contributor

Summary

  • skip overlay neighbor replay for inactive endpoints during ensure/reconciliation
  • ignore inactive endpoint PUT events in the overlay listener
  • add regression coverage for active and inactive endpoints sharing the same target IP

Closes #321

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents overlay ARP/FDB neighbor state from being (re)applied for endpoints marked Active=false, aligning overlay neighbor replay behavior with existing traffic selection rules (which already ignore inactive endpoints). This mitigates incidents where a leaked inactive endpoint sharing the same target IP as an active endpoint can disrupt connectivity during reconciliation or watch-driven replay.

Changes:

  • Skip neighbor ensure/replay for inactive endpoints during network reconciliation (EnsureEndpointsNeigh) and in the endpoint watch listener (PUT events).
  • Add a shared helper (shouldReplayEndpointNeigh) to centralize the “eligible for neighbor replay” rule.
  • Add regression tests covering both reconciliation and watch behavior when inactive endpoints are present (including the duplicate target IP scenario).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
network/overlay/neigh.go Filters inactive endpoints out of neighbor ensure/add paths via a shared helper.
network/overlay/neigh_test.go Adds coverage ensuring reconciliation ignores inactive endpoints (including same target IP as an active one).
network/overlay/listener.go Ignores PUT events for inactive endpoints so they aren’t replayed into overlay neighbor state.
network/overlay/listener_test.go Extends listener tests to cover active vs inactive PUT handling and updates context usage/imports accordingly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sc-zenokerr sc-zenokerr force-pushed the fix/321/ignore-inactive-overlay-replay branch from aa973e4 to 4970781 Compare May 5, 2026 15:45
@sc-zenokerr sc-zenokerr force-pushed the fix/321/ignore-inactive-overlay-replay branch from 4970781 to 2312f3c Compare May 5, 2026 15:53
@sc-zenokerr sc-zenokerr requested review from EtienneM and SCKevinO May 5, 2026 16:00
Copy link
Copy Markdown
Member

@EtienneM EtienneM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do we agree that it this fix ensures that a SAND passive endpoint does not interfere with the good working of database core? But the fact that we leak these endpoints is not addressed here?

Comment thread network/overlay/neigh.go Outdated
Comment on lines +116 to +118
func shouldReplayEndpointNeigh(endpoint types.Endpoint) bool {
return endpoint.Active
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: do we really need a function for that? It feels like a useless indirection. I would remove it for more clarity

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about this, and hesitated. In the end I decided to keep it. I'll remove it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in ae33747

Comment thread network/overlay/neigh.go Outdated
Comment on lines +24 to +27
if !shouldReplayEndpointNeigh(endpoint) {
log.Info("skip inactive endpoint ARP/FDB replay")
continue
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This is already checked in the call to AddEndpointNeigh, we don't really need to check again here

Suggested change
if !shouldReplayEndpointNeigh(endpoint) {
log.Info("skip inactive endpoint ARP/FDB replay")
continue
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in done in ae33747

Comment thread network/overlay/neigh.go
@@ -32,7 +35,12 @@ func (m manager) EnsureEndpointsNeigh(ctx context.Context, network types.Network
}

func (m manager) AddEndpointNeigh(ctx context.Context, network types.Network, endpoint types.Endpoint) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: while you are here, it seems that AddEndpointNeigh can be private

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make this private because overlay/manager implements the netmanager.NetManager interface which defines the function

AddEndpointNeigh(context.Context, types.Network, types.Endpoint) error

Comment thread network/overlay/neigh.go Outdated
ctx = logger.ToCtx(ctx, logger.Get(ctx).WithField("neighbor_action", "add"))
ctx, log := logger.WithFieldToCtx(ctx, "neighbor_action", "add")
if !shouldReplayEndpointNeigh(endpoint) {
log.Info("skip inactive endpoint ARP/FDB replay")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I am not familiar with SAND wording, but it feels odd that the log message mentions "ARP/FDB replay" in a function named AddEndpointNeigh. Is it really "ARP/FDB replay" that we skip? Isn't it something else?

Suggested change
log.Info("skip inactive endpoint ARP/FDB replay")
log.Info("Skip inactive endpoint ARP/FDB replay")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log message updated in ae33747

Regarding the "ARP/FDB replay" as part of the message. both calling functions mention "ARP/FDB" in the case of error. It makes sense to me to keep the context in the log message if the inactive endpoint is skipped during these loops.

@sc-zenokerr
Copy link
Copy Markdown
Contributor Author

sc-zenokerr commented May 25, 2026

Changes in response to review pushed in ae33747

On the scope question:

question: do we agree that it this fix ensures that a SAND passive endpoint does not interfere with the good working of database core? But the fact that we leak these endpoints is not addressed here?

yes, this fix is meant to prevent passive/inactive endpoints from being replayed into overlay neighbor state and interfering with active traffic. It does not address the endpoint leak itself.

@sc-zenokerr sc-zenokerr requested a review from EtienneM May 25, 2026 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do not replay overlay neighbor state for inactive endpoints

3 participants