Skip to content

Swartzn/fix/multicursor ring buffer gc#331

Open
swartzn wants to merge 15 commits into
iamjoe/feat/rst-auto-syncfrom
swartzn/fix/multicursor-ring-buffer-gc
Open

Swartzn/fix/multicursor ring buffer gc#331
swartzn wants to merge 15 commits into
iamjoe/feat/rst-auto-syncfrom
swartzn/fix/multicursor-ring-buffer-gc

Conversation

@swartzn

@swartzn swartzn commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@iamjoemccormick I found some issues in Watch and since they were not directly part of your PR, I create this one.

What does this PR do / why do we need it?

Required for all PRs.

Related Issue(s)

Required when applicable.

Where should the reviewer(s) start reviewing this?

Only required for larger PRs when this may not be immediately obvious.

Are there any specific topics we should discuss before merging?

Not required.

What are the next steps after this PR?

Not required.

Checklist before merging:

Required for all PRs.

When creating a PR these are items to keep in mind that cannot be checked by GitHub actions:

  • Documentation:
    • Does developer documentation (code comments, readme, etc.) need to be added or updated?
    • Does the user documentation need to be expanded or updated for this change?
  • Testing:
    • Does this functionality require changing or adding new unit tests?
    • Does this functionality require changing or adding new integration tests?
  • Git Hygiene:

For more details refer to the Go coding standards and the pull request process.

iamjoemccormick and others added 15 commits May 27, 2026 16:58
…ons to beegfs

When restoring an offloaded file, failing to set the data state to available will now return an
error which in turn should fail the job. This is important for automatic restores so subsequent
requests are rejected instead of constantly spamming job requests.
While integrating Watch with Remote, a few updates were needed:

- Detect the meta node ID and include it on the gRPC stream context.
- Default to waiting for subscribers to ack the last event received before streaming events.
Protobuf updates required for RST auto syncing including:

* Settng restore policies and cooldowns on job requests.
* Watch support for event filters.
Bugs in the RST push+stub flow could cause state corruption or local data loss when a job is
cancelled under certain race conditions:

1. Remote's UpdateWork() had no terminal state guard. A late-arriving COMPLETED work result (e.g.
   from Sync journal replay after restart or a gRPC context cancellation race) could trigger
   job.Complete() on an already-cancelled job, overwriting the CANCELLED state and violating the
   user's cancel intent. For multipart uploads this typically results in a FAILED job (the multipart
   was already aborted so finishUpload fails), but the state corruption makes job history confusing
   and difficult to reason about. For non-multipart uploads this could create a stub file after the
   job was cancelled, though data loss should not occur since the contents were already synced to
   the bucket. Fixed by checking job.InTerminalState() before processing. The work result is still
   persisted for inspection, but no completion logic runs.

2. Sync's gRPC server discarded work results when the work manager returned both a result and an
   error. This happens when Remote tries to cancel already-COMPLETED work — the manager returns the
   COMPLETED result alongside an error, but the server returned only the gRPC error. Remote never
   learned the work was COMPLETED and set the state to UNKNOWN. Fixed by returning the work result
   without a gRPC error when the manager provides one, so Remote sees the actual state.

Assisted-by: Claude:claude-opus-4-6

3. updateRstCfg wrapped the sentinel with %s and only the inner error with %w, so errors.Is(err,
   ErrJobAlreadyOffloaded) returned false when updateRstConfig failed. That promoted the error to
   ErrJobFailedPrecondition downstream and tripped the GenerateWorkRequests lock-clear defer,
   leaving a stubbed-but-unlocked file. Wrap both with %w so the sentinel stays in the unwrap chain
   and the defer correctly skips this case.

Assisted-by: Claude:claude-opus-4-7
Assisted-by: Claude:claude-sonnet-4-6
The subscriber service is split from the gRPC server so optionally the service can be reused with an
existing gRPC server.
Wraps the subscriber service adding the ability to dispatch default or event specific functions as
events are received. Can be wired to either an existing gRPC server, or used to setup a new one.

Optional rate limits can be defined for all users to limit what event types are dispatched within a
configurable time window. These limits can be overridden for specific or ranges of user IDs for all
or a subset of event types. By default no events for any user are dispatched.
Wire the Watch event dispatcher + subscriber service into Remote, and define a dispatch function.
This ensures consistent behavior when setting file and directory configuration, and lets us avoid
repeatedly checking the euid after startup.
By line 219, b.buffer[currentStart] = nil had already cleared the dropped event's slot at line 201,
and the function returns &seqID taken from b.buffer[b.start].SeqId was the NEW oldest event's seqID,
not the dropped event's. handleV2Connection then logs a misleading droppedSeqID in the warning. Not
introduced by this PR but now more prominent given the surrounding ringbuffer changes.

Assisted-by: Claude:claude-opus-4-7
Originally the handshake protocol-version check used && instead of || which would only reject when
both major and minor version were different, so a 3.0 meta would work with a 2.0 watch.

The fix just stops checking the minor version in the handshake so the minor version can be bumped to
indicate new fields were added to the end of messages, without breaking compatibility with older
Watch versions that should just ignore those fields.
* Fixed wrapped cursor ordering: getOldestAckCursor() now computes ring-buffer distance
  correctly when end has wrapped, so GC chooses the actual oldest unacknowledged cursor.
* Fixed false dropped-event reporting: Now returns nil when GC drops an event with either
  no subscribers or there was a race on the oldestAckCursor and no event was actually
  unacknowledged.
Remove redundant conditionals and unreachable code from MultiCursorRingBuffer's AckEvent and searchIndexOfSeqID.
Fix or remove inaccurate or stale comments.
@swartzn swartzn requested a review from iamjoemccormick June 9, 2026 12:01
@swartzn swartzn requested a review from a team as a code owner June 9, 2026 12:01
@iamjoemccormick iamjoemccormick force-pushed the iamjoe/feat/rst-auto-sync branch from c441a4a to f80bd1f Compare June 15, 2026 20:47
@iamjoemccormick iamjoemccormick added bug Something isn't working backport labels Jun 16, 2026
@iamjoemccormick iamjoemccormick force-pushed the iamjoe/feat/rst-auto-sync branch 3 times, most recently from ea3eb0b to af0d0d3 Compare June 20, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants