Swartzn/fix/multicursor ring buffer gc#331
Open
swartzn wants to merge 15 commits into
Open
Conversation
…ons to beegfs When restoring an offloaded file, failing to set the data state to available will now return an error which in turn should fail the job. This is important for automatic restores so subsequent requests are rejected instead of constantly spamming job requests.
While integrating Watch with Remote, a few updates were needed: - Detect the meta node ID and include it on the gRPC stream context. - Default to waiting for subscribers to ack the last event received before streaming events.
Protobuf updates required for RST auto syncing including: * Settng restore policies and cooldowns on job requests. * Watch support for event filters.
Bugs in the RST push+stub flow could cause state corruption or local data loss when a job is cancelled under certain race conditions: 1. Remote's UpdateWork() had no terminal state guard. A late-arriving COMPLETED work result (e.g. from Sync journal replay after restart or a gRPC context cancellation race) could trigger job.Complete() on an already-cancelled job, overwriting the CANCELLED state and violating the user's cancel intent. For multipart uploads this typically results in a FAILED job (the multipart was already aborted so finishUpload fails), but the state corruption makes job history confusing and difficult to reason about. For non-multipart uploads this could create a stub file after the job was cancelled, though data loss should not occur since the contents were already synced to the bucket. Fixed by checking job.InTerminalState() before processing. The work result is still persisted for inspection, but no completion logic runs. 2. Sync's gRPC server discarded work results when the work manager returned both a result and an error. This happens when Remote tries to cancel already-COMPLETED work — the manager returns the COMPLETED result alongside an error, but the server returned only the gRPC error. Remote never learned the work was COMPLETED and set the state to UNKNOWN. Fixed by returning the work result without a gRPC error when the manager provides one, so Remote sees the actual state. Assisted-by: Claude:claude-opus-4-6 3. updateRstCfg wrapped the sentinel with %s and only the inner error with %w, so errors.Is(err, ErrJobAlreadyOffloaded) returned false when updateRstConfig failed. That promoted the error to ErrJobFailedPrecondition downstream and tripped the GenerateWorkRequests lock-clear defer, leaving a stubbed-but-unlocked file. Wrap both with %w so the sentinel stays in the unwrap chain and the defer correctly skips this case. Assisted-by: Claude:claude-opus-4-7
Assisted-by: Claude:claude-sonnet-4-6
The subscriber service is split from the gRPC server so optionally the service can be reused with an existing gRPC server.
Wraps the subscriber service adding the ability to dispatch default or event specific functions as events are received. Can be wired to either an existing gRPC server, or used to setup a new one. Optional rate limits can be defined for all users to limit what event types are dispatched within a configurable time window. These limits can be overridden for specific or ranges of user IDs for all or a subset of event types. By default no events for any user are dispatched.
Wire the Watch event dispatcher + subscriber service into Remote, and define a dispatch function.
This ensures consistent behavior when setting file and directory configuration, and lets us avoid repeatedly checking the euid after startup.
By line 219, b.buffer[currentStart] = nil had already cleared the dropped event's slot at line 201, and the function returns &seqID taken from b.buffer[b.start].SeqId was the NEW oldest event's seqID, not the dropped event's. handleV2Connection then logs a misleading droppedSeqID in the warning. Not introduced by this PR but now more prominent given the surrounding ringbuffer changes. Assisted-by: Claude:claude-opus-4-7
Originally the handshake protocol-version check used && instead of || which would only reject when both major and minor version were different, so a 3.0 meta would work with a 2.0 watch. The fix just stops checking the minor version in the handshake so the minor version can be bumped to indicate new fields were added to the end of messages, without breaking compatibility with older Watch versions that should just ignore those fields.
* Fixed wrapped cursor ordering: getOldestAckCursor() now computes ring-buffer distance correctly when end has wrapped, so GC chooses the actual oldest unacknowledged cursor. * Fixed false dropped-event reporting: Now returns nil when GC drops an event with either no subscribers or there was a race on the oldestAckCursor and no event was actually unacknowledged.
Remove redundant conditionals and unreachable code from MultiCursorRingBuffer's AckEvent and searchIndexOfSeqID. Fix or remove inaccurate or stale comments.
c441a4a to
f80bd1f
Compare
ea3eb0b to
af0d0d3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
@iamjoemccormick I found some issues in Watch and since they were not directly part of your PR, I create this one.
What does this PR do / why do we need it?
Required for all PRs.
Related Issue(s)
Required when applicable.
Where should the reviewer(s) start reviewing this?
Only required for larger PRs when this may not be immediately obvious.
Are there any specific topics we should discuss before merging?
Not required.
What are the next steps after this PR?
Not required.
Checklist before merging:
Required for all PRs.
When creating a PR these are items to keep in mind that cannot be checked by GitHub actions:
For more details refer to the Go coding standards and the pull request process.