fix(uffd): re-read page state inside worker goroutine under settleRequests.RLock#2514
Closed
ValentaTomas wants to merge 1 commit intofeat/uffd-page-tracker-and-remove-eventsfrom
Closed
Conversation
…uests.RLock The Serve() loop previously read pageTracker state and captured `source = u.src` in the parent loop, then dispatched a worker goroutine. A REMOVE event for the same page that arrived between the state read and the worker actually acquiring settleRequests.RLock() would silently leave the worker with a stale `source = u.src` snapshot. The worker would then UFFDIO_COPY src bytes into a page the kernel had just MADV_DONTNEED'd, leaving pageTracker == removed and the kernel page mapped with stale src data — and observably deadlocking parent madvise() in the orchestrator unit-test suite. Move the state lookup and source capture inside the goroutine, after RLock(). The read+act+commit sequence is now atomic with respect to the REMOVE batch (which takes settleRequests.Lock()). # Conflicts: # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 8a88dbf. Bugbot is set up for automated code reviews on this repo. Configure here. |
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR is the fix side of a stacked pair targeting
main. It dependson PR #2513 (
feat/uffd-page-tracker-and-remove-events), which lifts theminimum production subset of UFFD code from #1896 plus the test harness
and the deterministic race tests that demonstrate this bug. Merge this
stack as a unit: PR #2513 first, then this PR. Otherwise PR #2513's
TestStaleSourceRaceMissingAndRemovefails onmainand CI goes red.The bug
The
Serve()loop on PR #2513 (and onfeat/free-page-reporting) readspageTrackerstate and capturessource = u.srcin the parent loop,then dispatches a worker goroutine. A
UFFD_EVENT_REMOVEfor the samepage that arrives in the window between the parent-loop state read and
the worker actually acquiring
settleRequests.RLock()silently leavesthe worker with a stale
source = u.srcsnapshot. The worker thenUFFDIO_COPYsu.srcbytes into a page the kernel has justMADV_DONTNEED'd, leavingpageTracker == removedbut the kernel pagere-mapped with stale src data. Observably this also deadlocks parent
madvise()in the orchestrator unit test suite under tight enoughinterleavings.
The fix
Move the
pageTracker.get(addr)lookup and thesource = u.srccaptureinside the worker goroutine, after acquiring
settleRequests.RLock(). Now the read+act+commit sequence(state lookup → faultPage → setState) is atomic with respect to any
concurrent REMOVE batch (which takes
settleRequests.Lock()). One filechanged, +30 / -21 lines.
Verification
Run on this branch (sudo, Linux, Go 1.25.9):
removedstate, and deterministic race tests #2513 pass:TestStaleSourceRaceMissingAndRemove(4k + hugepage) — wasgot 0xc3pre-fix; now zero-fills as expected.TestNoMadviseDeadlockWithInflightCopy(4k + hugepage).TestFaultedShortCircuitOrdering(4k + hugepage).0.302stest wallclock /1.6stotal.-count=20 -timeout=30ssoak across all three: passes in5.5s.pkg/sandbox/uffd/userfaultfd/...suite (incl.TestRemove*,TestRemoveThenWriteGated,TestWriteThenRemoveGated,TestMissing*,TestMissingWrite*,TestAsyncWriteProtection,TestSerial*,TestParallel*): all pass post-fix in0.93s.Related
removedstate, and deterministic race tests #2513 (feat/uffd-page-tracker-and-remove-events).This PR is what makes its
TestStaleSourceRaceMissingAndRemoveassertion
got 0xc3 (sentinel) → want 0flip green.feat/free-page-reporting). When thisstack merges, uffd: add support for UFFD_EVENT_REMOVE events #1896 will rebase on top and drop both lifted commits
and the original in-tree fix.
c8b3d63in the review thread.Stacking
feat/uffd-page-tracker-and-remove-events(PR feat(uffd): add REMOVE-event handling, pageTrackerremovedstate, and deterministic race tests #2513).GitHub will render this PR's diff as the single-file production fix
on top of PR feat(uffd): add REMOVE-event handling, pageTracker
removedstate, and deterministic race tests #2513's lift + harness + tests.removedstate, and deterministic race tests #2513 merges tomain, this PR's base will auto-targetmain.Test plan
go build ./...clean.go vet ./...clean.sudo go test -run 'Race|Deadlock|Ordering' -count=1— all pass.sudo go test -run 'Race|Deadlock|Ordering' -count=20 -timeout=30s— soak passes (5.5s).sudo go test ./pkg/sandbox/uffd/userfaultfd/...— full suite passes.