fix(peerclient): resolve base path per-call from live FrameTable #2585

Draft
levb wants to merge 2 commits into main from lev-fix-switch-path
@levb levb commented May 7, 2026

After a peer→storage transition, peerSeekable kept opening base against
the original (uncompressed) path captured at construction. With the
recent compression work, post-transition reads now target a compressed
object, so the cached base resolved to a non-existent path.

P2P routing produced the stale binding (it captured a path while the
build was still uncompressed); it also contains the fix. peerSeekable
now holds (buildID, basic name, base provider, objType) and composes
the actual storage path at base-open time using the CompressionType from
the live FrameTable. The base seekable is reopened when the compression
type (ct) changes. No changes outside peerclient: no cache key, no
chunker, no public storage interface.

Also hoist the transition emit to the top of OpenRangeReader: returning
PeerTransitionedError no longer wastes a base open before the caller
swaps the header and retries.
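
To make the per-call path resolution concrete, here is a minimal sketch of the idea, not the code from this PR. The `CompressionType` constants, the `.zst` suffix, the `buildID/name` path layout, and the `seekable` type are all hypothetical stand-ins; the only point is that the path is derived at open time from the current compression type, and that the base is reopened when that type changes.

```go
package main

import "fmt"

// CompressionType is a hypothetical stand-in for the value read from the
// live FrameTable.
type CompressionType int

const (
	CompressionNone CompressionType = iota
	CompressionZstd
)

// suffix returns an assumed object-name suffix for a compression type.
func (ct CompressionType) suffix() string {
	if ct == CompressionZstd {
		return ".zst"
	}
	return ""
}

// basePath composes the storage path from stable identifiers plus the
// compression type that is current at the time of the call.
func basePath(buildID, name string, ct CompressionType) string {
	return buildID + "/" + name + ct.suffix()
}

// seekable holds identifiers instead of a precomputed path, so nothing
// stale is captured at construction.
type seekable struct {
	buildID, name string

	opened     bool
	openedCT   CompressionType
	openedPath string
	opens      int // counts simulated base opens, for illustration only
}

// ensureBase (re)opens the base when it has never been opened or when the
// compression type differs from the one it was opened with.
func (s *seekable) ensureBase(ct CompressionType) string {
	if !s.opened || s.openedCT != ct {
		s.openedPath = basePath(s.buildID, s.name, ct)
		s.openedCT = ct
		s.opened = true
		s.opens++
	}
	return s.openedPath
}

func main() {
	s := &seekable{buildID: "build-1", name: "memfile"}
	fmt.Println(s.ensureBase(CompressionNone)) // path while uncompressed
	fmt.Println(s.ensureBase(CompressionZstd)) // reopened after the switch
	fmt.Println(s.ensureBase(CompressionZstd), s.opens)
}
```

Keeping the identifiers rather than the path is what removes the stale binding: no path exists before the open, so none can be cached from the uncompressed phase.
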
@cla-bot cla-bot Bot added the cla-signed label May 7, 2026

codecov Bot commented May 7, 2026

❌ 4 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 2587 | 4 | 2583 | 7 |
View the full list of 8 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Flake rate in main: 56.47% (Passed 37 times, Failed 48 times)

Stack Traces | 9.8s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:44: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:44
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (9.80s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.45% (Passed 39 times, Failed 93 times)

Stack Traces | 3.4s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (3.40s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.96% (Passed 43 times, Failed 116 times)

Stack Traces | 28.5s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (28.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.20% (Passed 41 times, Failed 112 times)

Stack Traces | 0.78s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1366}}
Executing command curl in sandbox ixo7aql845x88k1nq69ku
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 00:51:26 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ija0uhlvy5dnv8su2kffz
Executing command curl in sandbox iyuovfbwsf8mcpyqkaent
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (0.78s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxFilesystemPauseResumeIntegrity

Flake rate in main: 49.33% (Passed 38 times, Failed 37 times)

Stack Traces | 0s run time
=== RUN   TestSandboxFilesystemPauseResumeIntegrity
=== PAUSE TestSandboxFilesystemPauseResumeIntegrity
=== CONT  TestSandboxFilesystemPauseResumeIntegrity
--- FAIL: TestSandboxFilesystemPauseResumeIntegrity (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause

Flake rate in main: 49.33% (Passed 38 times, Failed 37 times)

Stack Traces | 2.29s run time
=== RUN   TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
=== PAUSE TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
=== CONT  TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
Executing command bash in sandbox iinq3a1g2iqdp3fq8d4ma (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1263}}
Executing command bash in sandbox ifnvo4b50q2ewiy6nqmzy (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] completed successfully in sandbox iiovypddy28yh7nit39fr
Executing command bash in sandbox iinq3a1g2iqdp3fq8d4ma (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1269}}
    filesystem_pause_resume_integrity_test.go:43: 
        	Error Trace:	.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:114
        	            				.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:132
        	            				.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:43
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iiovypddy28yh7nit39fr: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
--- FAIL: TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause (2.29s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.50% (Passed 47 times, Failed 72 times)

Stack Traces | 75.7s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (75.74s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 64.08% (Passed 37 times, Failed 66 times)

Stack Traces | 31.1s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1262}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 185 MB\nFree memory before tmpfs mount: 799 MB\nMemory to use in integrity test (80% of free, min 64MB): 639 MB\n"}}
Executing command bash in sandbox idxswocovtsjjno41jubv (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"639+0 records in\n639+0 records out\n670040064 bytes (670 MB, 639 MiB) copied, 4.03567 s, 166 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=639\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.99\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:04.04\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2628\n\tAver"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"age resident set size (kb"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page fa"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ults: 347\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 100\n\tS"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"waps: 0\n\tFile system"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" inputs: 176\n\tFil"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e system output"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s: 0\n\tSocket m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"essages sent: 0\n\tSocket messages received: 0\n\tSign"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"als delivered: 0\n\tPage size (bytes): 4096\n\tExit s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tatus: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 832 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox idd53wstuhqsqyutrjeyq
Executing command bash in sandbox idd53wstuhqsqyutrjeyq (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1279}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"f939a9180e35c236d6e4f41e905f8ee9af9ecf285722fd59b426cf74647b4ffc\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox idd53wstuhqsqyutrjeyq
Executing command bash in sandbox idd53wstuhqsqyutrjeyq (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1282}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox idd53wstuhqsqyutrjeyq: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (31.06s)


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The removal of nil checks for the uploaded atomic pointer in OpenRangeReader and tryPeer introduces potential nil pointer dereferences. A transition to storage occurring during a peer read attempt may be missed, causing the base storage to be opened with a stale compression type; re-checking the transition state after the peer attempt is necessary to ensure the caller refreshes the compression state correctly.

// Emit at most once per peerSeekable so V3 builds (no V4 to upgrade to)
// don't loop against this error. No peer call, no base open — caller
// retries with the live compression type already in the new FrameTable.
if s.uploaded.Load() && s.transitionEmitted.CompareAndSwap(false, true) {

high

The defensive check for a nil uploaded atomic pointer was removed in this refactoring. While the provider is typically initialized with this atomic, the original implementation included this check to prevent panics in edge cases where the routing might not provide it. Restoring the check ensures the code remains robust against variations in the initialization path.

Suggested change
- if s.uploaded.Load() && s.transitionEmitted.CompareAndSwap(false, true) {
+ if s.uploaded != nil && s.uploaded.Load() && s.transitionEmitted.CompareAndSwap(false, true) {

Comment on lines +134 to +136
if res.hit {
return res.value, err
}

high

A transition to storage that occurs during the tryPeer call will be missed, potentially leading to path resolution errors. If the peer signals a transition during the read attempt, tryPeer returns a non-hit result, and the code proceeds to open the base storage using the stale compression type from the current frameTable. Re-checking the transition state after the peer attempt ensures the caller is correctly prompted to refresh its header and compression state when a transition is detected mid-call.

	if res.hit {
		return res.value, err
	}

	if s.uploaded != nil && s.uploaded.Load() && s.transitionEmitted.CompareAndSwap(false, true) {
		return nil, &storage.PeerTransitionedError{}
	}

Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go