[Experimental] Alternate signal stack to make sandbox stack overflows recoverable #398
Draft
bushidocodes wants to merge 1 commit into
Each sandbox runs on its own mmap'd native stack with a PROT_NONE guard page below it. The signal handlers were registered without an alternate signal stack, so when a sandbox overflows its stack and faults on the guard page, there is no room on the exhausted stack to deliver SIGSEGV, and the kernel kills the whole runtime instead of trapping just that sandbox. The recovery machinery already exists (current_sandbox_start calls sigsetjmp; the handler calls siglongjmp back via current_sandbox_trap); it simply cannot run because the signal cannot be delivered.

Register a per-worker alternate signal stack and flag the synchronous fault handlers (SIGSEGV, SIGFPE) SA_ONSTACK so they run on it, leaving an overflow recoverable. The asynchronous preemption signals (SIGALRM, SIGUSR1) are intentionally left off the alternate stack: they context-switch away rather than returning, which does not compose with an alternate stack, so isolating SA_ONSTACK to the fault handlers keeps the preemption path untouched.

Marked experimental: the mechanism is validated by a standalone test that reproduces the runtime's exact primitives (guard-paged mmap stack via makecontext/swapcontext, sigsetjmp recovery, handler siglongjmp), which crashes without the alt stack and recovers with it, and a normal preemptive workload shows no regression. However, a real in-runtime overflow was not exercised end to end (no existing module recurses deeply enough, and building one needs the module toolchain).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
This is interesting! We can definitely circle back to this later. So, yes, agreed to keep it open for now.
Problem
Each sandbox runs on its own `mmap`'d native stack with a `PROT_NONE` guard page below it (`wasm_stack.h`). The signal handlers are registered with only `SA_SIGINFO | SA_RESTART` and no `sigaltstack`. So when a sandbox overflows its stack and faults on the guard page, there's no room on the exhausted stack to deliver SIGSEGV, and the kernel kills the whole runtime instead of trapping just that sandbox. The recovery machinery already exists (`current_sandbox_start` does `sigsetjmp`; the handler does `current_sandbox_trap` → `siglongjmp` back to a valid part of the stack); it just can't run because the signal can't be delivered. (Addresses #290.)

Change
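- Register a per-worker alternate signal stack (`sigaltstack`) in `worker_thread_main`, before unmasking the fault signals.
- Flag the synchronous fault handlers (SIGSEGV, SIGFPE) `SA_ONSTACK`, so they're delivered on the alt stack even when the sandbox stack is exhausted → the existing `siglongjmp` recovery runs.
- Leave the asynchronous preemption signals (SIGALRM, SIGUSR1) off the alt stack: they context-switch away rather than returning, which does not compose with an alternate stack. Isolating `SA_ONSTACK` to the fault handlers keeps the preemption path untouched.

3 files, +44 lines.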
Verification
- Standalone mechanism test reproducing the runtime's exact primitives (guard-paged `mmap` stack run via `makecontext`/`swapcontext`, `sigsetjmp` recovery, handler `siglongjmp`): without the alt stack the process is killed by SIGSEGV (exit 139); with it, the overflow is `RECOVERED` (exit 0).
- Normal preemptive workload: 200s, runtime alive, no crashes; confirms the alt stack + selective `SA_ONSTACK` doesn't disturb preemption.

Why experimental / not fully verified
A real in-`sledgert` overflow was not exercised end to end: no prebuilt module recurses deeply enough to exhaust the 512 KB native stack (per #290 this is hard to trigger: wasm locals live in linear memory, so it needs deep recursion), and building a custom deeply-recursive module goes through the libsledge `.wasm.so` toolchain. The change is sound by composition (the recovery path pre-exists and already handles traps like linear-memory-OOB SIGSEGV → 500; this only adds alt-stack delivery, proven by the mechanism test), but I'd want a real overflow test before marking it non-experimental.

Relates to #290.
🤖 Generated with Claude Code