[Experimental] Alternate signal stack to make sandbox stack overflows recoverable #398

Draft
bushidocodes wants to merge 1 commit into master from experimental/sigaltstack-recoverable-overflow

@bushidocodes

Contributor

Draft / Experimental. The mechanism is validated, but a real in-runtime sandbox overflow has not been exercised end to end (see Verification). Opening for discussion before promoting.

Problem

Each sandbox runs on its own mmap'd native stack with a PROT_NONE guard page below it (wasm_stack.h). The signal handlers are registered with only SA_SIGINFO | SA_RESTART and no sigaltstack. So when a sandbox overflows its stack and faults on the guard page, there's no room on the exhausted stack to deliver SIGSEGV, and the kernel kills the whole runtime instead of trapping just that sandbox. The recovery machinery already exists (current_sandbox_start does sigsetjmp; the handler siglongjmps back to a valid part of the stack via current_sandbox_trap); it just can't run because the signal can't be delivered. (Addresses #290.)

Change

  • Each worker registers a per-thread alternate signal stack (sigaltstack) in worker_thread_main, before unmasking the fault signals.
  • The synchronous fault signals SIGSEGV and SIGFPE are registered SA_ONSTACK, so they're delivered on the alt stack even when the sandbox stack is exhausted → the existing siglongjmp recovery runs.
  • SIGALRM / SIGUSR1 are intentionally left off the alt stack. They context-switch away (non-local exit) rather than returning, which doesn't compose with an alternate stack; isolating SA_ONSTACK to the fault handlers keeps the preemption path untouched.

3 files, +44 lines.

Verification

  • Mechanism proof — a standalone test reproducing the runtime's exact primitives (guard-paged mmap stack run via makecontext/swapcontext, sigsetjmp recovery, handler siglongjmp): without the alt stack the process is killed by SIGSEGV (exit 139); with it, the overflow is RECOVERED (exit 0).
  • No regression — normal workload (empty, EDF + preemption, 20,000 requests) → 20,000/20,000 200s, runtime alive, no crashes; confirms the alt stack + selective SA_ONSTACK doesn't disturb preemption.
  • Builds clean.

Why experimental / not fully verified

A real in-sledgert overflow was not exercised end to end: no prebuilt module recurses deeply enough to exhaust the 512 KB native stack (per #290 this is hard to trigger — wasm locals live in linear memory, so it needs deep recursion), and building a custom deeply-recursive module goes through the libsledge .wasm.so toolchain. The change is sound by composition (the recovery path pre-exists and already handles traps like linear-memory-OOB SIGSEGV → 500; this only adds alt-stack delivery, proven by the mechanism test), but I'd want a real overflow test before marking it non-experimental.

Relates to #290.

🤖 Generated with Claude Code

Each sandbox runs on its own mmap'd native stack with a PROT_NONE guard
page below it. The signal handlers were registered without an alternate
signal stack, so when a sandbox overflows its stack and faults on the
guard page there is no room on the exhausted stack to deliver SIGSEGV --
the kernel kills the whole runtime instead of trapping just that sandbox.
The recovery machinery already exists (current_sandbox_start sigsetjmps;
the handler siglongjmps back via current_sandbox_trap); it simply cannot
run because the signal cannot be delivered.

Register a per-worker alternate signal stack and flag the synchronous
fault handlers (SIGSEGV, SIGFPE) SA_ONSTACK so they run on it, leaving an
overflow recoverable. The asynchronous preemption signals (SIGALRM,
SIGUSR1) are intentionally left off the alternate stack: they context-
switch away rather than returning, which does not compose with an
alternate stack, so isolating SA_ONSTACK to the fault handlers keeps the
preemption path untouched.

Marked experimental: the mechanism is validated by a standalone test that
reproduces the runtime's exact primitives (guard-paged mmap stack via
makecontext/swapcontext, sigsetjmp recovery, handler siglongjmp) -- which
crashes without the alt stack and recovers with it -- and a normal
preemptive workload shows no regression, but a real in-runtime overflow
was not exercised end to end (no existing module recurses deeply enough,
and building one needs the module toolchain).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@emil916

emil916 commented Jun 22, 2026

Contributor

This is interesting! We can definitely circle back to this later. So, yes, agreed to keep it open for now.
