[Experimental] Alternate signal stack to make sandbox stack overflows recoverable by bushidocodes · Pull Request #398 · gwsystems/sledge-serverless-framework

bushidocodes · 2026-06-17T00:55:43Z

Draft / Experimental. The mechanism is validated, but a real in-runtime sandbox overflow has not been exercised end to end (see Verification). Opening for discussion before promoting.

Problem

Each sandbox runs on its own mmap'd native stack with a PROT_NONE guard page below it (wasm_stack.h). The signal handlers are registered with only SA_SIGINFO | SA_RESTART — no sigaltstack. So when a sandbox overflows its stack and faults on the guard page, there's no room on the exhausted stack to deliver SIGSEGV, and the kernel kills the whole runtime instead of trapping just that sandbox. The recovery machinery already exists (current_sandbox_start does sigsetjmp; the handler does current_sandbox_trap → siglongjmp back to a valid part of the stack) — it just can't run because the signal can't be delivered. (Addresses #290.)

Change

Each worker registers a per-thread alternate signal stack (sigaltstack) in worker_thread_main, before unmasking the fault signals.
The synchronous fault signals SIGSEGV and SIGFPE are registered with SA_ONSTACK, so they're delivered on the alt stack even when the sandbox stack is exhausted, which lets the existing siglongjmp recovery run.
SIGALRM / SIGUSR1 are intentionally left off the alt stack. They context-switch away (non-local exit) rather than returning, which doesn't compose with an alternate stack; isolating SA_ONSTACK to the fault handlers keeps the preemption path untouched.
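
For discussion, here is a minimal sketch of the registration pattern described above. It is not the code from this PR: install_alt_stack, register_handler, placeholder_fault_handler and the 64 KB ALT_STACK_SIZE are hypothetical names and values chosen for illustration, and it assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical size; the PR text does not state the value it uses. */
#define ALT_STACK_SIZE (64 * 1024)

/* Hypothetical helper: give the calling thread its own alternate signal
 * stack. sigaltstack() applies to the calling thread only, so each worker
 * would call this once before unmasking the fault signals. */
int install_alt_stack(void)
{
    stack_t ss;
    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = malloc(ALT_STACK_SIZE);
    if (ss.ss_sp == NULL) return -1;
    ss.ss_size  = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    return sigaction == NULL ? -1 : sigaltstack(&ss, NULL);
}

/* Hypothetical helper: register a handler, adding SA_ONSTACK only when the
 * caller asks for delivery on the alternate stack. */
int register_handler(int signo, void (*handler)(int, siginfo_t *, void *),
                     int on_alt_stack)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags     = SA_SIGINFO | SA_RESTART | (on_alt_stack ? SA_ONSTACK : 0);
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Placeholder only. Per the PR, the runtime's real handler recovers via
 * current_sandbox_trap and siglongjmp; this stub just exits. */
void placeholder_fault_handler(int signo, siginfo_t *info, void *uctx)
{
    (void)signo; (void)info; (void)uctx;
    _exit(EXIT_FAILURE);
}
```

Under these assumptions a worker would call install_alt_stack() at thread start, then register SIGSEGV and SIGFPE with on_alt_stack set to 1 and SIGALRM and SIGUSR1 with it set to 0. Note that the alternate stack is per-thread while signal dispositions set with sigaction are process-wide, which is why the stack is installed per worker.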

3 files, +44 lines.

Verification

Mechanism proof — a standalone test reproducing the runtime's exact primitives (guard-paged mmap stack run via makecontext/swapcontext, sigsetjmp recovery, handler siglongjmp): without the alt stack the process is killed by SIGSEGV (exit 139); with it, the overflow is RECOVERED (exit 0).
No regression — normal workload (empty, EDF + preemption, 20,000 requests) → 20,000/20,000 200s, runtime alive, no crashes; confirms the alt stack + selective SA_ONSTACK doesn't disturb preemption.
Builds clean.
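
For readers who want to reproduce the mechanism, the following is a rough sketch of what such a standalone test could look like. It is not the test used for this PR: every name in it (run_overflow_on_guarded_stack, sandbox_entry, recurse, on_fault) and the 64 KB alternate stack size are hypothetical, and it assumes Linux with glibc, where touching a PROT_NONE page raises SIGSEGV.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

#define SANDBOX_STACK_SIZE (512 * 1024) /* the 512 KB figure quoted in this PR */
#define ALT_STACK_SIZE     (64 * 1024)  /* hypothetical */

static sigjmp_buf            recovery_point;
static ucontext_t            main_ctx, sandbox_ctx;
static volatile sig_atomic_t recovered;

/* Recurse far deeper than the sandbox stack allows. The volatile frame keeps
 * the compiler from shrinking the frame or turning this into a loop. */
static int recurse(int remaining)
{
    volatile char frame[1024];
    frame[0] = (char)remaining;
    if (remaining == 0) return 0;
    return recurse(remaining - 1) + frame[0];
}

/* Runs on the alternate stack, then jumps back to a valid sandbox frame. */
static void on_fault(int signo, siginfo_t *info, void *uctx)
{
    (void)signo; (void)info; (void)uctx;
    siglongjmp(recovery_point, 1);
}

/* Entry point executed on the guard-paged stack. */
static void sandbox_entry(void)
{
    if (sigsetjmp(recovery_point, 1) == 0) {
        recurse(1 << 20); /* needs about 1 GB of stack, so it must overflow */
    } else {
        recovered = 1;    /* reached only through the handler's siglongjmp */
    }
    /* Returning here resumes uc_link, i.e. main_ctx. */
}

/* Returns 1 if the overflow was recovered, -1 on setup failure. */
int run_overflow_on_guarded_stack(void)
{
    long page = sysconf(_SC_PAGESIZE);

    stack_t ss;
    memset(&ss, 0, sizeof(ss));
    ss.ss_sp   = malloc(ALT_STACK_SIZE);
    ss.ss_size = ALT_STACK_SIZE;
    if (ss.ss_sp == NULL || sigaltstack(&ss, NULL) != 0) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags     = SA_SIGINFO | SA_ONSTACK; /* drop SA_ONSTACK to see the crash */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;

    /* One PROT_NONE guard page at the low end, usable stack above it. */
    size_t len    = (size_t)page + SANDBOX_STACK_SIZE;
    char  *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;
    if (mprotect(region, (size_t)page, PROT_NONE) != 0) return -1;

    if (getcontext(&sandbox_ctx) != 0) return -1;
    sandbox_ctx.uc_stack.ss_sp   = region + page;
    sandbox_ctx.uc_stack.ss_size = SANDBOX_STACK_SIZE;
    sandbox_ctx.uc_link          = &main_ctx;
    makecontext(&sandbox_ctx, sandbox_entry, 0);

    recovered = 0;
    if (swapcontext(&main_ctx, &sandbox_ctx) != 0) return -1;

    munmap(region, len);
    return recovered;
}
```

Calling run_overflow_on_guarded_stack() from main and exiting 0 when it returns 1 mirrors the "with alt stack" case above; removing SA_ONSTACK from sa_flags should reproduce the "without" case, where the kernel cannot deliver SIGSEGV and kills the process. The sketch leaks the alternate stack for brevity.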

Why experimental / not fully verified

A real in-sledgert overflow was not exercised end to end: no prebuilt module recurses deeply enough to exhaust the 512 KB native stack (per #290 this is hard to trigger — wasm locals live in linear memory, so it needs deep recursion), and building a custom deeply-recursive module goes through the libsledge .wasm.so toolchain. The change is sound by composition (the recovery path pre-exists and already handles traps like linear-memory-OOB SIGSEGV → 500; this only adds alt-stack delivery, proven by the mechanism test), but I'd want a real overflow test before marking it non-experimental.

Relates to #290.

🤖 Generated with Claude Code

Each sandbox runs on its own mmap'd native stack with a PROT_NONE guard page below it. The signal handlers were registered without an alternate signal stack, so when a sandbox overflows its stack and faults on the guard page there is no room on the exhausted stack to deliver SIGSEGV -- the kernel kills the whole runtime instead of trapping just that sandbox. The recovery machinery already exists (current_sandbox_start sigsetjmps; the handler siglongjmps back via current_sandbox_trap); it simply cannot run because the signal cannot be delivered.

Register a per-worker alternate signal stack and flag the synchronous fault handlers (SIGSEGV, SIGFPE) SA_ONSTACK so they run on it, leaving an overflow recoverable. The asynchronous preemption signals (SIGALRM, SIGUSR1) are intentionally left off the alternate stack: they context-switch away rather than returning, which does not compose with an alternate stack, so isolating SA_ONSTACK to the fault handlers keeps the preemption path untouched.

Marked experimental: the mechanism is validated by a standalone test that reproduces the runtime's exact primitives (guard-paged mmap stack via makecontext/swapcontext, sigsetjmp recovery, handler siglongjmp) -- which crashes without the alt stack and recovers with it -- and a normal preemptive workload shows no regression, but a real in-runtime overflow was not exercised end to end (no existing module recurses deeply enough, and building one needs the module toolchain).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

emil916 · 2026-06-22T05:20:17Z

This is interesting! We can definitely circle back to this later. So, yes, agreed to keep it open for now.

bushidocodes wants to merge 1 commit into master from experimental/sigaltstack-recoverable-overflow
