Add Valgrind suppressions for runtime false positives by bushidocodes · Pull Request #390 · gwsystems/sledge-serverless-framework

bushidocodes · 2026-06-16T21:23:24Z

Summary

Running sledgert under Valgrind/Memcheck produces a flood of false Invalid read/write errors — and with preemption on and multiple workers, hundreds of millions of them, including the report tracked in #69 (software_interrupt_handle_signals → propagate_sigalrm → pthread_kill). They are artifacts of three things Memcheck cannot model:

Signal-based preemption delivered while a worker runs on a switched sandbox stack.
User-level context switching (swapcontext / arch_context) — Valgrind prints client switching stacks? and loses track of stack memory.
Direct mmap of sandbox linear memory — large guard regions marked noaccess that the JIT'd wasm module then legitimately touches.

This PR adds tooling so the runtime can be checked under Valgrind without that noise. It does not change any runtime code.

What's included

runtime/tools/valgrind/sledge.supp — suppressions scoped to the false-positive classes only (signal machinery, JIT'd guest code, WASI shims touching guest memory, per-worker queues read after a context switch). It does not hide errors elsewhere in the control plane.
runtime/tools/valgrind/run-valgrind.sh — wraps a control-plane mode (preemption off, one worker — what the existing per-test make valgrind targets use) and a preemptive mode, applying the suppressions and LD_BIND_NOW in both.
docs/valgrind.md — explains why the false positives occur, how to run each mode, and the small documented residual.

runtime/tools/valgrind/run-valgrind.sh -d tests/multi-tenancy-sample                 # control-plane
runtime/tools/valgrind/run-valgrind.sh -m preemptive -w 8 -d tests/multi-tenancy-sample

Why this is safe (no real bug is being hidden)

The pthread_t passed to pthread_kill is a valid mapped descriptor — the worker-thread array is calloc'd, asserted non-zero before use, and the timer is armed only after every worker exists. (The original 2019 bug, a near-null/uninitialised slot, was already fixed.)
--track-origins=yes shows every uninitialised-value report in the scheduler / run-queue / sandbox paths traces to a single origin — the hand-rolled stack setup in arch_context_init — with zero heap-allocation origins.
A differential leak run (20 vs 120 requests) is byte-for-byte identical: no per-request leak. The few "lost" blocks are per-thread TLS/structures that aren't freed only because the runtime is stopped by a signal and never joins its threads.

With the suppressions applied, SLEdge's own runtime code is silent under Valgrind; the only remaining reports are unanchorable libc-level sandbox-memory and host-stdio noise that scales with guest execution.

Closes #69.

🤖 Generated with Claude Code

Running sledgert under Valgrind/Memcheck reports a flood of false "Invalid read/write" errors that are artifacts of three things Memcheck cannot model: signal-based preemption delivered on a switched stack, user-level context switching (swapcontext/arch_context), and direct mmap of sandbox linear memory. With preemption on and multiple workers a representative run produces hundreds of millions of bogus errors, including the long-standing report in the signal-broadcast path (software_interrupt_handle_signals -> propagate_sigalrm -> pthread_kill). None of these indicate a real bug: - The pthread_t passed to pthread_kill is a valid mapped descriptor; the worker-thread array is calloc'd, asserted non-zero before use, and the timer is armed only after all workers exist. - track-origins shows every uninitialised-value report in the scheduler/ run-queue/sandbox paths shares a single origin: the hand-rolled stack setup in arch_context_init (zero heap origins). - A differential leak run (20 vs 120 requests) is byte-for-byte identical: no per-request leak; the few "lost" blocks are per-thread TLS/structures not freed because the runtime is stopped by signal, never joining threads. This adds tooling so the runtime can be checked under Valgrind without that noise: - runtime/tools/valgrind/sledge.supp: suppressions scoped to the false-positive classes only (signal machinery, JIT'd guest code, WASI shims touching guest memory, per-worker queues read after a context switch). It does not hide errors elsewhere in the control plane. - runtime/tools/valgrind/run-valgrind.sh: wraps both a clean control-plane mode (preemption off, one worker) and a preemptive mode, applying the suppressions and LD_BIND_NOW in both. - docs/valgrind.md: explains why the false positives occur, how to run each mode, and the small documented residual. With the suppressions, SLEdge's own runtime code is silent under Valgrind; the only remaining reports are unanchorable libc-level sandbox-memory and host-stdio noise that scales with guest execution. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

emil916 · 2026-06-22T00:00:35Z

This is great! But I can't test this at this time. If you are sure this runs smoothly, then go ahead and merge.

bushidocodes mentioned this pull request Jun 16, 2026

Valgrind reports Invalid Read of Size 4 at software_interrupt_handle_signals #69

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Valgrind suppressions for runtime false positives#390

Add Valgrind suppressions for runtime false positives#390
bushidocodes wants to merge 1 commit into
masterfrom
valgrind-false-positive-suppressions

bushidocodes commented Jun 16, 2026

Uh oh!

emil916 commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bushidocodes commented Jun 16, 2026

Summary

What's included

Why this is safe (no real bug is being hidden)

Uh oh!

emil916 commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants