Add Valgrind suppressions for runtime false positives#390
Open
bushidocodes wants to merge 1 commit into
Open
Conversation
Running sledgert under Valgrind/Memcheck reports a flood of false "Invalid read/write" errors that are artifacts of three things Memcheck cannot model: signal-based preemption delivered on a switched stack, user-level context switching (swapcontext/arch_context), and direct mmap of sandbox linear memory. With preemption on and multiple workers a representative run produces hundreds of millions of bogus errors, including the long-standing report in the signal-broadcast path (software_interrupt_handle_signals -> propagate_sigalrm -> pthread_kill). None of these indicate a real bug: - The pthread_t passed to pthread_kill is a valid mapped descriptor; the worker-thread array is calloc'd, asserted non-zero before use, and the timer is armed only after all workers exist. - track-origins shows every uninitialised-value report in the scheduler/ run-queue/sandbox paths shares a single origin: the hand-rolled stack setup in arch_context_init (zero heap origins). - A differential leak run (20 vs 120 requests) is byte-for-byte identical: no per-request leak; the few "lost" blocks are per-thread TLS/structures not freed because the runtime is stopped by signal, never joining threads. This adds tooling so the runtime can be checked under Valgrind without that noise: - runtime/tools/valgrind/sledge.supp: suppressions scoped to the false-positive classes only (signal machinery, JIT'd guest code, WASI shims touching guest memory, per-worker queues read after a context switch). It does not hide errors elsewhere in the control plane. - runtime/tools/valgrind/run-valgrind.sh: wraps both a clean control-plane mode (preemption off, one worker) and a preemptive mode, applying the suppressions and LD_BIND_NOW in both. - docs/valgrind.md: explains why the false positives occur, how to run each mode, and the small documented residual. With the suppressions, SLEdge's own runtime code is silent under Valgrind; the only remaining reports are unanchorable libc-level sandbox-memory and host-stdio noise that scales with guest execution. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
|
This is great! But I can't test this at this time. If you are sure this runs smoothly, then go ahead and merge. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Running
sledgertunder Valgrind/Memcheck produces a flood of falseInvalid read/writeerrors — and with preemption on and multiple workers, hundreds of millions of them, including the report tracked in #69 (software_interrupt_handle_signals→propagate_sigalrm→pthread_kill). They are artifacts of three things Memcheck cannot model:swapcontext/arch_context) — Valgrind printsclient switching stacks?and loses track of stack memory.mmapof sandbox linear memory — large guard regions markednoaccessthat the JIT'd wasm module then legitimately touches.This PR adds tooling so the runtime can be checked under Valgrind without that noise. It does not change any runtime code.
What's included
runtime/tools/valgrind/sledge.supp— suppressions scoped to the false-positive classes only (signal machinery, JIT'd guest code, WASI shims touching guest memory, per-worker queues read after a context switch). It does not hide errors elsewhere in the control plane.runtime/tools/valgrind/run-valgrind.sh— wraps acontrol-planemode (preemption off, one worker — what the existing per-testmake valgrindtargets use) and apreemptivemode, applying the suppressions andLD_BIND_NOWin both.docs/valgrind.md— explains why the false positives occur, how to run each mode, and the small documented residual.runtime/tools/valgrind/run-valgrind.sh -d tests/multi-tenancy-sample # control-plane runtime/tools/valgrind/run-valgrind.sh -m preemptive -w 8 -d tests/multi-tenancy-sampleWhy this is safe (no real bug is being hidden)
pthread_tpassed topthread_killis a valid mapped descriptor — the worker-thread array iscalloc'd, asserted non-zero before use, and the timer is armed only after every worker exists. (The original 2019 bug, a near-null/uninitialised slot, was already fixed.)--track-origins=yesshows every uninitialised-value report in the scheduler / run-queue / sandbox paths traces to a single origin — the hand-rolled stack setup inarch_context_init— with zero heap-allocation origins.With the suppressions applied, SLEdge's own runtime code is silent under Valgrind; the only remaining reports are unanchorable libc-level sandbox-memory and host-stdio noise that scales with guest execution.
Closes #69.
🤖 Generated with Claude Code