Skip to content

Add Valgrind suppressions for runtime false positives#390

Open
bushidocodes wants to merge 1 commit into
masterfrom
valgrind-false-positive-suppressions
Open

Add Valgrind suppressions for runtime false positives#390
bushidocodes wants to merge 1 commit into
masterfrom
valgrind-false-positive-suppressions

Conversation

@bushidocodes

Copy link
Copy Markdown
Contributor

Summary

Running sledgert under Valgrind/Memcheck produces a flood of false Invalid read/write errors — and with preemption on and multiple workers, hundreds of millions of them, including the report tracked in #69 (software_interrupt_handle_signalspropagate_sigalrmpthread_kill). They are artifacts of three things Memcheck cannot model:

  1. Signal-based preemption delivered while a worker runs on a switched sandbox stack.
  2. User-level context switching (swapcontext / arch_context) — Valgrind prints client switching stacks? and loses track of stack memory.
  3. Direct mmap of sandbox linear memory — large guard regions marked noaccess that the JIT'd wasm module then legitimately touches.

This PR adds tooling so the runtime can be checked under Valgrind without that noise. It does not change any runtime code.

What's included

  • runtime/tools/valgrind/sledge.supp — suppressions scoped to the false-positive classes only (signal machinery, JIT'd guest code, WASI shims touching guest memory, per-worker queues read after a context switch). It does not hide errors elsewhere in the control plane.
  • runtime/tools/valgrind/run-valgrind.sh — wraps a control-plane mode (preemption off, one worker — what the existing per-test make valgrind targets use) and a preemptive mode, applying the suppressions and LD_BIND_NOW in both.
  • docs/valgrind.md — explains why the false positives occur, how to run each mode, and the small documented residual.
runtime/tools/valgrind/run-valgrind.sh -d tests/multi-tenancy-sample                 # control-plane
runtime/tools/valgrind/run-valgrind.sh -m preemptive -w 8 -d tests/multi-tenancy-sample

Why this is safe (no real bug is being hidden)

  • The pthread_t passed to pthread_kill is a valid mapped descriptor — the worker-thread array is calloc'd, asserted non-zero before use, and the timer is armed only after every worker exists. (The original 2019 bug, a near-null/uninitialised slot, was already fixed.)
  • --track-origins=yes shows every uninitialised-value report in the scheduler / run-queue / sandbox paths traces to a single origin — the hand-rolled stack setup in arch_context_init — with zero heap-allocation origins.
  • A differential leak run (20 vs 120 requests) is byte-for-byte identical: no per-request leak. The few "lost" blocks are per-thread TLS/structures that aren't freed only because the runtime is stopped by a signal and never joins its threads.

With the suppressions applied, SLEdge's own runtime code is silent under Valgrind; the only remaining reports are unanchorable libc-level sandbox-memory and host-stdio noise that scales with guest execution.

Closes #69.

🤖 Generated with Claude Code

Running sledgert under Valgrind/Memcheck reports a flood of false
"Invalid read/write" errors that are artifacts of three things Memcheck
cannot model: signal-based preemption delivered on a switched stack,
user-level context switching (swapcontext/arch_context), and direct mmap
of sandbox linear memory. With preemption on and multiple workers a
representative run produces hundreds of millions of bogus errors,
including the long-standing report in the signal-broadcast path
(software_interrupt_handle_signals -> propagate_sigalrm -> pthread_kill).

None of these indicate a real bug:
- The pthread_t passed to pthread_kill is a valid mapped descriptor; the
  worker-thread array is calloc'd, asserted non-zero before use, and the
  timer is armed only after all workers exist.
- track-origins shows every uninitialised-value report in the scheduler/
  run-queue/sandbox paths shares a single origin: the hand-rolled stack
  setup in arch_context_init (zero heap origins).
- A differential leak run (20 vs 120 requests) is byte-for-byte identical:
  no per-request leak; the few "lost" blocks are per-thread TLS/structures
  not freed because the runtime is stopped by signal, never joining
  threads.

This adds tooling so the runtime can be checked under Valgrind without
that noise:
- runtime/tools/valgrind/sledge.supp: suppressions scoped to the
  false-positive classes only (signal machinery, JIT'd guest code, WASI
  shims touching guest memory, per-worker queues read after a context
  switch). It does not hide errors elsewhere in the control plane.
- runtime/tools/valgrind/run-valgrind.sh: wraps both a clean control-plane
  mode (preemption off, one worker) and a preemptive mode, applying the
  suppressions and LD_BIND_NOW in both.
- docs/valgrind.md: explains why the false positives occur, how to run
  each mode, and the small documented residual.

With the suppressions, SLEdge's own runtime code is silent under Valgrind;
the only remaining reports are unanchorable libc-level sandbox-memory and
host-stdio noise that scales with guest execution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@emil916

emil916 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

This is great! But I can't test this at this time. If you are sure this runs smoothly, then go ahead and merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Valgrind reports Invalid Read of Size 4 at software_interrupt_handle_signals

2 participants