feat(sim): add ring-buffer wrapper for gvsoc trace to bound disk usage #29

Open
runwangdl wants to merge 1 commit into devel from feat/gvsoc-ring-trace
@runwangdl
Owner

When debugging gvsoc hangs on Siracusa (which has no working UART) the only progress signal is gvsoc's --trace output. A full --trace=insn run on a real-size network produces tens of GB before the simulation completes — enough to fill the build host's disk on a single attempt.

What this PR adds

scripts/ring_tee.py (~170 lines, pure stdlib)

Reads stdin, rotates writes across N files of fixed size each, and discards the oldest content once keep * size is full. Three runtime features make it useful for hang debugging:

  • 5-second stderr heartbeat: distinguishes "gvsoc itself froze" (heartbeat stops because stdin went silent) from "simulated firmware deadlocked" (heartbeat continues but the simulated PC stops advancing in the trace).
  • SIGUSR1 handler: writes <prefix>.snapshot as a chronological concatenation of every file still on disk (oldest → newest), so a single kill -USR1 <pid> grabs the recent window at the moment of the hang.
  • Ring rotation: total disk use is bounded by keep * size (1500 MB with the default 3 × 500M) regardless of how long the simulation runs.
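The rotation idea can be sketched in a few lines. This is an illustrative sketch only, not the code in scripts/ring_tee.py: the class name `RingWriter` and its methods are hypothetical, and it assumes slots are named `<prefix>.<index>` and reused in round-robin order.

```python
class RingWriter:
    """Spread writes over `keep` files of at most `size` bytes, reusing the oldest slot.

    Hypothetical sketch; names and layout are assumptions, not ring_tee.py's API.
    """

    def __init__(self, prefix, size, keep):
        self.prefix, self.size, self.keep = prefix, size, keep
        self.seq = 0        # how many times we have rotated so far
        self.written = 0    # bytes written to the current slot
        self.fh = open(self._path(self.seq), "wb")

    def _path(self, seq):
        # Slot index wraps around, so at most `keep` files ever exist.
        return f"{self.prefix}.{seq % self.keep}"

    def _rotate(self):
        self.fh.close()
        self.seq += 1
        # Opening with "wb" truncates the slot, discarding the oldest content.
        self.fh = open(self._path(self.seq), "wb")
        self.written = 0

    def write(self, data):
        view = memoryview(data)
        while len(view):
            if self.written == self.size:
                self._rotate()
            chunk = view[: self.size - self.written]
            self.fh.write(chunk)
            self.written += len(chunk)
            view = view[len(chunk):]

    def close(self):
        self.fh.close()
```

Because a slot is truncated before it is reused, the files on disk never hold more than keep * size bytes in total, however much input arrives.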

cmake/simulation.cmake options

Option                  Default  Purpose
GVSOC_RING_TRACE        OFF      Enable the wrapper. Default OFF; behaviour is identical to before when not set.
GVSOC_RING_TRACE_SIZE   500M     Per-file size, K/M/G suffix accepted.
GVSOC_RING_TRACE_KEEP   3        Number of rotating files.
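For illustration, a size string with a K/M/G suffix could be parsed roughly as follows. The function name `parse_size` is hypothetical, and treating the suffixes as binary multiples (1024-based) is an assumption; the PR does not say whether ring_tee.py uses 1000 or 1024.

```python
def parse_size(text):
    """Convert '500M', '2k', '1G' or a plain integer string to a byte count.

    Assumption: suffixes are powers of 1024; the real script may differ.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)
```

Under that assumption, `parse_size("500M") * 3` gives the total disk bound for the default settings.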

When GVSOC_RING_TRACE=ON, the gvsoc_<name> target pipes gvsoc's stderr through ring_tee.py via bash process substitution.

Usage

cmake -DGVSOC_RING_TRACE=ON \
      -DGVSOC_RING_TRACE_SIZE=500M \
      -DGVSOC_RING_TRACE_KEEP=3 \
      ...
make gvsoc_resnet8_train

Trace files land at:

  • <build>/gvsoc_workdir/gv_trace.0 / .1 / .2
  • <build>/gvsoc_workdir/gv_trace.current — symlink → currently-writing file

When stuck:

kill -USR1 $(pgrep -f ring_tee.py)
# inspect <build>/gvsoc_workdir/gv_trace.snapshot (chronological full window)
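The oldest → newest ordering of the snapshot can be sketched like this. It is a hedged illustration rather than the handler in ring_tee.py: `ring_order` and `write_snapshot` are hypothetical names, and the sketch assumes slots are named `<prefix>.<index>` with `current` being the slot that is being written.

```python
import os
import shutil


def ring_order(current, keep):
    """Slot indices from oldest to newest, given the slot currently being written."""
    return [(current + 1 + i) % keep for i in range(keep)]


def write_snapshot(prefix, current, keep):
    """Concatenate every slot still on disk into <prefix>.snapshot, oldest first."""
    out = f"{prefix}.snapshot"
    with open(out, "wb") as dst:
        for idx in ring_order(current, keep):
            path = f"{prefix}.{idx}"
            if os.path.exists(path):  # early in a run, some slots do not exist yet
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, dst)
    return out
```

A real handler would presumably be registered with `signal.signal(signal.SIGUSR1, ...)` and would flush the file being written before copying, so that the snapshot includes the most recent lines.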

Known limitations (follow-up)

  1. This gvsoc build emits --trace=insn output on stdout, not stderr — to capture it via this wrapper you currently need a hand-written bash script that pipes stdout into ring_tee, rather than using the CMake target.
  2. The CMake macro's bash -c '...' quoting loses process substitution when the command is expanded by make's /bin/sh (in POSIX mode, >(...) is treated as a literal filename and the command aborts).

Both limitations are left to a follow-up; this PR lands the wrapper script and CMake plumbing so the infrastructure is in place. A bash wrapper script using ring_tee directly (bypassing the CMake macro) was used successfully today to debug the PR #19 graph-I/O promote hang.
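One way to sidestep process substitution altogether is to connect the two processes with an ordinary pipe. The sketch below shows the idea in Python; it is not the bash wrapper mentioned above, and `pipe_stdout` is a hypothetical helper. The gvsoc command line and ring_tee.py's arguments are not given in this PR, so both are left as caller-supplied placeholders.

```python
import subprocess


def pipe_stdout(producer_cmd, consumer_cmd):
    """Run producer_cmd with its stdout fed into consumer_cmd's stdin.

    Both argument lists are placeholders supplied by the caller; the
    producer's stderr is left attached to the terminal.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Drop our copy of the pipe so the consumer sees EOF when the producer
    # exits, and the producer gets SIGPIPE if the consumer exits first.
    producer.stdout.close()
    consumer.wait()
    return producer.wait()
```

Here `producer_cmd` would be the gvsoc invocation and `consumer_cmd` the ring_tee.py invocation. No shell is involved, so the result does not depend on whether make runs commands under bash or /bin/sh.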

When debugging gvsoc hangs on Siracusa (which has no working UART) the
only progress signal is gvsoc's --trace output. A full --trace=insn run
on a real-size network produces tens of GB before the simulation
completes -- enough to fill the build host's disk on a single attempt.

scripts/ring_tee.py is a ~170-line pure-stdlib wrapper that reads stdin,
rotates writes across N files of fixed size each, and discards the
oldest content once the keep * size quota is full. Three runtime
features make it useful for hang debugging:

  * 5-second stderr heartbeat: distinguishes "gvsoc itself froze"
    (heartbeat stops because stdin is empty) from "simulated firmware
    deadlocked" (heartbeat continues but simulated PC stays still in
    the trace).
  * SIGUSR1 handler: writes <prefix>.snapshot as a chronological
    concatenation of every file still on disk (oldest -> newest), so a
    single `kill -USR1 <pid>` grabs the recent window at the moment of
    the hang.
  * Ring rotation: total disk use is bounded by keep * size (1500 MB
    with the default 3 x 500M) regardless of how long the simulation runs.

cmake/simulation.cmake gains three options:

  GVSOC_RING_TRACE       OFF        Enable the wrapper (default OFF;
                                    behaviour identical to before when
                                    not set).
  GVSOC_RING_TRACE_SIZE  500M       Per-file size, K/M/G suffix accepted.
  GVSOC_RING_TRACE_KEEP  3          Number of rotating files.

When GVSOC_RING_TRACE=ON, the gvsoc_<name> target pipes gvsoc's stderr
through ring_tee.py via bash process substitution. Default OFF.

Known limitation: this gvsoc build (PULP siracusa target) emits its
--trace=insn output on stdout rather than stderr. To capture that
output via this wrapper you currently need to invoke gvsoc through a
hand-written bash script that pipes stdout into ring_tee, rather than
using the CMake target -- the macro's bash -c quoting also loses the
process substitution when expanded by make's /bin/sh. Both limitations
are left to a follow-up; this commit lands the wrapper script and CMake
plumbing so the infrastructure is in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
runwangdl added a commit that referenced this pull request May 15, 2026
Diagnostic wrapper used during BN promote debugging on MobileNetV1.
gvsoc's --trace=insn output is multi-GB-per-minute on real models;
this rotates writes across N fixed-size files so total disk usage
stays under keep * size regardless of run length.

Features used in this session:
  - 5s stderr heartbeat to distinguish gvsoc-frozen from
    simulated-firmware-deadlocked (heartbeat continues when sim
    is stuck but cluster_fork keeps emitting trace).
  - Ring rotation (default 3 x 500MB) so debug runs do not fill
    the build host disk.
  - SIGUSR1 snapshot handler (chronological concatenation of all
    files still on disk) for grabbing the recent window when
    gvsoc actually hangs.

Companion to PR #29 (cmake/simulation.cmake wiring). Invoked via a
hand-written bash script that pipes gvsoc stdout into ring_tee.py,
since on the Siracusa build --trace output goes to stdout rather
than stderr -- the cmake macro's bash -c quoting cannot reliably
expand process substitution under make's /bin/sh, so the direct
script path is the working entry point for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>