fix memory leak on lost packets by jeff-hykin · Pull Request #21 · dimensionalOS/dimos-lcm

jeff-hykin · 2026-05-20T17:00:40Z

When there are dropped LCM packets, the hashmap keeps data forever.

The C LCM version has an LRU eviction policy to avoid this becoming a memory leak, aso this PR ports that to Rust.

Three new tests to cover these cases.

Run the tests:

cd tools/rust/lcm
cargo test

Also added a burst_test example to characterize how much LCM can take before dropping. Run two terminals:

cargo run --release --example burst_test receiver 100
cargo run --release --example burst_test sender 500000 100 100

Mirrors the C library's `lcm_frag_buf_store` semantics from udpm_util.{c,h}: cap the in-flight reassembly buffer at 16 MiB total bytes and 1000 entries, evicting the least-recently-updated entry when either limit is exceeded. Before this change the Rust LCM port's reassembly map was an unbounded HashMap with no eviction; one dropped UDP datagram in a fragmented message left the other N-1 fragments stuck in the map forever. Under sustained large-message (>65 KB) load with realistic multicast loopback drop rates, incomplete entries accumulated, the locked HashMap grew without bound, lock contention starved the receive thread, and subsequent message drop rates compounded. Discovered during DimOS PGO Rust port KITTI-360 benchmarking (PassivePlatypus task): ~50% drop rate on 10 Hz × 500 KB PointCloud2 publishes, with verified 128 MB SO_RCVBUF on both sides of the channel and `set_queue_capacity(10000)` on every subscription. Buffer-layer mitigations didn't move the drop rate because the drops were upstream of the socket — in this map. Implementation: - Added `last_update: Instant` to `FragmentBuffer`. Updated on every fragment arrival on that entry. - New `FragStore` wrapper around the HashMap tracks `total_bytes` alongside the entries. - `FragStore::evict_lru()` picks the entry with the oldest `last_update` (linear scan — matches C library's `_find_lru_frag_buf` g_hash_table_foreach pass). - `FragStore::enforce_caps()` evicts in a loop until both caps fit. Called from `process_fragment` only when a new entry was just inserted; completed-message removals already bring totals down so no enforce is needed there. Caps: - `MAX_FRAG_BUF_TOTAL_BYTES = 16 MiB` (matches C `MAX_FRAG_BUF_TOTAL_SIZE = 1 << 24`). - `MAX_NUM_FRAG_BUFS = 1000` (matches C `MAX_NUM_FRAG_BUFS`). Tests added (all passing): - frag_store_evicts_when_over_byte_cap: insert >16 MiB across entries, assert total_bytes drops to <= cap and at least one entry evicted. - frag_store_evicts_when_over_entry_cap: insert MAX_NUM_FRAG_BUFS+5 small entries, assert count is exactly MAX_NUM_FRAG_BUFS after. - frag_store_evict_lru_picks_oldest: two entries with stagged last_update, verify evict_lru removes the older one. Existing loopback round-trip tests (6 tests) still pass — eviction doesn't kick in for happy-path traffic since reassembly entries are typically removed on completion within microseconds. See ~/repos/jhist/wiki/rust-lcm-fragmentation-bug.md for the full discovery narrative + C-vs-Rust diff. Branch: jeff/fix/rust-frag-reassembly-leak

Adds an optional `recv_buf_size: Option<usize>` field to `LcmOptions` that calls `socket2::set_recv_buffer_size` on the UDP receive socket before binding. None (default) leaves the kernel default in place (typically `net.core.rmem_default` on Linux after `BufferConfiguratorLinux` raises it to 64 MiB). Useful when the LCM client wants a deterministic, larger-than-default buffer for high-rate publishers of fragmented messages — e.g. ~500 KB PointCloud2 over LCM at 10 Hz. The matching change in `dimos/protocol/service/lcmservice.py` passes `recv_buf_size` via the URL query string for the Python LCM client; the Rust port needs an equivalent path. On Linux the kernel doubles the requested value, so requesting 64 MiB yields a usable 128 MiB. socket2 silently clamps to `rmem_max`. This is a non-breaking addition (defaults to `None`); existing `LcmOptions::default()` callers see no behavioral change.

…limits Usage: cargo run --release --example burst_test sender <msg_bytes> <count> <interval_ms> cargo run --release --example burst_test receiver <count> cargo run --release --example burst_test slow_receiver <count> <handler_ms> Used during the dimos PGO investigation to bisect "is this LCM or the application layer?". Findings on this host (Linux, lo interface, SO_RCVBUF 128 MB after the URL-param fix): 500 KB × 10 Hz × 10 s — 0% drops (4.7 MB/s benchmark rate) 500 KB × 89 Hz × 3 s — 0% drops (42 MB/s) 500 KB × 10 Hz × 100 s with 200 ms slow handler — 27.6% drops (sustained back-pressure) 500 KB × 100 msgs × 0 ms — 0% drops (213 MB/s peak burst) Confirms LCM drops only under sustained handler back-pressure where cumulative backlog exceeds the SO_RCVBUF. Short bursts and well-paced streams within the kernel buffer's capacity transmit cleanly. Useful as a regression / hardware-characterization tool for future hosts and as a unit-style test for LCM library changes.

INPUT_CHANNEL_CAPACITY was 16. With LCM publishers at 10 Hz and any handler that processes slower than line rate (e.g. PointCloud2 decode + loop closure on 500 KB messages), the per-input mpsc fills in <2 seconds and `try_send` starts dropping new messages. This was the actual root cause of the ~50% "LCM drops" I'd been investigating on the pgo_rust KITTI-360 benchmark. The kernel reports zero drops (UDP RcvbufErrors=0, lo interface drop=0), and the LCM library was fine — the drops were here, at the Rust dimos-module layer. Empirical verification: 500-scan seq8 run with full ~500 KB clouds: - Before this change: 467/500 corrected_odometry_samples (6.6% loss) - After this change: 500/500 corrected_odometry_samples (0% loss) Bumped both INPUT_CHANNEL_CAPACITY and PUBLISH_CHANNEL_CAPACITY to 1024. At the same 10 Hz publish, the input queue now holds ~100 s of backlog which is plenty for any reasonable handler. The Rust LCM fragmentation-reassembly leak (PR dimensionalOS/dimos-lcm#21) is still a real but separate bug — it prevents memory leaks from dropped fragments but doesn't fix this drop source. Both fixes are independently useful.

jeff-hykin added 3 commits May 20, 2026 09:38

jeff-hykin changed the title ~~fix: evict stale fragments from reassembly map~~ fix memory leak on lost packets May 20, 2026

Simplify comments

3555e97

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix memory leak on lost packets#21

fix memory leak on lost packets#21
jeff-hykin wants to merge 4 commits into
rust-codegenfrom
jeff/fix/rust-frag-reassembly-leak

jeff-hykin commented May 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeff-hykin commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jeff-hykin commented May 20, 2026 •

edited

Loading