fix memory leak on lost packets#21
Draft
jeff-hykin wants to merge 4 commits into
Draft
Conversation
Mirrors the C library's `lcm_frag_buf_store` semantics from
udpm_util.{c,h}: cap the in-flight reassembly buffer at 16 MiB total
bytes and 1000 entries, evicting the least-recently-updated entry
when either limit is exceeded.
Before this change the Rust LCM port's reassembly map was an
unbounded HashMap with no eviction; one dropped UDP datagram in a
fragmented message left the other N-1 fragments stuck in the map
forever. Under sustained large-message (>65 KB) load with realistic
multicast loopback drop rates, incomplete entries accumulated, the
locked HashMap grew without bound, lock contention starved the
receive thread, and subsequent message drop rates compounded.
Discovered during DimOS PGO Rust port KITTI-360 benchmarking
(PassivePlatypus task): ~50% drop rate on 10 Hz × 500 KB
PointCloud2 publishes, with verified 128 MB SO_RCVBUF on both sides
of the channel and `set_queue_capacity(10000)` on every
subscription. Buffer-layer mitigations didn't move the drop rate
because the drops were upstream of the socket — in this map.
Implementation:
- Added `last_update: Instant` to `FragmentBuffer`. Updated on every
fragment arrival on that entry.
- New `FragStore` wrapper around the HashMap tracks `total_bytes`
alongside the entries.
- `FragStore::evict_lru()` picks the entry with the oldest
`last_update` (linear scan — matches C library's
`_find_lru_frag_buf` g_hash_table_foreach pass).
- `FragStore::enforce_caps()` evicts in a loop until both caps fit.
Called from `process_fragment` only when a new entry was just
inserted; completed-message removals already bring totals down so
no enforce is needed there.
Caps:
- `MAX_FRAG_BUF_TOTAL_BYTES = 16 MiB` (matches C
`MAX_FRAG_BUF_TOTAL_SIZE = 1 << 24`).
- `MAX_NUM_FRAG_BUFS = 1000` (matches C `MAX_NUM_FRAG_BUFS`).
Tests added (all passing):
- frag_store_evicts_when_over_byte_cap: insert >16 MiB across entries,
assert total_bytes drops to <= cap and at least one entry evicted.
- frag_store_evicts_when_over_entry_cap: insert MAX_NUM_FRAG_BUFS+5
small entries, assert count is exactly MAX_NUM_FRAG_BUFS after.
- frag_store_evict_lru_picks_oldest: two entries with stagged
last_update, verify evict_lru removes the older one.
Existing loopback round-trip tests (6 tests) still pass — eviction
doesn't kick in for happy-path traffic since reassembly entries are
typically removed on completion within microseconds.
See ~/repos/jhist/wiki/rust-lcm-fragmentation-bug.md for the full
discovery narrative + C-vs-Rust diff.
Branch: jeff/fix/rust-frag-reassembly-leak
Adds an optional `recv_buf_size: Option<usize>` field to `LcmOptions` that calls `socket2::set_recv_buffer_size` on the UDP receive socket before binding. None (default) leaves the kernel default in place (typically `net.core.rmem_default` on Linux after `BufferConfiguratorLinux` raises it to 64 MiB). Useful when the LCM client wants a deterministic, larger-than-default buffer for high-rate publishers of fragmented messages — e.g. ~500 KB PointCloud2 over LCM at 10 Hz. The matching change in `dimos/protocol/service/lcmservice.py` passes `recv_buf_size` via the URL query string for the Python LCM client; the Rust port needs an equivalent path. On Linux the kernel doubles the requested value, so requesting 64 MiB yields a usable 128 MiB. socket2 silently clamps to `rmem_max`. This is a non-breaking addition (defaults to `None`); existing `LcmOptions::default()` callers see no behavioral change.
…limits
Usage:
cargo run --release --example burst_test sender <msg_bytes> <count> <interval_ms>
cargo run --release --example burst_test receiver <count>
cargo run --release --example burst_test slow_receiver <count> <handler_ms>
Used during the dimos PGO investigation to bisect "is this LCM or the
application layer?". Findings on this host (Linux, lo interface,
SO_RCVBUF 128 MB after the URL-param fix):
500 KB × 10 Hz × 10 s — 0% drops (4.7 MB/s benchmark rate)
500 KB × 89 Hz × 3 s — 0% drops (42 MB/s)
500 KB × 10 Hz × 100 s
with 200 ms slow handler — 27.6% drops (sustained back-pressure)
500 KB × 100 msgs × 0 ms — 0% drops (213 MB/s peak burst)
Confirms LCM drops only under sustained handler back-pressure where
cumulative backlog exceeds the SO_RCVBUF. Short bursts and well-paced
streams within the kernel buffer's capacity transmit cleanly.
Useful as a regression / hardware-characterization tool for future
hosts and as a unit-style test for LCM library changes.
jeff-hykin
added a commit
to dimensionalOS/dimos
that referenced
this pull request
May 22, 2026
INPUT_CHANNEL_CAPACITY was 16. With LCM publishers at 10 Hz and any handler that processes slower than line rate (e.g. PointCloud2 decode + loop closure on 500 KB messages), the per-input mpsc fills in <2 seconds and `try_send` starts dropping new messages. This was the actual root cause of the ~50% "LCM drops" I'd been investigating on the pgo_rust KITTI-360 benchmark. The kernel reports zero drops (UDP RcvbufErrors=0, lo interface drop=0), and the LCM library was fine — the drops were here, at the Rust dimos-module layer. Empirical verification: 500-scan seq8 run with full ~500 KB clouds: - Before this change: 467/500 corrected_odometry_samples (6.6% loss) - After this change: 500/500 corrected_odometry_samples (0% loss) Bumped both INPUT_CHANNEL_CAPACITY and PUBLISH_CHANNEL_CAPACITY to 1024. At the same 10 Hz publish, the input queue now holds ~100 s of backlog which is plenty for any reasonable handler. The Rust LCM fragmentation-reassembly leak (PR dimensionalOS/dimos-lcm#21) is still a real but separate bug — it prevents memory leaks from dropped fragments but doesn't fix this drop source. Both fixes are independently useful.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When there are dropped LCM packets, the hashmap keeps data forever.
The C LCM version has an LRU eviction policy to avoid this becoming a memory leak, aso this PR ports that to Rust.
Three new tests to cover these cases.
Run the tests:
Also added a
burst_testexample to characterize how much LCM can take before dropping. Run two terminals: