Hi antirez,
Thanks for implementing ssd-streaming!
I've been having a lot of fun playing around with it.
Overview
I benchmarked DSV4-Flash SSD streaming on an M4 Max / 128 GiB machine, and in addition to the
existing baseline path, prototyped and compared two alternative strategies (residency-lru / nommap).
This issue is a short report plus improvement proposals. Full numbers, plots, and implementation
notes are in the fork's README (linked at the end).
- Environment: Apple M4 Max 128 GiB / model: deepseek-v4-flash
- Quantization: q2 (actual weights 80.76 GiB, fits in RAM) / q4 (153.33 GiB, does not fit)
- Budgets: 8 / 16 / 32 / 64 / 80 GB (--ssd-streaming-cache-experts)
- Strategies compared (only the routed-expert supply path differs).
baseline / full are upstream as-is; residency-lru / nommap are added in this work:
- Upstream (unchanged):
- baseline (
cafc134) … the default SSD-streaming path (mmap + pread copy + evolved LRU)
- full … non-streaming (whole model resident). Measured for q2 only.
- Added (prototypes in this work):
- residency-lru (
11f20b1) … gen uses a zero-copy mmap view of selected experts +
residency LRU; prefill uses an F_NOCACHE pread double-buffer + MADV_DONTNEED trailing
- nommap (
cae5e95) … no mmap; preads weights from an F_NOCACHE fd into owned buffers
(no page cache)
TL;DR
Framed as characteristics:
-
The fastest case is "memory is sufficient and the GPU runs at 100%"
(the GPU never stalls waiting on weights). Since q2's full weights of 80.76 GiB fit in 128 GiB,
non-streaming full is closest to this state and is fastest. The cost: a ~28s residency build at
startup, and the whole model is wired, so it competes with other processes for RAM.
-
SSD-streaming trades speed to reduce OOM risk for long context / concurrent execution.
By not making the whole model resident it saves memory, but supplying weights on demand
costs speed. For models that don't fit in RAM (e.g. q4 at 153.33 GiB), full is infeasible,
so streaming is required.
-
There are two memory-management families:
- mmap-based (baseline / residency-lru) … rides on the OS page cache.
Fast when RAM is sufficient, but behavior depends on RAM size (won't reproduce on smaller machines).
- nommap … manages memory entirely by itself.
Uses no page cache, so it's RAM-size independent and the most robust against OOM
(details below in "nommap characteristics").
Numbers for representative configs
Because per-expert size is q2 = 6.75 MiB / q4 = 13.50 MiB (2×), the optimal budget is asymmetric
for mmap-based strategies.
- q2 hits enough experts at b32 (in part because the whole model can sit in the page cache).
Raising budget further makes the cache slab eat into the page cache and degrades prefill
(→ for mmap-based strategies, b32 is the practical sweet spot).
- q4's experts are 2× larger, so at the same budget only half as many experts stay resident as q2,
and since the model exceeds RAM, low budget isn't enough. Raising budget (more resident experts)
is faster (→ b80 is best).
Note that nommap uses no page cache, so this asymmetry does not apply: for both q2 and q4, gen
scales with budget (= number of resident experts), and higher budget is faster as far as you can
afford it (see "nommap characteristics" below).
Memory columns: wired = mandatory pinned (cannot be released) / used = total in use (partly
releasable) / file-backed = OS cache portion (mmap-derived, releasable).
q2 / budget 32GB (fits in RAM. full is the ceiling, b32 the practical best)
| config |
prefill |
gen |
wired |
used |
file-backed |
| full (non-stream) |
261.0 |
22.80 |
— |
103 |
— |
| baseline b32 |
249.3 |
19.29 |
48 |
124 |
83 |
| residency-lru b32 |
243.9 |
16.58 |
51 |
103 |
82 |
| nommap b32 |
214.6 |
12.16 |
45 |
61 |
2 |
q4 / budget 80GB (exceeds RAM, streaming required. full N/A)
| config |
prefill |
gen |
wired |
used |
file-backed |
| baseline b80 |
113.2 |
11.00 |
88 |
127 |
116 |
| residency-lru b80 |
129.2 |
11.81 |
101 |
127 |
113 |
| nommap b80 |
126.7 |
7.59 |
88 |
106 |
2 |
(t/s = prefill: cumulative avg at the 100% point of the largest prefill / gen: avg of the longest
gen sequence. Values in GiB.)
nommap characteristics (fixed memory footprint, real-machine-representative perf)
Because nommap uses no page cache, its memory footprint is predictable and the numbers reproduce
independent of RAM size.
| budget(GB) |
prefill t/s |
gen t/s |
wired |
used |
file-backed |
| 8 |
206.5 |
4.52 |
21 |
36 |
2.8 |
| 16 |
210.1 |
6.54 |
31 |
44 |
2.3 |
| 32 |
214.6 |
12.16 |
45 |
61 |
2.1 |
| 64 |
215.9 |
19.57 |
77 |
92 |
1.5 |
| 80 |
215.2 |
19.92 |
85 |
100 |
1.9 |
(q2. wired/used/file-backed in GiB. compressor = 0 / swapout = 0 across all budgets.)
- file-backed stays flat at ≈2 GiB regardless of budget … no mmap, F_NOCACHE pread, so the OS
page cache never grows. This is the substance of the "fixed memory footprint".
- used ≈ fixed base (≈11 GiB) + budget … the budget you set becomes the actual wired RAM.
→ Independent of RAM size, so these numbers reproduce on other machines (given the same budget)
= real-machine-representative performance.
- gen scales with budget (= number of resident experts), reaching near full (22.80 t/s) at b64–b80.
prefill is roughly flat at ≈206–215.
- Slow gen at low budget is not a nommap-specific weakness — it directly reflects "few resident
experts." mmap-based strategies look fast at low budget because they use the remaining RAM of a
128 GiB box as free page cache covering all experts; that speed depends on the 128 GiB premise.
On a machine with less RAM, mmap can't hold the model in page cache either and thrashes via
re-pagein (concretely: for q4 where model > RAM, baseline's DRAM BW drops to 56.8 GB/s and GPU
active to 59%). → nommap's per-budget numbers are "the honest performance for that amount of RAM,"
and being RAM-size independent gives it the best portability.
Detailed report
Full per-budget tables (prefill/gen/memory), vm_stat / mactop analysis, and per-strategy
implementation notes are here:
https://github.com/tkhr-sait/ds4/blob/ssd-streaming-brushup/ssd-streaming-memo/README.md
Improvement proposals
baseline is fast enough with no tuning (the default path). There are essentially 2 areas to improve,
and I prototyped one approach for each. Shared purely as reference (mechanism details in
"prefill speedup notes" and "implementation notes" below).
-
Addressing prefill degradation at high budget (→ residency-lru — a rough, not-yet-successful prototype)
At b64/b80, baseline's cache slab and the page cache fight over RAM → re-pagein → prefill drops.
residency-lru avoids page-cache pollution on the prefill side, giving prefill +15 to +31 vs
baseline at high budget.
That said, this is the least polished of the three, and I wouldn't call it a clean success:
- The prefill gain comes entirely from the prefill-side handling (F_NOCACHE pread
double-buffer + MADV_DONTNEED trailing), not from its namesake gen residency LRU.
- The gen residency LRU itself actually drags q2 gen below baseline (16.58 vs 19.29 at b32).
- It's the most complex to implement, and it saves memory mainly for q2 — q4 stays pinned at
used ≈ 127 GiB.
So the reusable takeaway here is the prefill-side technique; the gen residency LRU is not working
well yet and needs rework.
-
Memory management for models larger than RAM / portability to smaller machines (→ nommap)
By dropping mmap and preading via F_NOCACHE so the page cache never grows: file-backed ≈2 GiB,
RAM-size independent, most robust against OOM.
Cost: gen at low budget is the "honest performance" scaled to the number of resident experts
(see "nommap characteristics" above).
prefill speedup notes
- The reason residency-lru's prefill beats baseline is the prefill-side processing, not the gen
residency LRU. An F_NOCACHE pread double-buffer + MADV_DONTNEED trailing (2 layers behind the
encode front) avoids polluting the page cache. baseline preads experts on the regular fd, so the
prefill sweep pulls the model's pread-source pages into the page cache; at high budget this
competes with the cache slab and thrashes.
- Place MADV_DONTNEED 2 layers behind the encode front — this is the key (too close re-faults,
too far grows the footprint). It keeps the prefill sweep's file-backed footprint to ~2–3 layers.
- Async-prefetch cold ranges with F_RDADVISE (on LRU miss). The prefill cache slab is returned
to the OS on gen resume.
- nommap prefill overlap: background-prefetch the entire gate/up/down expert tensors of layer
il+1 (per-parity double-buffer), overlapping the F_NOCACHE SSD read with layer il's GEMM.
Implementation notes (non-obvious OS / Metal behavior)
These are macOS / Metal behaviors that are hard to discover without actually trying or digging
into the docs.
- F_NOCACHE requires page-aligned I/O. Interleaving short unaligned reads makes the kernel
implicitly re-enable caching and disables F_NOCACHE. Widen the request window to page boundaries
and read in a single pread (page-aligned bounce buffer, 32 MiB ceiling).
- F_NOCACHE alone does not reduce file-backed. The kernel's automatic read-ahead ignores
F_NOCACHE and puts prefetched pages into file-backed, so you must turn off F_RDAHEAD on the
model fd (this cuts gen-time file-backed growth to about 1/3).
- stdio-based reads (KV / staged reads) also land in file-backed, so they need to be bypassed
via a separate path. Whatever the prefill cold-load burst brings in remains as reclaimable cache,
since macOS provides no way to drop an fd's cache (diminishing returns).
- A Metal noCopy MTLBuffer is counted as wired for its full length (a gotcha when wrapping mmap
pages directly).
Shared as reference for the direction of SSD streaming (take speed by depending on the page cache,
or take portability / low memory by dropping the page cache).
Hi antirez,
Thanks for implementing ssd-streaming!
I've been having a lot of fun playing around with it.
Overview
I benchmarked DSV4-Flash SSD streaming on an M4 Max / 128 GiB machine, and in addition to the
existing baseline path, prototyped and compared two alternative strategies (residency-lru / nommap).
This issue is a short report plus improvement proposals. Full numbers, plots, and implementation
notes are in the fork's README (linked at the end).
baseline / full are upstream as-is; residency-lru / nommap are added in this work:
cafc134) … the default SSD-streaming path (mmap + pread copy + evolved LRU)11f20b1) … gen uses a zero-copy mmap view of selected experts +residency LRU; prefill uses an F_NOCACHE pread double-buffer + MADV_DONTNEED trailing
cae5e95) … no mmap; preads weights from an F_NOCACHE fd into owned buffers(no page cache)
TL;DR
Framed as characteristics:
The fastest case is "memory is sufficient and the GPU runs at 100%"
(the GPU never stalls waiting on weights). Since q2's full weights of 80.76 GiB fit in 128 GiB,
non-streaming
fullis closest to this state and is fastest. The cost: a ~28s residency build atstartup, and the whole model is wired, so it competes with other processes for RAM.
SSD-streaming trades speed to reduce OOM risk for long context / concurrent execution.
By not making the whole model resident it saves memory, but supplying weights on demand
costs speed. For models that don't fit in RAM (e.g. q4 at 153.33 GiB),
fullis infeasible,so streaming is required.
There are two memory-management families:
Fast when RAM is sufficient, but behavior depends on RAM size (won't reproduce on smaller machines).
Uses no page cache, so it's RAM-size independent and the most robust against OOM
(details below in "nommap characteristics").
Numbers for representative configs
Because per-expert size is q2 = 6.75 MiB / q4 = 13.50 MiB (2×), the optimal budget is asymmetric
for mmap-based strategies.
Raising budget further makes the cache slab eat into the page cache and degrades prefill
(→ for mmap-based strategies, b32 is the practical sweet spot).
and since the model exceeds RAM, low budget isn't enough. Raising budget (more resident experts)
is faster (→ b80 is best).
Note that nommap uses no page cache, so this asymmetry does not apply: for both q2 and q4, gen
scales with budget (= number of resident experts), and higher budget is faster as far as you can
afford it (see "nommap characteristics" below).
Memory columns: wired = mandatory pinned (cannot be released) / used = total in use (partly
releasable) / file-backed = OS cache portion (mmap-derived, releasable).
q2 / budget 32GB (fits in RAM. full is the ceiling, b32 the practical best)
q4 / budget 80GB (exceeds RAM, streaming required. full N/A)
(t/s = prefill: cumulative avg at the 100% point of the largest prefill / gen: avg of the longest
gen sequence. Values in GiB.)
nommap characteristics (fixed memory footprint, real-machine-representative perf)
Because nommap uses no page cache, its memory footprint is predictable and the numbers reproduce
independent of RAM size.
(q2. wired/used/file-backed in GiB. compressor = 0 / swapout = 0 across all budgets.)
page cache never grows. This is the substance of the "fixed memory footprint".
→ Independent of RAM size, so these numbers reproduce on other machines (given the same budget)
= real-machine-representative performance.
prefill is roughly flat at ≈206–215.
experts." mmap-based strategies look fast at low budget because they use the remaining RAM of a
128 GiB box as free page cache covering all experts; that speed depends on the 128 GiB premise.
On a machine with less RAM, mmap can't hold the model in page cache either and thrashes via
re-pagein (concretely: for q4 where model > RAM, baseline's DRAM BW drops to 56.8 GB/s and GPU
active to 59%). → nommap's per-budget numbers are "the honest performance for that amount of RAM,"
and being RAM-size independent gives it the best portability.
Detailed report
Full per-budget tables (prefill/gen/memory), vm_stat / mactop analysis, and per-strategy
implementation notes are here:
https://github.com/tkhr-sait/ds4/blob/ssd-streaming-brushup/ssd-streaming-memo/README.md
Improvement proposals
baseline is fast enough with no tuning (the default path). There are essentially 2 areas to improve,
and I prototyped one approach for each. Shared purely as reference (mechanism details in
"prefill speedup notes" and "implementation notes" below).
Addressing prefill degradation at high budget (→ residency-lru — a rough, not-yet-successful prototype)
At b64/b80, baseline's cache slab and the page cache fight over RAM → re-pagein → prefill drops.
residency-lru avoids page-cache pollution on the prefill side, giving prefill +15 to +31 vs
baseline at high budget.
That said, this is the least polished of the three, and I wouldn't call it a clean success:
double-buffer + MADV_DONTNEED trailing), not from its namesake gen residency LRU.
used ≈ 127 GiB.
So the reusable takeaway here is the prefill-side technique; the gen residency LRU is not working
well yet and needs rework.
Memory management for models larger than RAM / portability to smaller machines (→ nommap)
By dropping mmap and preading via F_NOCACHE so the page cache never grows: file-backed ≈2 GiB,
RAM-size independent, most robust against OOM.
Cost: gen at low budget is the "honest performance" scaled to the number of resident experts
(see "nommap characteristics" above).
prefill speedup notes
residency LRU. An F_NOCACHE pread double-buffer + MADV_DONTNEED trailing (2 layers behind the
encode front) avoids polluting the page cache. baseline preads experts on the regular fd, so the
prefill sweep pulls the model's pread-source pages into the page cache; at high budget this
competes with the cache slab and thrashes.
too far grows the footprint). It keeps the prefill sweep's file-backed footprint to ~2–3 layers.
to the OS on gen resume.
il+1 (per-parity double-buffer), overlapping the F_NOCACHE SSD read with layer il's GEMM.
Implementation notes (non-obvious OS / Metal behavior)
These are macOS / Metal behaviors that are hard to discover without actually trying or digging
into the docs.
implicitly re-enable caching and disables F_NOCACHE. Widen the request window to page boundaries
and read in a single pread (page-aligned bounce buffer, 32 MiB ceiling).
F_NOCACHE and puts prefetched pages into file-backed, so you must turn off F_RDAHEAD on the
model fd (this cuts gen-time file-backed growth to about 1/3).
via a separate path. Whatever the prefill cold-load burst brings in remains as reclaimable cache,
since macOS provides no way to drop an fd's cache (diminishing returns).
pages directly).
Shared as reference for the direction of SSD streaming (take speed by depending on the page cache,
or take portability / low memory by dropping the page cache).