SSD-streaming benchmark results and improvement proposals #437

Hi antirez,
Thanks for implementing ssd-streaming!
I've been having a lot of fun playing around with it.

## Overview

I benchmarked DSV4-Flash SSD streaming on an M4 Max / 128 GiB machine, and in addition to the
existing baseline path, prototyped and compared two alternative strategies (residency-lru / nommap).
This issue is a short report plus improvement proposals. Full numbers, plots, and implementation
notes are in the fork's README (linked at the end).

- Environment: Apple M4 Max 128 GiB / model: deepseek-v4-flash
- Quantization: q2 (actual weights 80.76 GiB, fits in RAM) / q4 (153.33 GiB, does not fit)
- Budgets: 8 / 16 / 32 / 64 / 80 GB (--ssd-streaming-cache-experts)
- Strategies compared (only the routed-expert supply path differs).
  **baseline / full are upstream as-is; residency-lru / nommap are added in this work:**
  - Upstream (unchanged):
    - baseline (`cafc134`) … the default SSD-streaming path (mmap + pread copy + evolved LRU)
    - full … non-streaming (whole model resident). Measured for q2 only.
  - Added (prototypes in this work):
    - residency-lru (`11f20b1`) … gen uses a zero-copy mmap view of selected experts +
      residency LRU; prefill uses an F_NOCACHE pread double-buffer + MADV_DONTNEED trailing
    - nommap (`cae5e95`) … no mmap; preads weights from an F_NOCACHE fd into owned buffers
      (no page cache)

## TL;DR

Summarized by characteristic:

- **The fastest case is "memory is sufficient and the GPU runs at 100%"**
  (the GPU never stalls waiting on weights). Since q2's full weights of 80.76 GiB fit in 128 GiB,
  non-streaming `full` is closest to this state and is fastest. The cost: a ~28s residency build at
  startup, and the whole model is wired, so it competes with other processes for RAM.

- **SSD-streaming trades speed to reduce OOM risk for long context / concurrent execution.**
  By not making the whole model resident it saves memory, but supplying weights on demand
  costs speed. For models that don't fit in RAM (e.g. q4 at 153.33 GiB), `full` is infeasible,
  so streaming is required.

- **There are two memory-management families:**
  - **mmap-based (baseline / residency-lru) … rides on the OS page cache.**
    Fast when RAM is sufficient, but behavior depends on RAM size (won't reproduce on smaller machines).
  - **nommap … manages memory entirely by itself.**
    Uses no page cache, so it's RAM-size independent and the most robust against OOM
    (details below in "nommap characteristics").

### Numbers for representative configs

Because per-expert size is q2 = 6.75 MiB / q4 = 13.50 MiB (2×), **the optimal budget for
mmap-based strategies is asymmetric between q2 and q4.**

- q2 already gets enough expert hits at b32 (in part because the whole model can sit in the page cache).
  Raising budget further makes the cache slab eat into the page cache and degrades prefill
  (→ for mmap-based strategies, b32 is the practical sweet spot).
- q4's experts are 2× larger, so at the same budget only half as many experts stay resident as q2,
  and since the model exceeds RAM, low budget isn't enough. Raising budget (more resident experts)
  is faster (→ b80 is best).

Note that nommap uses no page cache, so this asymmetry does not apply: for both q2 and q4, gen
scales with budget (= number of resident experts), and higher budget is faster as far as you can
afford it (see "nommap characteristics" below).
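
The link between budget and resident experts is plain division. A minimal sketch, assuming the whole budget is spent on routed experts (slab overhead ignored) and reading "32 GB" as 32 GiB for illustration; `experts_in_budget` and the constant names are hypothetical, not identifiers from the repository:

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Per-expert sizes quoted above: q2 = 6.75 MiB, q4 = 13.50 MiB. */
static const uint64_t Q2_EXPERT_BYTES = 27ULL * MIB / 4ULL; /* 6.75 MiB  */
static const uint64_t Q4_EXPERT_BYTES = 27ULL * MIB / 2ULL; /* 13.50 MiB */

/* Upper bound on how many whole experts fit in a cache budget.
 * Assumption: nothing else is carved out of the budget. */
static uint64_t experts_in_budget(uint64_t budget_bytes, uint64_t expert_bytes) {
    assert(expert_bytes > 0);
    return budget_bytes / expert_bytes;
}
```

Under those assumptions, `experts_in_budget(32 * GIB, Q2_EXPERT_BYTES)` gives 4854 while the q4 equivalent gives 2427, which is the 2× gap in resident experts described above.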

Memory columns: **wired = mandatory pinned (cannot be released) / used = total in use (partly
releasable) / file-backed = OS cache portion (mmap-derived, releasable)**.

**q2 / budget 32GB (fits in RAM. full is the ceiling, b32 the practical best)**

| config | prefill (t/s) | gen (t/s) | wired (GiB) | used (GiB) | file-backed (GiB) |
| --- | --- | --- | --- | --- | --- |
| **full (non-stream)** | **261.0** | **22.80** | — | 103 | — |
| baseline b32 | 249.3 | 19.29 | 48 | 124 | 83 |
| residency-lru b32 | 243.9 | 16.58 | 51 | 103 | 82 |
| nommap b32 | 214.6 | 12.16 | 45 | 61 | **2** |

**q4 / budget 80GB (exceeds RAM, streaming required. full N/A)**

| config | prefill (t/s) | gen (t/s) | wired (GiB) | used (GiB) | file-backed (GiB) |
| --- | --- | --- | --- | --- | --- |
| baseline b80 | 113.2 | 11.00 | 88 | 127 | 116 |
| **residency-lru b80** | **129.2** | **11.81** | 101 | 127 | 113 |
| nommap b80 | 126.7 | 7.59 | 88 | 106 | **2** |

(prefill and gen are in t/s: prefill is the cumulative average at the 100% point of the largest
prefill, gen is the average over the longest gen sequence. wired / used / file-backed are in GiB.)

### nommap characteristics (fixed memory footprint, real-machine-representative perf)

Because nommap uses no page cache, its memory footprint is predictable and the numbers reproduce
independent of RAM size.

| budget (GB) | prefill (t/s) | gen (t/s) | wired (GiB) | used (GiB) | file-backed (GiB) |
| --- | --- | --- | --- | --- | --- |
| 8  | 206.5 | 4.52  | 21 | 36  | 2.8 |
| 16 | 210.1 | 6.54  | 31 | 44  | 2.3 |
| 32 | 214.6 | 12.16 | 45 | 61  | 2.1 |
| 64 | 215.9 | 19.57 | 77 | 92  | 1.5 |
| 80 | 215.2 | 19.92 | 85 | 100 | 1.9 |

(q2. wired/used/file-backed in GiB. compressor = 0 / swapout = 0 across all budgets.)

- **file-backed stays flat at ≈2 GiB regardless of budget** … no mmap, F_NOCACHE pread, so the OS
  page cache never grows. This is the substance of the "fixed memory footprint".
- **used ≈ fixed base (≈11 GiB) + budget** … the budget you set becomes the actual wired RAM.
  Because this does not depend on RAM size, the numbers should reproduce on other machines given
  the same budget; that is what "real-machine-representative" means here.
- **gen scales with budget (= number of resident experts), reaching 19.57–19.92 t/s at b64–b80,
  close to full (22.80 t/s)**. prefill is roughly flat at ≈206–216.
- **Slow gen at low budget is not a nommap-specific weakness — it directly reflects "few resident
  experts."** mmap-based strategies look fast at low budget because they use the remaining RAM of a
  128 GiB box as free page cache covering all experts; that speed depends on the 128 GiB premise.
  On a machine with less RAM, mmap can't hold the model in page cache either and thrashes via
  re-pagein (concretely: for q4 where model > RAM, baseline's DRAM BW drops to 56.8 GB/s and GPU
  active to 59%). → nommap's per-budget numbers are "the honest performance for that amount of RAM,"
  and being RAM-size independent gives it the best portability.

## Detailed report

Full per-budget tables (prefill/gen/memory), vm_stat / mactop analysis, and per-strategy
implementation notes are here:
https://github.com/tkhr-sait/ds4/blob/ssd-streaming-brushup/ssd-streaming-memo/README.md

## Improvement proposals

baseline is fast enough with no tuning (the default path). There are essentially two areas to
improve, and I prototyped one approach for each. I'm sharing them purely as reference (mechanism
details in "prefill speedup notes" and "implementation notes" below).

1. **Addressing prefill degradation at high budget (→ residency-lru — a rough, not-yet-successful prototype)**
   At b64/b80, baseline's cache slab and the page cache fight over RAM → re-pagein → prefill drops.
   residency-lru avoids page-cache pollution on the prefill side, giving prefill +15 to +31 vs
   baseline at high budget.
   That said, this is the least polished of the three, and I wouldn't call it a clean success:
   - The prefill gain comes **entirely from the prefill-side handling** (F_NOCACHE pread
     double-buffer + MADV_DONTNEED trailing), **not** from its namesake gen residency LRU.
   - The gen residency LRU itself actually drags q2 gen **below** baseline (16.58 vs 19.29 at b32).
   - It's the most complex to implement, and it saves memory mainly for q2 — q4 stays pinned at
     used ≈ 127 GiB.
   So the reusable takeaway here is the prefill-side technique; the gen residency LRU is not working
   well yet and needs rework.

2. **Memory management for models larger than RAM / portability to smaller machines (→ nommap)**
   By dropping mmap and preading via F_NOCACHE so the page cache never grows: file-backed ≈2 GiB,
   RAM-size independent, most robust against OOM.
   Cost: gen at low budget is the "honest performance" scaled to the number of resident experts
   (see "nommap characteristics" above).
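
The nommap read path in item 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `cae5e95`: `open_uncached`, `pread_full`, and `load_expert` are hypothetical names, and the 16 KiB buffer alignment is an assumed page size. `F_NOCACHE` only exists on macOS, so it is guarded and the sketch degrades to a plain cached read elsewhere.

```c
#define _DEFAULT_SOURCE /* glibc: expose pread/posix_memalign in strict modes; ignored on macOS */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a weights file and ask the kernel not to cache its reads. */
static int open_uncached(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
#ifdef F_NOCACHE
    if (fcntl(fd, F_NOCACHE, 1) == -1) { close(fd); return -1; }
#endif
    return fd;
}

/* Read exactly len bytes at offset off, retrying short reads and EINTR.
 * Returns 0 on success, -1 on error or unexpected EOF. */
static int pread_full(int fd, void *dst, size_t len, off_t off) {
    char *p = dst;
    while (len > 0) {
        ssize_t n = pread(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (n == 0) return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Load one expert into a heap buffer owned by the caller (no mmap view).
 * Assumption: 16 KiB alignment; query the real page size in production. */
static void *load_expert(int fd, off_t off, size_t len) {
    void *buf = NULL;
    assert(len > 0);
    if (posix_memalign(&buf, 16384, len) != 0) return NULL;
    if (pread_full(fd, buf, len, off) != 0) { free(buf); return NULL; }
    return buf;
}
```

Because the caller owns the buffer, the resident set is whatever the cache decides to keep and free, which is why the footprint tracks the budget rather than the machine's RAM.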

### prefill speedup notes

- **The reason residency-lru's prefill beats baseline is the prefill-side processing, not the gen
  residency LRU.** An F_NOCACHE pread double-buffer + MADV_DONTNEED trailing (2 layers behind the
  encode front) avoids polluting the page cache. baseline preads experts on the regular fd, so the
  prefill sweep pulls the model's pread-source pages into the page cache; at high budget this
  competes with the cache slab and thrashes.
- **Place MADV_DONTNEED 2 layers behind the encode front** — this is the key (too close re-faults,
  too far grows the footprint). It keeps the prefill sweep's file-backed footprint to ~2–3 layers.
- **Async-prefetch cold ranges with F_RDADVISE** (on LRU miss). The prefill cache slab is returned
  to the OS on gen resume.
- **nommap prefill overlap**: background-prefetch the entire gate/up/down expert tensors of layer
  il+1 (per-parity double-buffer), overlapping the F_NOCACHE SSD read with layer il's GEMM.
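
The trailing MADV_DONTNEED from the notes above can be sketched like this. It is a simplified illustration, not the code in `11f20b1`: `layer_range`, `trailing_layer`, `page_interior`, and `release_trailing` are hypothetical names, and the sketch assumes `base` is the page-aligned address returned by `mmap` and that each layer's experts occupy one contiguous byte range.

```c
#define _DEFAULT_SOURCE /* glibc: expose madvise in strict modes; ignored on macOS */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRAIL_LAYERS 2 /* release distance behind the encode front */

/* Hypothetical per-layer byte range inside the mapped model file. */
typedef struct { size_t off; size_t len; } layer_range;

/* Layer to release when the encode front is at `front`, or -1 if none yet. */
static int trailing_layer(int front) {
    return front >= TRAIL_LAYERS ? front - TRAIL_LAYERS : -1;
}

/* Shrink [off, off+len) inward to whole pages so that pages shared with
 * neighbouring layers are never dropped. Returns 0 if no whole page remains. */
static int page_interior(size_t off, size_t len, size_t page,
                         size_t *out_off, size_t *out_len) {
    assert(page > 0);
    size_t start = (off + page - 1) / page * page;
    size_t end = (off + len) / page * page;
    if (end <= start) return 0;
    *out_off = start;
    *out_len = end - start;
    return 1;
}

/* Call once per layer as the encode front advances. Returns madvise's result,
 * or 0 when there is nothing to release yet. */
static int release_trailing(void *base, const layer_range *layers, int front) {
    int il = trailing_layer(front);
    if (il < 0) return 0;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t off, len;
    if (!page_interior(layers[il].off, layers[il].len, page, &off, &len)) return 0;
    return madvise((char *)base + off, len, MADV_DONTNEED);
}
```

Changing `TRAIL_LAYERS` is the knob the note describes: a smaller distance risks re-faulting pages that are still needed, a larger one lets the file-backed footprint grow.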

### Implementation notes (non-obvious OS / Metal behavior)

These are macOS / Metal behaviors that are hard to discover without actually trying or digging
into the docs.

- **F_NOCACHE requires page-aligned I/O.** Interleaving short unaligned reads makes the kernel
  implicitly re-enable caching, which defeats F_NOCACHE. Widen the request window to page
  boundaries and read it in a single pread (page-aligned bounce buffer, 32 MiB ceiling).
- **F_NOCACHE alone does not reduce file-backed.** The kernel's automatic read-ahead ignores
  F_NOCACHE and puts prefetched pages into file-backed, so you must **turn off F_RDAHEAD** on the
  model fd (this cuts gen-time file-backed growth to about 1/3).
- **stdio-based reads (KV / staged reads) also land in file-backed**, so they need to be bypassed
  via a separate path. Whatever the prefill cold-load burst brings in remains as reclaimable cache,
  since **macOS provides no way to drop an fd's cache** (diminishing returns).
- **A Metal noCopy MTLBuffer is counted as wired for its full length** (a gotcha when wrapping mmap
  pages directly).
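
The first two notes can be sketched together: widening a request to page boundaries, and switching off both caching and read-ahead on the model fd. This is a sketch under assumptions rather than the prototype's code: `aligned_window`, `widen_to_pages`, `fits_bounce`, and `configure_model_fd` are hypothetical names, and the fcntl calls are guarded because `F_NOCACHE` and `F_RDAHEAD` are macOS-specific.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

#define BOUNCE_MAX (32ULL * 1024 * 1024) /* 32 MiB bounce-buffer ceiling */

/* Page-aligned read window; `skip` is where the requested bytes start inside it. */
typedef struct { uint64_t off; uint64_t len; uint64_t skip; } aligned_window;

/* Widen [off, off+len) outward to page boundaries. `page` must be a power of two. */
static aligned_window widen_to_pages(uint64_t off, uint64_t len, uint64_t page) {
    assert(page != 0 && (page & (page - 1)) == 0);
    aligned_window w;
    w.off = off & ~(page - 1);
    uint64_t end = (off + len + page - 1) & ~(page - 1);
    w.len = end - w.off;
    w.skip = off - w.off;
    return w;
}

/* True if the widened window can be served by one pread into the bounce buffer. */
static int fits_bounce(aligned_window w) {
    return w.len <= BOUNCE_MAX;
}

/* Disable caching and kernel read-ahead on the model fd.
 * On platforms without these fcntl commands this is a no-op. */
static int configure_model_fd(int fd) {
#if defined(F_NOCACHE) && defined(F_RDAHEAD)
    if (fcntl(fd, F_NOCACHE, 1) == -1) return -1;
    if (fcntl(fd, F_RDAHEAD, 0) == -1) return -1;
#else
    (void)fd;
#endif
    return 0;
}
```

A caller would pread `w.len` bytes at `w.off` into the page-aligned bounce buffer and then copy out the requested bytes starting at `w.skip`. The widened end may extend past end-of-file for the last tensor, so a short read there has to be tolerated.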

I'm sharing this as a reference for the direction of SSD streaming: either take speed by depending
on the page cache, or take portability and low memory by dropping it.
