model-loader: add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) #24156

Open: markkobo wants to merge 1 commit into ggml-org:master from markkobo:feature/cpu-repack-reclaim

@markkobo markkobo commented Jun 5, 2026

Overview

After a tensor is copied out of the mmap into a separate buffer (e.g. CPU weight repacking), the read-only file-backed source pages are no longer used but still count toward RSS.

With --reclaim-mmap-source, madvise(MADV_DONTNEED) drops them; the mapping stays valid and re-faults from the file if ever touched.

Linux only, flag-gated, skipped under --mlock. The range is rounded inward to whole pages. Fixes #16761.
It saves 13 GiB (37%) of RSS for the Qwen3-30B-A3B model.
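
A minimal sketch of the inward rounding mentioned above, assuming the usual madvise constraint that the start address must be page-aligned: the start is rounded up and the end is rounded down, so partial pages at either edge (which may also hold bytes of a neighboring tensor) are left alone. `round_inward` is a hypothetical helper for illustration, not code from this PR, and it works on plain byte offsets rather than real addresses.

```bash
# Hypothetical helper (not from the PR): shrink [offset, offset+size) to whole pages.
# Prints "<aligned_start> <length>"; length is 0 when no full page fits.
round_inward() {
    local offset=$1 size=$2 page=${3:-$(getconf PAGESIZE)}
    local start=$(( (offset + page - 1) / page * page ))   # round the start UP
    local end=$(( (offset + size) / page * page ))         # round the end DOWN
    if (( end > start )); then
        echo "$start $(( end - start ))"
    else
        echo "$start 0"                                    # nothing safe to drop
    fi
}

round_inward 5000 10000 4096   # prints "8192 4096"
round_inward 100 4000 4096     # prints "4096 0"
```

The second call shows why small tensors may yield no saving at all: a range that does not cover at least one full page rounds to an empty span.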

The flag defaults to OFF. Since this is a bug fix for issue #16761, I'm happy to flip the default to ON and/or use an environment variable as a clean, easy-to-remove switch.

Additional information

All the Validations/Tests:

Setup

# CPU build (the repack path needs the native CPU backend)
cmake -B build -DGGML_NATIVE=ON
cmake --build build -j --target llama-completion llama-perplexity

# A repack-eligible quant under mmap (Q4_0 / Q4_K / Q8_0 / IQ4_NL ...).
# Validated on: Qwen3-30B-A3B Q4_K_M, DeepSeek-V2-Lite Q4_K_M, Llama-3.2-3B Q4_0.
MODEL=/path/to/model.gguf
T=$(nproc)

# Same args for every run (bash array → no word-splitting on the prompt)
ARGS=(-m "$MODEL" -p "The capital of France is" -n 128 --temp 0 --seed 1 -c 4096 -t "$T" --no-display-prompt)

# Sanity: confirm the repack path actually fires for this model/CPU
./build/bin/llama-completion "${ARGS[@]}" 2>&1 >/dev/null | grep -i "CPU_REPACK model buffer"
# e.g. "load_tensors: CPU_REPACK model buffer size = 13432.50 MiB"

1. Correctness — generation is byte-identical

./build/bin/llama-completion "${ARGS[@]}"                       >off.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source >on.txt  2>/dev/null
cmp off.txt on.txt && echo "PASS: byte-identical output"

2. Correctness — perplexity is identical (strongest signal)

bash scripts/get-wikitext-2.sh   # ~280 KB
WIKI=wikitext-2-raw/wiki.test.raw
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T"
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T" --reclaim-mmap-source
# expect the SAME "Final estimate: PPL = ..." from both
# observed (Qwen3-30B-A3B Q4_K_M): PPL = 9.0814 +/- 0.21861  (on == off)
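
To compare the two runs mechanically rather than by eye, something like the following could be used. It is only a sketch and assumes each run's combined output was saved to a log file; `final_ppl`, `same_ppl`, and the log file names are hypothetical, and the only format it relies on is the "Final estimate: PPL" line quoted above.

```bash
# Hypothetical helper: print the last "Final estimate" line found in a saved log.
final_ppl() {
    grep -F "Final estimate: PPL" "$1" | tail -n 1
}

# Succeeds only when both logs contain a final estimate and the lines are identical.
same_ppl() {
    local a b
    a=$(final_ppl "$1")
    b=$(final_ppl "$2")
    [[ -n $a && $a == "$b" ]]
}

# Usage, assuming the runs were saved with `>off.log 2>&1` and `>on.log 2>&1`:
#   same_ppl off.log on.log && echo "PASS: identical perplexity"
```

Comparing the whole line, including the `+/-` term, is stricter than comparing the rounded PPL value alone.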

3. Memory saving + no page-fault penalty (GNU time)

/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}"                       2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source 2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"

Observed peak RSS (-c 4096), Major page faults = 0 in every run:

| model | quant / arch | OFF | ON | saving | CPU_REPACK buffer |
|---|---|---:|---:|---:|---:|
| Llama-3.2-3B | Q4_0 / dense | 3.73 GiB | 2.30 GiB | −1.43 GiB (38%) | 1.47 GiB |
| DeepSeek-V2-Lite | Q4_K_M / MoE | 16.19 GiB | 10.75 GiB | −5.44 GiB (34%) | 5.58 GiB |
| Qwen3-30B-A3B | Q4_K_M / MoE | ~34 GiB | ~21 GiB | −13.0 GiB (37%) | 13.4 GiB |

The saving tracks the CPU_REPACK buffer size (the duplicated source). majflt=0
confirms the dropped pages are never re-faulted on the hot path.
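
The percentages in the table can be re-derived from the OFF and ON columns as (OFF − ON) / OFF. A small sketch, where `saving_pct` is a hypothetical helper and the inputs are the two rows with exact figures:

```bash
# Hypothetical helper: relative RSS saving in percent, given OFF and ON peaks in GiB.
saving_pct() {
    LC_ALL=C awk -v off_gib="$1" -v on_gib="$2" \
        'BEGIN { printf "%.1f\n", (off_gib - on_gib) / off_gib * 100 }'
}

saving_pct 3.73 2.30     # Llama-3.2-3B row, prints 38.3
saving_pct 16.19 10.75   # DeepSeek-V2-Lite row, prints 33.6
```

Both values round to the 38% and 34% reported above. The Qwen3 row is left out because its OFF and ON figures are only given approximately.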

4. Inertness — flag is a no-op under --no-mmap and --mlock

./build/bin/llama-completion "${ARGS[@]}" --no-mmap --reclaim-mmap-source >nommap.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --mlock   --reclaim-mmap-source >mlock.txt  2>/dev/null
cmp off.txt nommap.txt && cmp off.txt mlock.txt && echo "PASS: inert under --no-mmap and --mlock"
# (--no-mmap: the code path never runs; --mlock: lmlocks guard skips the madvise)

Notes for reviewers

  • Behavior is unchanged by construction — the flag only madvise(MADV_DONTNEED)s
    the read-only file-backed source after it has been copied out; any access
    re-faults identical bytes from the file.
  • Linux-only (#if defined(MADV_DONTNEED)); compiles to a no-op elsewhere.
  • Detailed before/after VMA breakdown (model_mmap 17.6 → 4.3 GiB on Qwen3) via
    /proc/<pid>/smaps.
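
One possible way to get such a per-mapping figure is to sum the `Rss:` fields of every VMA in an smaps dump whose header line names the model file. The sketch below assumes the standard smaps layout (an address-range header line followed by `Key: value kB` lines); `smaps_rss_kb` is a hypothetical helper, and this is not necessarily how the numbers quoted above were produced.

```bash
# Hypothetical helper: sum Rss (kB) over every VMA in an smaps dump whose
# header line contains the given path fragment.
smaps_rss_kb() {
    local smaps=$1 fragment=$2
    awk -v frag="$fragment" '
        /^[0-9a-f]+-[0-9a-f]+ / { hit = (index($0, frag) > 0); next }  # VMA header line
        hit && $1 == "Rss:"     { total += $2 }                        # value is in kB
        END                     { print total + 0 }
    ' "$smaps"
}

# Usage (the pid and model file name are placeholders):
#   smaps_rss_kb "/proc/$PID/smaps" "model.gguf"
```

Running it once without and once with --reclaim-mmap-source, after the model has loaded, should show the file-backed mapping shrinking while the repack buffer stays the same.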

Requirements

  • AI helped me trace the code and generate the flag plumbing from the existing use_mmap pattern.
  • AI helped me run the benchmarks and draft comments.

I edited, reviewed, and refactored all the code changes in this PR, and accept full responsibility.

…ages

After a tensor is copied out of the mmap into a separate buffer (e.g. CPU
weight repacking), the read-only file-backed source pages are dormant but
still sit in the RSS.

With --reclaim-mmap-source, madvise(MADV_DONTNEED) drops them; the mapping
stays valid and re-faults from the file if ever touched. Linux only,
flag-gated, skipped under --mlock. The range is rounded inward to whole
pages. Addresses ggml-org#16761.

The flag defaults to OFF. Since this is a bug fix for issue ggml-org#16761,
I'm happy to flip the default to ON and/or use an environment variable as a
clean, easy-to-remove switch.

```bash
./build/bin/llama-completion "${ARGS[@]}"                       >off.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source >on.txt  2>/dev/null
cmp off.txt on.txt && echo "PASS: byte-identical output"
```

```bash
bash scripts/get-wikitext-2.sh   # ~280 KB
WIKI=wikitext-2-raw/wiki.test.raw
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T"
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T" --reclaim-mmap-source
```

```bash
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}"                       2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source 2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
```

Observed peak RSS (`-c 4096`), `Major page faults = 0` in every run:

| model | quant / arch | OFF | ON | saving | CPU_REPACK buffer |
|---|---|---:|---:|---:|---:|
| Llama-3.2-3B | Q4_0 / dense | 3.73 GiB | 2.30 GiB | −1.43 GiB (38%) | 1.47 GiB |
| DeepSeek-V2-Lite | Q4_K_M / MoE | 16.19 GiB | 10.75 GiB | −5.44 GiB (34%) | 5.58 GiB |
| Qwen3-30B-A3B | Q4_K_M / MoE | ~34 GiB | ~21 GiB | −13.0 GiB (37%) | 13.4 GiB |

The saving tracks the CPU_REPACK buffer size (the duplicated source). `majflt=0`
confirms the dropped pages are never re-faulted on the hot path.

```bash
./build/bin/llama-completion "${ARGS[@]}" --no-mmap --reclaim-mmap-source >nommap.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --mlock   --reclaim-mmap-source >mlock.txt  2>/dev/null
cmp off.txt nommap.txt && cmp off.txt mlock.txt && echo "PASS: inert under --no-mmap and --mlock"
```

- Behavior is unchanged by construction — the flag only `madvise(MADV_DONTNEED)`s
  the read-only file-backed source after it has been copied out; any access
  re-faults identical bytes from the file.
- Linux-only (`#if defined(MADV_DONTNEED)`); compiles to a no-op elsewhere.
- Detailed before/after VMA breakdown (`model_mmap` 17.6 → 4.3 GiB on Qwen3) via
  `/proc/<pid>/smaps`.
@markkobo markkobo requested review from a team, CISC and ggerganov as code owners June 5, 2026 05:27

ggml-gh-bot Bot commented Jun 5, 2026

Hi @markkobo, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@markkobo markkobo changed the title FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap p… FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap pages Jun 5, 2026
@markkobo markkobo changed the title FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap pages model-loader : add --reclaim-mmap-source to drop dormant mmap pages (FIX #16761) Jun 5, 2026

markkobo commented Jun 5, 2026

> Hi @markkobo, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
>   • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

AI usage disclosure:
YES, AI helped trace the code, generate the flag plumbing from the existing use_mmap pattern, run the benchmarks, draft docs.

@markkobo markkobo changed the title model-loader : add --reclaim-mmap-source to drop dormant mmap pages (FIX #16761) model-loader : add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) Jun 5, 2026
@markkobo markkobo changed the title model-loader : add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) model-loader: add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) Jun 5, 2026

Development

Successfully merging this pull request may close these issues.

Misc. bug: memory usage increased when using repacking

1 participant