model-loader: add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) by markkobo · Pull Request #24156 · ggml-org/llama.cpp

markkobo · 2026-06-05T05:27:43Z

Overview

After a tensor is copied out of the mmap into a separate buffer (e.g. CPU weight repacking), the read-only file-backed source pages are no longer used but still count toward RSS.

With --reclaim-mmap-source, madvise(MADV_DONTNEED) drops them; the mapping stays valid and re-faults from the file if ever touched.

Linux only, flag-gated, skipped under --mlock. The range is rounded inward to whole pages. Fixes #16761.
It saves about 13 GiB (37%) of peak RSS for the Qwen3-30B-A3B model.
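
The inward rounding mentioned above can be sketched with plain shell arithmetic. This is only an illustration of the idea, not the code in this PR: `round_inward` is a hypothetical helper name, and the byte offsets are made up. The start of the range is rounded up and the end is rounded down, so only pages that lie entirely inside the copied-out range would be passed to madvise, which requires a page-aligned start address.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of llama.cpp): round the byte range
# [first, last) inward to whole pages.
# Prints "start end" of the page-aligned sub-range, or nothing when no
# whole page fits inside the range.
round_inward() {
    local first=$1 last=$2 page=${3:-$(getconf PAGESIZE)}
    local start=$(( (first + page - 1) / page * page ))  # round the start UP
    local end=$(( last / page * page ))                  # round the end DOWN
    if (( start < end )); then
        echo "$start $end"
    fi
}

round_inward 100 10000 4096   # prints "4096 8192"
round_inward 100 5000 4096    # prints nothing: no whole page fits
```

Partial pages at either edge are left resident under this scheme, so the advice never touches a page that is only partly covered by the copied-out range.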

The flag defaults to OFF. Since this is a bug fix for issue #16761, I'm happy to flip the default to ON and/or gate it behind an environment variable that is easy to remove later.

Additional information

All the Validations/Tests:

Setup

# CPU build (the repack path needs the native CPU backend)
cmake -B build -DGGML_NATIVE=ON
cmake --build build -j --target llama-completion llama-perplexity

# A repack-eligible quant under mmap (Q4_0 / Q4_K / Q8_0 / IQ4_NL ...).
# Validated on: Qwen3-30B-A3B Q4_K_M, DeepSeek-V2-Lite Q4_K_M, Llama-3.2-3B Q4_0.
MODEL=/path/to/model.gguf
T=$(nproc)

# Same args for every run (bash array → no word-splitting on the prompt)
ARGS=(-m "$MODEL" -p "The capital of France is" -n 128 --temp 0 --seed 1 -c 4096 -t "$T" --no-display-prompt)

# Sanity: confirm the repack path actually fires for this model/CPU
./build/bin/llama-completion "${ARGS[@]}" 2>&1 >/dev/null | grep -i "CPU_REPACK model buffer"
# e.g. "load_tensors: CPU_REPACK model buffer size = 13432.50 MiB"

1. Correctness — generation is byte-identical

./build/bin/llama-completion "${ARGS[@]}"                       >off.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source >on.txt  2>/dev/null
cmp off.txt on.txt && echo "PASS: byte-identical output"

2. Correctness — perplexity is identical (strongest signal)

bash scripts/get-wikitext-2.sh   # ~280 KB
WIKI=wikitext-2-raw/wiki.test.raw
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T"
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T" --reclaim-mmap-source
# expect the SAME "Final estimate: PPL = ..." from both
# observed (Qwen3-30B-A3B Q4_K_M): PPL = 9.0814 +/- 0.21861  (on == off)
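
The comparison of the two "Final estimate" lines can be automated. The sketch below assumes each perplexity run was saved to a log with both streams captured (for example by appending `>ppl-off.log 2>&1`). The log file names and the helper names `final_ppl` and `same_ppl` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pull the last "Final estimate" line out of a saved log.
final_ppl() {
    grep -o 'Final estimate: PPL = .*' "$1" | tail -n 1
}

# Succeeds only when both logs contain a final estimate and the two lines match.
same_ppl() {
    local a b
    a=$(final_ppl "$1")
    b=$(final_ppl "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage, assuming the two logs exist:
#   same_ppl ppl-off.log ppl-on.log && echo "PASS: identical perplexity"
```

Comparing the whole line, including the error term, is stricter than comparing the PPL value alone.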

3. Memory saving + no page-fault penalty (GNU time)

/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}"                       2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source 2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"

Observed peak RSS (-c 4096), Major page faults = 0 in every run:

model	quant / arch	OFF	ON	saving	CPU_REPACK buffer
Llama-3.2-3B	Q4_0 / dense	3.73 GiB	2.30 GiB	−1.43 GiB (38%)	1.47 GiB
DeepSeek-V2-Lite	Q4_K_M / MoE	16.19 GiB	10.75 GiB	−5.44 GiB (34%)	5.58 GiB
Qwen3-30B-A3B	Q4_K_M / MoE	~34 GiB	~21 GiB	−13.0 GiB (37%)	13.4 GiB

The saving tracks the CPU_REPACK buffer size (the duplicated source). majflt=0
confirms the dropped pages are never re-faulted on the hot path.
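
For reference, the saving and percentage columns can be derived from the two "Maximum resident set size (kbytes)" values that GNU time prints. The helper below is a sketch under that assumption. `rss_saving` is a hypothetical name, and the numbers in the usage line are round illustrative values, not measurements from this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report the peak-RSS saving between two runs.
# $1 = peak RSS with the flag OFF, $2 = peak RSS with the flag ON,
# both in KiB as reported by /usr/bin/time -v.
rss_saving() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        kib_per_gib = 1024 * 1024
        saved = before - after
        printf "%.2f GiB saved (%.0f%%)\n", saved / kib_per_gib, 100 * saved / before
    }'
}

rss_saving 4194304 3145728   # prints "1.00 GiB saved (25%)"
```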

4. Inertness — flag is a no-op under `--no-mmap` and `--mlock`

./build/bin/llama-completion "${ARGS[@]}" --no-mmap --reclaim-mmap-source >nommap.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --mlock   --reclaim-mmap-source >mlock.txt  2>/dev/null
cmp off.txt nommap.txt && cmp off.txt mlock.txt && echo "PASS: inert under --no-mmap and --mlock"
# (--no-mmap: the code path never runs; --mlock: lmlocks guard skips the madvise)

Notes for reviewers

Behavior is unchanged by construction — the flag only madvise(MADV_DONTNEED)s
the read-only file-backed source after it has been copied out; any access
re-faults identical bytes from the file.
Linux-only (#if defined(MADV_DONTNEED)); compiles to a no-op elsewhere.
Detailed before/after VMA breakdown (model_mmap 17.6 → 4.3 GiB on Qwen3) via
/proc/<pid>/smaps.
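
A per-file breakdown like the one above can be approximated by summing the Rss fields of every mapping backed by the model file. The sketch below is one way to do that, not necessarily how the numbers in this PR were produced. `file_rss_kib` is a hypothetical helper, and it assumes the model path contains no spaces.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum the resident size (KiB) of every mapping backed by a file.
# $1 = path to an smaps file (e.g. /proc/<pid>/smaps)
# $2 = absolute path of the mapped file
file_rss_kib() {
    awk -v path="$2" '
        # A VMA header line starts with an address range such as "7f00-7f10 r--p ...";
        # its sixth field is the backing file path (empty for anonymous mappings).
        /^[0-9a-f]+-[0-9a-f]+ / { in_file = ($6 == path) }
        # Inside a matching VMA, accumulate its "Rss: N kB" line.
        in_file && $1 == "Rss:" { total += $2 }
        END { print total + 0 }
    ' "$1"
}

# Usage, while the process is running:
#   file_rss_kib "/proc/$(pidof llama-completion)/smaps" "$(realpath "$MODEL")"
```

Running it once with and once without --reclaim-mmap-source, after the model has loaded, should show how much of the file mapping is still resident in each case.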

Requirements

I have read and agree with the contributing guidelines: YES
AI usage disclosure: YES. AI helped me trace the code, generate the flag plumbing from the existing use_mmap pattern, run the benchmarks, and draft comments.

I edited, reviewed, and refactored all the code changes in this PR and accept full responsibility.


ggml-gh-bot · 2026-06-05T05:32:08Z

Hi @markkobo, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

markkobo · 2026-06-05T13:50:27Z

Hi @markkobo, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

AI usage disclosure:
YES, AI helped trace the code, generate the flag plumbing from the existing use_mmap pattern, run the benchmarks, draft docs.

markkobo requested review from a team, CISC and ggerganov as code owners June 5, 2026 05:27

markkobo changed the title ~~FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap p…~~ FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap pages Jun 5, 2026

markkobo changed the title ~~FIX #16761 - model-loader : add --reclaim-mmap-source to drop dormant mmap pages~~ model-loader : add --reclaim-mmap-source to drop dormant mmap pages (FIX #16761) Jun 5, 2026

markkobo mentioned this pull request Jun 5, 2026

Misc. bug: memory usage increased when using repacking #16761

Closed

markkobo changed the title ~~model-loader : add --reclaim-mmap-source to drop dormant mmap pages (FIX #16761)~~ model-loader : add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) Jun 5, 2026

markkobo changed the title ~~model-loader : add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761)~~ model-loader: add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761) Jun 5, 2026

markkobo wants to merge 1 commit into ggml-org:master from markkobo:feature/cpu-repack-reclaim
