model-loader: add --reclaim-mmap-source to drop dormant mmap pages (Fixes #16761)#24156
Open
markkobo wants to merge 1 commit into
After a tensor is copied out of the mmap into a separate buffer (e.g. CPU weight repacking), the read-only file-backed source pages are dormant but still count toward RSS. With `--reclaim-mmap-source`, `madvise(MADV_DONTNEED)` drops them; the mapping stays valid and re-faults from the file if ever touched.

Linux only, flag-gated, skipped under `--mlock`. The range is rounded inward to whole pages. Addresses ggml-org#16761.

The flag defaults to OFF. Since this is a bug fix for ggml-org#16761, I'm happy to flip the default to ON and/or gate it behind an environment variable that can be removed cleanly later.

1. Correctness: generation is byte-identical

```bash
./build/bin/llama-completion "${ARGS[@]}" >off.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source >on.txt 2>/dev/null
cmp off.txt on.txt && echo "PASS: byte-identical output"
```

2. Correctness: perplexity is identical (strongest signal)

```bash
bash scripts/get-wikitext-2.sh   # ~280 KB
WIKI=wikitext-2-raw/wiki.test.raw
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T"
./build/bin/llama-perplexity -m "$MODEL" -f "$WIKI" --chunks 64 -t "$T" --reclaim-mmap-source
```

3. Memory saving + no page-fault penalty (GNU time)

```bash
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}" 2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
/usr/bin/time -v ./build/bin/llama-completion "${ARGS[@]}" --reclaim-mmap-source 2>&1 >/dev/null | grep -E "Maximum resident|Major .* page faults"
```

Observed peak RSS (`-c 4096`), `Major page faults = 0` in every run:

| model | quant / arch | OFF | ON | saving | CPU_REPACK buffer |
|---|---|---:|---:|---:|---:|
| Llama-3.2-3B | Q4_0 / dense | 3.73 GiB | 2.30 GiB | −1.43 GiB (38%) | 1.47 GiB |
| DeepSeek-V2-Lite | Q4_K_M / MoE | 16.19 GiB | 10.75 GiB | −5.44 GiB (34%) | 5.58 GiB |
| Qwen3-30B-A3B | Q4_K_M / MoE | ~34 GiB | ~21 GiB | −13.0 GiB (37%) | 13.4 GiB |

The saving tracks the CPU_REPACK buffer size (the duplicated source). `majflt=0` confirms the dropped pages are never re-faulted on the hot path.

4. Inertness: flag is a no-op under `--no-mmap` and `--mlock`

```bash
./build/bin/llama-completion "${ARGS[@]}" --no-mmap --reclaim-mmap-source >nommap.txt 2>/dev/null
./build/bin/llama-completion "${ARGS[@]}" --mlock --reclaim-mmap-source >mlock.txt 2>/dev/null
cmp off.txt nommap.txt && cmp off.txt mlock.txt && echo "PASS: inert under --no-mmap and --mlock"
```

Notes for reviewers

- Behavior is unchanged by construction: the flag only applies `madvise(MADV_DONTNEED)` to the read-only file-backed source after it has been copied out; any access re-faults identical bytes from the file.
- Linux-only (`#if defined(MADV_DONTNEED)`); compiles to a no-op elsewhere.
- Detailed before/after VMA breakdown (`model_mmap` 17.6 → 4.3 GiB on Qwen3) via `/proc/<pid>/smaps`.
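For reviewers who want to reproduce a per-mapping breakdown like the `model_mmap` figure, the following is a minimal sketch of summing the `Rss:` fields in smaps-formatted text. The function name `sum_rss_kib` and the substring match on the mapping's header line are assumptions made for illustration; this is not the tooling used to produce the numbers in this PR.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: sum the "Rss:" values (in KiB) of every mapping whose
// header line contains `needle` (for example, part of the model file name).
static long sum_rss_kib(std::istream & in, const std::string & needle) {
    long total = 0;
    bool match = false; // does the current mapping match `needle`?
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string key;
        if (!(ls >> key)) {
            continue; // blank line
        }
        // Header lines begin with "start-end"; attribute lines begin with "Name:".
        if (key.find('-') != std::string::npos && key.back() != ':') {
            match = line.find(needle) != std::string::npos;
        } else if (match && key == "Rss:") {
            long kib = 0;
            if (ls >> kib) {
                total += kib;
            }
        }
    }
    return total;
}
```

Opening `/proc/<pid>/smaps` with `std::ifstream` and passing it to `sum_rss_kib` once with the flag off and once with it on should give a before/after figure for the model mapping, assuming the file name appears in the mapping's header line.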
Hi @markkobo, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
AI usage disclosure:
Overview
After a tensor is copied out of the mmap into a separate buffer (e.g. CPU weight repacking), the read-only file-backed source pages are no longer used but still count toward RSS.
With `--reclaim-mmap-source`, `madvise(MADV_DONTNEED)` drops them; the mapping stays valid and re-faults from the file if ever touched.
Linux only, flag-gated, skipped under `--mlock`. The range is rounded inward to whole pages. Fixes #16761.
This saves about 13.0 GiB (37%) of RSS for the Qwen3-30B-A3B model.
The flag defaults to OFF. Since this is a bug fix for #16761, I'm happy to flip the default to ON and/or gate it behind an environment variable that can be removed cleanly later.
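To illustrate the "rounded inward to whole pages" step, here is a minimal sketch under stated assumptions. The helper names `page_align_inward` and `reclaim_source_range` are hypothetical and are not taken from the PR's diff; the `__linux__` guard is also an assumption. The idea is that only pages lying entirely inside the copied-out tensor range are passed to `madvise`, so bytes belonging to neighbouring tensors on a shared page are never dropped.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: shrink [addr, addr + len) to the whole pages it fully covers.
// Stores the aligned start in *out_first and returns the aligned length (0 if none).
static size_t page_align_inward(uintptr_t addr, size_t len, size_t page, uintptr_t * out_first) {
    const uintptr_t first = (addr + page - 1) / page * page; // round start up
    const uintptr_t last  = (addr + len) / page * page;      // round end down
    *out_first = first;
    return last > first ? (size_t) (last - first) : 0;
}

// Hypothetical helper: drop the fully covered pages of a file-backed, read-only range.
// Returns true on success or when there is nothing to do.
static bool reclaim_source_range(void * addr, size_t len) {
#if defined(__linux__) && defined(MADV_DONTNEED)
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    uintptr_t first = 0;
    const size_t n = page_align_inward((uintptr_t) addr, len, page, &first);
    if (n == 0) {
        return true; // range does not cover any whole page
    }
    return madvise((void *) first, n, MADV_DONTNEED) == 0;
#else
    (void) addr;
    (void) len;
    return true; // no-op where the advice is unavailable
#endif
}
```

Rounding outward instead would be unsafe only in the sense of extra re-faults for a read-only file mapping, but rounding inward keeps the behaviour conservative: a partially covered page may still hold data for a tensor that has not been copied out.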
Additional information
Validation and tests:
Setup
1. Correctness — generation is byte-identical
2. Correctness — perplexity is identical (strongest signal)
3. Memory saving + no page-fault penalty (GNU time)
Peak RSS was measured with `-c 4096`; `Major page faults = 0` in every run. The saving tracks the CPU_REPACK buffer size (the duplicated source).
`majflt=0` confirms the dropped pages are never re-faulted on the hot path.
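The same two counters can also be read from inside a process. The sketch below uses the standard `getrusage` call; the struct and function names (`usage_snapshot`, `read_usage`) are hypothetical, and the assumption that these fields correspond to the "Maximum resident set size" and "Major page faults" lines printed by GNU time is mine, not something stated in this PR.

```cpp
#include <sys/resource.h>

// Hypothetical container for the two counters of interest.
struct usage_snapshot {
    long peak_rss_kib; // ru_maxrss: peak resident set size (KiB on Linux)
    long major_faults; // ru_majflt: page faults that required I/O
};

// Hypothetical helper: fill `out` for the calling process; returns false on failure.
static bool read_usage(usage_snapshot * out) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return false;
    }
    out->peak_rss_kib = ru.ru_maxrss;
    out->major_faults = ru.ru_majflt;
    return true;
}
```

Calling `read_usage` at the end of a run with and without `--reclaim-mmap-source` would give an in-process cross-check of the figures reported by `/usr/bin/time -v`.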
4. Inertness: flag is a no-op under `--no-mmap` and `--mlock`
Notes for reviewers
- Behavior is unchanged by construction: the flag only applies `madvise(MADV_DONTNEED)` to the read-only file-backed source after it has been copied out; any access re-faults identical bytes from the file.
- Linux-only (`#if defined(MADV_DONTNEED)`); compiles to a no-op elsewhere.
- Detailed before/after VMA breakdown (`model_mmap` 17.6 → 4.3 GiB on Qwen3) via `/proc/<pid>/smaps`.
Requirements
YES, AI
I edited, reviewed, and refactored all the code changes in this PR, and accept full responsibility.