rpc: reduce small-message overhead and tune cache probing by Donovoi · Pull Request #24122 · ggml-org/llama.cpp

Donovoi · 2026-06-04T12:34:37Z

Summary

coalesce small RPC command/header writes to reduce TCP message overhead
make the RPC tensor cache probe/save threshold configurable via GGML_RPC_CACHE_MIN_SIZE
avoid client hash probes when the server reports that it has no cache
add local and remote RPC benchmark helpers, including per-case env overrides for one-sided experiments
document TCP buffer tuning and separate client-side cache probing from per-node server-side cache saving
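
To make the first bullet concrete, here is a minimal Python sketch of coalescing a small command header and its payload into a single write. The wire layout (1-byte command, 8-byte little-endian length, then payload), the `COALESCE_LIMIT` value, and all function names are assumptions for illustration, not the code in this PR.

```python
import socket
import struct

# Assumed for illustration: payloads at or below this size are sent in one write.
COALESCE_LIMIT = 1024

def frame_header(cmd: int, payload_len: int) -> bytes:
    """Pack a 1-byte command and an 8-byte little-endian payload length."""
    return struct.pack("<BQ", cmd, payload_len)

def send_message(sock: socket.socket, cmd: int, payload: bytes) -> int:
    """Send one message and return the number of send calls issued."""
    header = frame_header(cmd, len(payload))
    if len(payload) <= COALESCE_LIMIT:
        # Small message: one buffer and one write, so the header and
        # payload are not handed to the TCP stack as separate tiny sends.
        sock.sendall(header + payload)
        return 1
    # Large message: skip the extra copy and send the parts separately.
    sock.sendall(header)
    sock.sendall(payload)
    return 2

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before message was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_message(sock: socket.socket):
    """Read one framed message and return (cmd, payload)."""
    cmd, size = struct.unpack("<BQ", recv_exact(sock, 9))
    return cmd, recv_exact(sock, size)
```

The receiver is unchanged by coalescing, since TCP is a byte stream and the framing is the same either way; only the number of writes on the sending side differs.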

Validation

python -m py_compile tools/rpc/bench_rpc_remote.py
tools/rpc/bench_rpc_remote.py --dry-run ... --env GGML_RPC_TCP_BUFFER_SIZE=4194304 --patch-env GGML_RPC_CACHE_MIN_SIZE=1048576 confirmed the patch-only env is recorded only for the patch case
WSL CPU/RPC build: cmake -S . -B build-rpc-cpu-check -DGGML_RPC=ON -DGGML_CUDA=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build-rpc-cpu-check --target rpc-server llama-bench
Real mixed-host RPC run, Gemma 31B BF16, p16/n16/r1, all caches warm: prompt throughput improved from 2.908077 to 3.430841 tok/s (+17.98%), decode improved from 0.411090 to 0.416935 tok/s (+1.42%), wall time was unchanged (436.15s to 436.17s)
Earlier five-worker warm run without the unstable Ubuntu node: wall time dropped 9.32%, decode throughput improved 30.40%, prompt throughput dropped 5.69%
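
The percentages for the mixed-host run can be rechecked from the raw figures quoted above. This is a small arithmetic sketch; `pct_change` is a helper name introduced here, not part of the benchmark tooling.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# Figures quoted from the mixed-host Gemma 31B BF16 run (p16/n16/r1).
prompt = pct_change(2.908077, 3.430841)   # tok/s
decode = pct_change(0.411090, 0.416935)   # tok/s
wall = pct_change(436.15, 436.17)         # seconds

print(f"prompt {prompt:+.2f}%  decode {decode:+.2f}%  wall {wall:+.3f}%")
```

The prompt and decode values round to the +17.98% and +1.42% reported above, and the wall-time change is well under a hundredth of a percent.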

Notes

Lowering the server-side cache threshold on every node is not always safe or useful on mixed hardware. The docs now describe client-side probing separately from server-side saving, and the remote benchmark helper can record patch-only client env such as GGML_RPC_CACHE_MIN_SIZE=1048576.
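
The client-side probe decision described here can be sketched as follows. Only the `GGML_RPC_CACHE_MIN_SIZE` variable name comes from this PR; the 10 MiB fallback, the function names, and the fallback behavior on invalid input are assumptions for illustration.

```python
import os

# Assumed fallback for illustration only; not confirmed by this PR.
DEFAULT_MIN_SIZE = 10 * 1024 * 1024

def cache_min_size(env=None) -> int:
    """Read the probe threshold in bytes, falling back on missing or bad input."""
    env = os.environ if env is None else env
    raw = env.get("GGML_RPC_CACHE_MIN_SIZE")
    if raw is None:
        return DEFAULT_MIN_SIZE
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MIN_SIZE
    return value if value >= 0 else DEFAULT_MIN_SIZE

def should_probe(tensor_bytes: int, server_has_cache: bool, min_size: int) -> bool:
    """Send a hash probe only when the server can answer it and the tensor is large enough."""
    return server_has_cache and tensor_bytes >= min_size
```

Because the threshold is read on the client, one coordinator can lower it for an experiment (for example to 1048576) without changing what each server node chooses to save.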

Donovoi · 2026-06-04T23:16:30Z

Follow-up network benchmark on the real mixed RPC fleet (bigboy -> toor, access, mo, ubuntu-gpu, rando, fucky) with Gemma 4 31B Heretic BF16:

Tried an extra code change using vectored header+payload sends plus a small receive-side buffer. It built, but regressed measured token throughput (p16/n64/r1: prompt -11.84%, decode -3.33%), so it was reverted and not included.
Re-tested the existing GGML_RPC_TCP_BUFFER_SIZE knob already in this PR with 4194304 bytes. Across forward/reverse paired runs, average decode improved from 0.418017 to 0.425122 tok/s (+1.70%), prompt improved from 2.975491 to 3.494679 tok/s (+17.45%), and wall time dropped from 564.47s to 540.80s (-4.19%).
Direct LAN IP endpoints plus the 4 MiB buffer were effectively flat on decode versus Tailscale hostnames (+0.02%), so I am not proposing that as a default config change.

Interpretation: the TCP buffer knob is useful as an operational tuning option, especially for prompt/wall behavior, but the follow-up did not find another code change with enough evidence to add to this PR.
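
For readers unfamiliar with what a TCP buffer knob does, the following Python sketch shows one way an environment variable could map onto socket buffer options. Only the `GGML_RPC_TCP_BUFFER_SIZE` name and the 4194304 value come from this PR; the function names and the parsing rules are assumptions, and the actual implementation lives in the RPC backend rather than in Python.

```python
import os
import socket

def tcp_buffer_size(env=None):
    """Return the requested buffer size in bytes, or None when unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("GGML_RPC_TCP_BUFFER_SIZE")
    if raw is None:
        return None
    try:
        value = int(raw)
    except ValueError:
        return None
    return value if value > 0 else None

def apply_tcp_buffer(sock: socket.socket, size: int):
    """Request send and receive buffers of the given size.

    The operating system may clamp or adjust the request, so the
    effective values are read back and returned as (sndbuf, rcvbuf).
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
```

Reading the values back matters when tuning: if the system-wide limits are lower than the request, the effective buffer can be smaller than 4 MiB and the benchmark would not be measuring the intended setting.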

Skip exporting the unchanged logits_seq input as sampled logits when a backend sampler leaves data.logits pointing at the original logits. This prevents greedy backend sampling from pulling the full logits tensor back to the host over RPC.

Add a llama-bench -bs/--backend-sampling mode that attaches a greedy backend sampler for tg/pg tests and records the mode in benchmark outputs. Document the RPC use case and the raw-logits caveat.

Evidence on bigboy coordinator with RPC workers toor/access/mo/rando/fucky:

Qwen3 0.6B split RPC0/RPC2/RPC4 1/1/0, n=128 r=5: raw 19.044610 t/s vs backend-sampling 73.717810 t/s
Qwen3.5 35B split RPC6/RPC0/RPC2/RPC4 1/8/8/8, n=16 r=1: raw 1.158185 t/s vs backend-sampling 1.773261 t/s

Assisted-by: Codex
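
A rough sketch of why greedy sampling on the backend reduces RPC traffic: the host needs only the chosen token id, not one value per vocabulary entry. The encoding used here (32-bit floats for logits, a 32-bit integer for the token id) and the function names are assumptions for illustration and do not reflect the actual RPC payloads.

```python
import struct

def raw_logits_payload(logits) -> bytes:
    """Bytes the host would receive if the full logits row were copied back."""
    return struct.pack(f"<{len(logits)}f", *logits)

def greedy_token_payload(logits) -> bytes:
    """Bytes the host would receive if the argmax were computed remotely."""
    token = max(range(len(logits)), key=logits.__getitem__)
    return struct.pack("<i", token)

# Toy four-entry vocabulary; a real vocabulary is far larger.
logits = [0.1, 2.5, -1.0, 2.4]
print(len(raw_logits_payload(logits)), "bytes raw vs",
      len(greedy_token_payload(logits)), "bytes sampled")
```

The raw payload grows linearly with vocabulary size while the sampled payload stays constant, which is consistent with the smaller model showing the larger relative gain in the evidence above. The caveat is the one the commit message names: callers that need the raw logits cannot use this mode.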

github-actions bot added the examples, python, and ggml labels Jun 4, 2026

Your Name and others added 5 commits June 5, 2026 18:44

Optimize RPC framing and cache probing

5d752de

Add RPC remote benchmark and cache threshold tuning

038514c

Improve RPC remote benchmark single-case runs

4a58e20

Add RPC trace counters

3174387

Add model placement reporting

b28411a

Donovoi force-pushed the master branch from 9b656e9 to b28411a Compare June 5, 2026 09:36

Donovoi added 10 commits June 5, 2026 20:32

Add endpoint detail to RPC trace

f969cc9

Add RPC remote placement benchmark args

cc2e090

Merge remote-tracking branch 'origin/master'

ae21cbb

Add RPC placement sweep tooling

be06fa4

Report RPC remote device types

1531879

Merge remote-tracking branch 'origin/master'

1724f64

Fix RPC placement diagnostics after upstream merge

e2a0a35

Add RPC TCP I/O timeout

da35fc8

Split RPC client and server timeouts

1caa3b8

Donovoi wants to merge 15 commits into ggml-org:master from Donovoi:master
