rpc: reduce small-message overhead and tune cache probing #24122

Draft
Donovoi wants to merge 15 commits into ggml-org:master from Donovoi:master

Donovoi commented Jun 4, 2026

Summary

  • coalesce small RPC command/header writes to reduce TCP message overhead
  • make the RPC tensor cache probe/save threshold configurable via GGML_RPC_CACHE_MIN_SIZE
  • avoid client hash probes when the server reports that it has no cache
  • add local and remote RPC benchmark helpers, including per-case env overrides for one-sided experiments
  • document TCP buffer tuning and separate client-side cache probing from per-node server-side cache saving
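To illustrate the first bullet, here is a minimal sketch of write coalescing. The wire layout (a 1-byte command followed by an 8-byte little-endian payload size), the `COALESCE_MAX` threshold, and the function name `send_msg` are assumptions made for illustration, not the actual ggml RPC code. The only point is that packing the command, the size header, and a small payload into one buffer replaces several tiny writes with a single one.

```python
import socket
import struct

# Hypothetical threshold: payloads at or below this size ride along with the header.
COALESCE_MAX = 1024


def send_msg(sock, cmd, payload):
    """Send one command and return how many send calls were issued."""
    # Pack the command byte and payload size together instead of writing them separately.
    header = struct.pack("<BQ", cmd, len(payload))
    if len(payload) <= COALESCE_MAX:
        # Small message: one buffer, one write.
        sock.sendall(header + payload)
        return 1
    # Large message: avoid copying the payload just to save one write.
    sock.sendall(header)
    sock.sendall(payload)
    return 2
```

For large payloads the extra copy needed to build a single buffer would likely cost more than the second write saves, which is why the sketch only coalesces below a size limit.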

Validation

  • python -m py_compile tools/rpc/bench_rpc_remote.py
  • tools/rpc/bench_rpc_remote.py --dry-run ... --env GGML_RPC_TCP_BUFFER_SIZE=4194304 --patch-env GGML_RPC_CACHE_MIN_SIZE=1048576: confirmed that the patch-only env is recorded only for the patch case
  • WSL CPU/RPC build: cmake -S . -B build-rpc-cpu-check -DGGML_RPC=ON -DGGML_CUDA=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build-rpc-cpu-check --target rpc-server llama-bench
  • Real mixed-host RPC run, Gemma 31B BF16, p16/n16/r1, all caches warm: prompt throughput improved from 2.908077 to 3.430841 tok/s (+17.98%), decode improved from 0.411090 to 0.416935 tok/s (+1.42%), and wall time was essentially unchanged (436.15s to 436.17s)
  • Earlier five-worker warm run without the unstable Ubuntu node: wall time dropped 9.32%, decode throughput improved 30.40%, and prompt throughput fell 5.69%
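The dry-run check above relies on shared and patch-only environment overrides being kept apart. A minimal sketch of that merge follows; the function names `parse_kv` and `build_case_envs` and the "base" case label are assumptions for illustration and may not match the internals of bench_rpc_remote.py.

```python
def parse_kv(pairs):
    """Turn ["KEY=VALUE", ...] into a dict, splitting on the first '='."""
    env = {}
    for item in pairs:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        env[key] = value
    return env


def build_case_envs(shared, base_only=(), patch_only=()):
    """Return one env dict per benchmark case; one-sided overrides stay one-sided."""
    common = parse_kv(shared)
    return {
        "base": {**common, **parse_kv(base_only)},
        "patch": {**common, **parse_kv(patch_only)},
    }


envs = build_case_envs(
    ["GGML_RPC_TCP_BUFFER_SIZE=4194304"],
    patch_only=["GGML_RPC_CACHE_MIN_SIZE=1048576"],
)
```

With these inputs, both cases carry the TCP buffer setting, while the cache threshold appears only in `envs["patch"]`.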

Notes

Lowering the server-side cache threshold on every node is not always safe or useful on mixed hardware. The docs now describe client-side probing separately from server-side saving, and the remote benchmark helper can record patch-only client env such as GGML_RPC_CACHE_MIN_SIZE=1048576.
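The client-side decision can be sketched roughly as below. This is an illustration under stated assumptions: the function names, the `>=` comparison, and the way the fallback threshold is passed in are all hypothetical, and only the variable name GGML_RPC_CACHE_MIN_SIZE comes from this PR.

```python
import os


def cache_min_size(default, environ=os.environ):
    """Read the threshold in bytes from GGML_RPC_CACHE_MIN_SIZE, else use the caller's default."""
    raw = environ.get("GGML_RPC_CACHE_MIN_SIZE")
    if not raw:
        return default
    value = int(raw)
    if value < 0:
        raise ValueError("GGML_RPC_CACHE_MIN_SIZE must be non-negative")
    return value


def should_probe(tensor_bytes, server_has_cache, min_size):
    """Decide whether the client sends a hash probe before the tensor data."""
    # A server without a cache can never answer a probe with a hit,
    # so the probe would only add a round trip.
    if not server_has_cache:
        return False
    # Assumed comparison: tensors at or above the threshold are worth probing.
    return tensor_bytes >= min_size
```

Keeping the threshold lookup separate from the probe decision mirrors the split described above: the client decides whether to probe, and each server node decides independently what it saves.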

github-actions bot added labels on Jun 4, 2026: examples, python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)

Donovoi commented Jun 4, 2026

Follow-up network benchmark on the real mixed RPC fleet (bigboy -> toor, access, mo, ubuntu-gpu, rando, fucky) with Gemma 4 31B Heretic BF16:

  • Tried an extra code change using vectored header+payload sends plus a small receive-side buffer. It built, but regressed measured token throughput (p16/n64/r1: prompt -11.84%, decode -3.33%), so it was reverted and not included.
  • Re-tested the existing GGML_RPC_TCP_BUFFER_SIZE knob already in this PR with 4194304 bytes. Across forward/reverse paired runs, average decode improved from 0.418017 to 0.425122 tok/s (+1.70%), prompt improved from 2.975491 to 3.494679 tok/s (+17.45%), and wall time dropped from 564.47s to 540.80s (-4.19%).
  • Direct LAN IP endpoints plus the 4 MiB buffer were effectively flat on decode versus Tailscale hostnames (+0.02%), so I am not proposing that as a default config change.
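The percentages in the TCP buffer bullet are relative changes against the baseline run. The short check below recomputes them from the figures quoted above; the helper name `pct_change` is only illustrative.

```python
def pct_change(before, after):
    """Relative change of `after` versus `before`, in percent."""
    return (after - before) / before * 100.0


decode = pct_change(0.418017, 0.425122)  # tok/s
prompt = pct_change(2.975491, 3.494679)  # tok/s
wall = pct_change(564.47, 540.80)        # seconds; negative means faster

print(f"decode {decode:+.2f}%  prompt {prompt:+.2f}%  wall {wall:+.2f}%")
# prints: decode +1.70%  prompt +17.45%  wall -4.19%
```

Note the sign convention: for throughput a positive change is an improvement, while for wall time a negative change is.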

Interpretation: the TCP buffer knob is useful as an operational tuning option, especially for prompt throughput and wall time, but the follow-up did not find another code change with enough evidence to add to this PR.
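For readers unfamiliar with this kind of knob, the sketch below shows how a buffer-size variable could map onto socket options. It assumes the value is applied to both SO_SNDBUF and SO_RCVBUF; the PR's actual C++ implementation is not shown here and may differ, and the function name `apply_tcp_buffer_size` is hypothetical.

```python
import os
import socket


def apply_tcp_buffer_size(sock, environ=os.environ):
    """Apply GGML_RPC_TCP_BUFFER_SIZE (bytes) to a socket; return the requested size or None."""
    raw = environ.get("GGML_RPC_TCP_BUFFER_SIZE")
    if not raw:
        return None  # variable unset: keep the OS defaults
    size = int(raw)
    # Assumption: the same size is requested for both directions.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return size
```

The operating system may clamp or adjust the requested size (on Linux, for example, the request is capped by net.core.wmem_max and net.core.rmem_max), so reading the value back with getsockopt is the reliable way to see what actually took effect.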

Donovoi added 10 commits June 5, 2026 20:32
Skip exporting the unchanged logits_seq input as sampled logits when a backend sampler leaves data.logits pointing at the original logits. This prevents greedy backend sampling from pulling the full logits tensor back to the host over RPC.
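The check described here amounts to an identity comparison. A minimal Python sketch of the idea follows; the class and function names are hypothetical stand-ins, not the actual llama.cpp types, and only the names logits_seq and data.logits come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class SamplerData:
    # Handle to the logits the sampler leaves behind; a sampler that produces
    # new logits is assumed to replace this with a different object.
    logits: object


def sampled_logits_to_export(logits_seq, data):
    """Return the tensor to export as sampled logits, or None to skip the export."""
    # Identity, not equality: if the sampler left the original input in place,
    # exporting it would only copy the full logits back to the host.
    if data.logits is logits_seq:
        return None
    return data.logits
```

Comparing identity rather than contents keeps the check free: it never has to read the tensor, which matters when the tensor lives on a remote RPC backend.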

Add a llama-bench -bs/--backend-sampling mode that attaches a greedy backend sampler for tg/pg tests and records the mode in benchmark outputs. Document the RPC use case and the raw-logits caveat.

Evidence on bigboy coordinator with RPC workers toor/access/mo/rando/fucky: Qwen3 0.6B split RPC0/RPC2/RPC4 1/1/0, n=128 r=5, raw 19.044610 t/s vs backend-sampling 73.717810 t/s; Qwen3.5 35B split RPC6/RPC0/RPC2/RPC4 1/8/8/8, n=16 r=1, raw 1.158185 t/s vs backend-sampling 1.773261 t/s.
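Expressed as speedup factors, the throughput figures quoted above work out as follows. The helper name `speedup` is only illustrative.

```python
def speedup(raw_tps, backend_tps):
    """Throughput ratio of backend sampling over raw logits export."""
    return backend_tps / raw_tps


small = speedup(19.044610, 73.717810)  # Qwen3 0.6B, n=128 r=5
large = speedup(1.158185, 1.773261)    # Qwen3.5 35B, n=16 r=1

print(f"Qwen3 0.6B: {small:.2f}x, Qwen3.5 35B: {large:.2f}x")
# prints: Qwen3 0.6B: 3.87x, Qwen3.5 35B: 1.53x
```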

Assisted-by: Codex