rpc: reduce small-message overhead and tune cache probing#24122
Draft
Donovoi wants to merge 15 commits into
Follow-up network benchmark on the real mixed RPC fleet (
Interpretation: the TCP buffer knob is useful as an operational tuning option, especially for prompt throughput and wall time, but the follow-up found no other code change with enough evidence to add to this PR.
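As an illustration of the kind of knob discussed above, here is a minimal Python sketch of applying a buffer size from the environment to a TCP socket. It assumes `GGML_RPC_TCP_BUFFER_SIZE` is a byte count applied to both the send and receive buffers; that mapping and the helper names are assumptions for illustration, not the RPC backend's actual C++ code.

```python
import socket

# Variable name taken from this PR; the semantics below are assumed.
ENV_NAME = "GGML_RPC_TCP_BUFFER_SIZE"


def parse_buffer_size(env):
    """Return the requested buffer size in bytes, or None if unset or invalid."""
    raw = env.get(ENV_NAME, "").strip()
    if not raw.isdecimal():
        return None
    size = int(raw)
    return size if size > 0 else None


def apply_buffer_size(sock, env):
    """Request send/receive buffers of the configured size (hypothetical helper)."""
    size = parse_buffer_size(env)
    if size is None:
        return None  # leave the OS defaults untouched
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    # The kernel may clamp or round the request, so report what was granted.
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )
```

Reading back the granted sizes matters when comparing runs across mixed hosts, since the operating system may not honor the requested value exactly.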
Skip exporting the unchanged logits_seq input as sampled logits when a backend sampler leaves data.logits pointing at the original logits. This prevents greedy backend sampling from pulling the full logits tensor back to the host over RPC.

Add a llama-bench -bs/--backend-sampling mode that attaches a greedy backend sampler for tg/pg tests and records the mode in benchmark outputs. Document the RPC use case and the raw-logits caveat.

Evidence on bigboy coordinator with RPC workers toor/access/mo/rando/fucky:
Qwen3 0.6B split RPC0/RPC2/RPC4 1/1/0, n=128 r=5: raw 19.044610 t/s vs backend-sampling 73.717810 t/s.
Qwen3.5 35B split RPC6/RPC0/RPC2/RPC4 1/8/8/8, n=16 r=1: raw 1.158185 t/s vs backend-sampling 1.773261 t/s.

Assisted-by: Codex
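The skip rule described above can be sketched in Python as an identity check: if the sampler left the logits reference untouched, only the sampled token needs to cross the wire. All names here (`SamplerData`, `outputs_to_export`, `bytes_per_token`) are hypothetical stand-ins rather than llama.cpp APIs, and the vocabulary size in the usage note is an illustrative round number, not a measured value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerData:
    """Hypothetical stand-in for what a backend sampler leaves behind."""
    logits: object
    sampled_token: Optional[int] = None


def outputs_to_export(original_logits, data):
    """List the outputs worth copying back to the host."""
    outputs = []
    if data.sampled_token is not None:
        outputs.append("sampled_token")
    # Identity check: an untouched reference means the sampler produced no
    # new logits, so exporting them would only re-send the input tensor.
    if data.logits is not None and data.logits is not original_logits:
        outputs.append("sampled_logits")
    return outputs


def bytes_per_token(n_vocab, export_logits, float_bytes=4, token_bytes=4):
    """Rough per-token payload: full float logits row vs a single token id."""
    return n_vocab * float_bytes if export_logits else token_bytes
```

For example, with an assumed vocabulary of 150,000 entries, `bytes_per_token(150_000, True)` is several hundred kilobytes per generated token, while `bytes_per_token(150_000, False)` is a few bytes, which is consistent with the direction of the throughput gains reported above.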
Summary
GGML_RPC_CACHE_MIN_SIZE
Validation
- `python -m py_compile tools/rpc/bench_rpc_remote.py`
- `tools/rpc/bench_rpc_remote.py --dry-run ... --env GGML_RPC_TCP_BUFFER_SIZE=4194304 --patch-env GGML_RPC_CACHE_MIN_SIZE=1048576` confirmed the patch-only env is recorded only for the patch case
- `cmake -S . -B build-rpc-cpu-check -DGGML_RPC=ON -DGGML_CUDA=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build-rpc-cpu-check --target rpc-server llama-bench`
- p16/n16/r1, all caches warm: prompt throughput improved from 2.908077 to 3.430841 tok/s (+17.98%), decode improved from 0.411090 to 0.416935 tok/s (+1.42%), wall time was unchanged (436.15s to 436.17s)
- -9.32%, decode throughput improved +30.40%, prompt throughput changed -5.69%
Notes
Lowering the server-side cache threshold on every node is not always safe or useful on mixed hardware. The docs now describe client-side probing separately from server-side saving, and the remote benchmark helper can record patch-only client environment variables such as GGML_RPC_CACHE_MIN_SIZE=1048576.
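A minimal Python sketch of the patch-only environment idea follows. It is not the code in `tools/rpc/bench_rpc_remote.py`; the function names and the `baseline`/`patch` case labels are assumptions chosen for illustration. Shared variables go to both cases, while patch-only variables are layered onto the patch case alone.

```python
def parse_assignments(items):
    """Turn ["KEY=VALUE", ...] into a dict, rejecting malformed entries."""
    env = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        env[key] = value
    return env


def build_case_envs(shared_env, patch_env):
    """Return per-case environments; patch-only keys never reach the baseline."""
    baseline = dict(shared_env)
    patch = dict(shared_env)
    patch.update(patch_env)  # patch-only values apply to the patch case only
    return {"baseline": baseline, "patch": patch}


# Example mirroring the dry-run flags quoted in the Validation section.
envs = build_case_envs(
    parse_assignments(["GGML_RPC_TCP_BUFFER_SIZE=4194304"]),
    parse_assignments(["GGML_RPC_CACHE_MIN_SIZE=1048576"]),
)
```

Keeping the two dictionaries separate makes it straightforward to record, per case, exactly which variables were in effect when a result was produced.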