@Duyi-Wang @billishyahao @ZhaiFeiyue @TianDi101
Summary
We observed a large theoretical KV cache hit-rate difference between the older AgentX trace replay dataset and the newer v0.2 no-subagents dataset. This difference is visible even when looking at the full dataset theoretical hit rate, and it becomes much larger after applying the default MAX_CONTEXT=131072 constraint used by the replay scripts.
This means the dataset switch itself changes the expected KV cache behavior, independent of serving engine, scheduler, offload, or eviction policy changes.
Datasets Compared
| Dataset | Full theoretical token hit rate | Theoretical hit rate closer to measured replay with MAX_CONTEXT=131072 |
| --- | --- | --- |
| semianalysisai/cc-traces-weka-042026 | ~96.8% | ~96.0-96.2% |
| semianalysisai/cc-traces-weka-no-subagents-051226 | ~85.8% | ~65.6-69.9% |
Methodology
The numbers above were computed offline from the hash_ids field in each trace.
For each trace, I modeled an infinite local prefix cache:
- Cache scope is per trace because hash_id_scope=local.
- For each request, cache-hit tokens are computed from the longest contiguous prefix of hash_ids that matches any prior request in the same trace.
- Block size is 64 tokens.
- Cross-trace cache sharing is not counted.
- The MAX_CONTEXT=131072 estimate filters or truncates requests according to the replay-time context limit, to approximate what the scripts can actually send to the server.
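For reference, the per-trace model described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used for the numbers in the table. The function name `trace_hit_tokens` and the input layout (one list of hash_ids per request, in replay order) are assumptions, and it treats every request as exactly `len(hash_ids) * 64` tokens, which ignores any partial final block.

```python
BLOCK_SIZE = 64  # tokens per hash block, as stated in the methodology


def trace_hit_tokens(requests, block_size=BLOCK_SIZE):
    """Return (hit_tokens, total_tokens) for one trace.

    `requests` is a list of hash_ids lists in replay order (assumed layout).
    The cache is an infinite prefix trie scoped to this single trace.
    """
    root = {}  # trie of every block sequence seen so far in this trace
    hit_blocks = 0
    total_blocks = 0
    for hash_ids in requests:
        node = root
        matched = 0
        matching = True
        for h in hash_ids:
            # Count a hit only while the prefix is still contiguous.
            if matching and h in node:
                matched += 1
            else:
                matching = False
            # Insert the block so later requests can reuse it.
            node = node.setdefault(h, {})
        hit_blocks += matched
        total_blocks += len(hash_ids)
    return hit_blocks * block_size, total_blocks * block_size


# Toy trace: the second request extends the first, the third diverges.
hit, total = trace_hit_tokens([[1, 2, 3], [1, 2, 3, 4], [1, 2, 9]])
print(f"token hit rate: {hit / total:.1%}")
```

The dataset-level token hit rate is then the sum of hit tokens over all traces divided by the sum of total tokens, with a fresh trie per trace so that cross-trace sharing is never counted.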
Observations
The older cc-traces-weka-042026 dataset is highly cache-friendly. Its prompts mostly grow monotonically across turns, so most requests reuse nearly the full previous context.
The newer cc-traces-weka-no-subagents-051226 dataset is more realistic but much harder for prefix caching. It includes longer traces and more context compaction/reset behavior.
Key point: even before applying any context-length filter, the full theoretical token hit rate differs substantially: ~96.8% vs ~85.8%. That is already an ~11 percentage point drop caused by dataset distribution alone.
After applying MAX_CONTEXT=131072, the gap widens to roughly 26-31 percentage points:
- The older dataset stays at ~96.0-96.2%.
- The newer dataset drops to ~65.6-69.9%.
This indicates that the newer dataset contains a much larger fraction of long-context traces whose cache-hit distribution changes significantly when the max-context constraint is applied.
Why This Is a Problem
When comparing benchmark runs across dataset versions, a lower KV cache hit rate may be incorrectly attributed to the serving engine, scheduler, offload implementation, or cache eviction policy.
In reality, part of the difference comes from the dataset distribution itself, even at full-dataset theoretical level, and another large part comes from the interaction with MAX_CONTEXT.
This makes it hard to answer questions like:
- Did the cache implementation regress?
- Did the scheduler/offload policy reduce hit rate?
- Or did the dataset switch change the expected theoretical hit-rate ceiling?
Suggested Improvements
It would be helpful if the benchmark tooling reported dataset-level theoretical cache-hit ceilings before replay starts, for example:
- Full theoretical prefix-cache token hit rate.
- Effective theoretical prefix-cache token hit rate after MAX_CONTEXT.
- Number of requests/traces excluded or truncated by MAX_CONTEXT.
- Per-request hit-rate distribution, not only aggregate token-weighted hit rate.
This would make benchmark comparisons across dataset versions much easier to interpret.
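As a rough idea of what such a pre-replay report could look like, here is a small sketch. Everything in it is hypothetical: the function name `ceiling_report`, the output keys, and the input layout (a list of traces, each a list of per-request hash_ids lists) are illustrative only. It also assumes the simplest reading of the MAX_CONTEXT handling, namely that an over-limit request is skipped entirely and never enters the cache. The actual replay scripts may truncate instead.

```python
from statistics import median


def ceiling_report(traces, max_context=None, block_size=64):
    """Summarize theoretical prefix-cache ceilings for a dataset.

    `traces` is a list of traces; each trace is a list of hash_ids lists
    in replay order (assumed layout). Requests longer than `max_context`
    tokens are skipped (assumed policy), not truncated.
    """
    hit = total = skipped = 0
    per_request = []
    for requests in traces:
        root = {}  # fresh per-trace trie: no cross-trace sharing
        for hash_ids in requests:
            tokens = len(hash_ids) * block_size
            if max_context is not None and tokens > max_context:
                skipped += 1
                continue
            node, matched, matching = root, 0, True
            for h in hash_ids:
                if matching and h in node:
                    matched += 1
                else:
                    matching = False
                node = node.setdefault(h, {})
            hit += matched * block_size
            total += tokens
            if hash_ids:
                per_request.append(matched / len(hash_ids))
    return {
        "token_hit_rate": hit / total if total else 0.0,
        "skipped_requests": skipped,
        "median_request_hit_rate": median(per_request) if per_request else 0.0,
    }


toy = [[[1, 2], [1, 2, 3], [1, 2, 3, 4]], [[1, 2]]]
print(ceiling_report(toy))                   # full dataset
print(ceiling_report(toy, max_context=192))  # with a toy context limit
```

Printing both the unconstrained and the constrained numbers side by side would make it visible how much of a hit-rate change comes from the dataset and the context limit rather than from the serving stack.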