Breaking change between the MVP v0.1 and MVP v0.2 agentic trace datasets significantly changes the theoretical KV cache hit rate under MAX_CONTEXT=131072 #1632

@AMD-yanfeiwang

@Duyi-Wang @billishyahao @ZhaiFeiyue @TianDi101

Summary

We observed a large theoretical KV cache hit-rate difference between the older AgentX trace replay dataset and the newer v0.2 no-subagents dataset. This difference is visible even when looking at the full dataset theoretical hit rate, and it becomes much larger after applying the default MAX_CONTEXT=131072 constraint used by the replay scripts.

This means the dataset switch itself changes the expected KV cache behavior, independent of serving engine, scheduler, offload, or eviction policy changes.

Datasets Compared

| Dataset | Full theoretical token hit rate | Theoretical token hit rate with MAX_CONTEXT=131072 (closer to measured replay) |
|---|---|---|
| semianalysisai/cc-traces-weka-042026 | ~96.8% | ~96.0-96.2% |
| semianalysisai/cc-traces-weka-no-subagents-051226 | ~85.8% | ~65.6-69.9% |

Methodology

The numbers above were computed offline from the hash_ids field in each trace.

For each trace, I modeled an infinite local prefix cache:

  • Cache scope is per trace because hash_id_scope=local.
  • For each request, cache-hit tokens are computed from the longest contiguous prefix of hash_ids that matches any prior request in the same trace.
  • Block size is 64 tokens.
  • Cross-trace cache sharing is not counted.
  • The MAX_CONTEXT=131072 estimate filters or truncates requests according to the replay-time context limit, to approximate what the scripts can actually send to the server.
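The model described above can be sketched roughly as follows. This is a minimal illustration, not the script used to produce the numbers: it assumes each trace is already loaded as a list of requests, each request being its list of hash_ids, and it counts every block as a full 64 tokens (a partially filled last block is ignored). The function names `trace_hit_tokens` and `dataset_hit_rate` are hypothetical.

```python
from typing import Hashable, Sequence, Tuple

BLOCK_SIZE = 64  # tokens per hash block, as stated in the methodology


def trace_hit_tokens(
    requests: Sequence[Sequence[Hashable]], block_size: int = BLOCK_SIZE
) -> Tuple[int, int]:
    """Return (hit_tokens, total_tokens) for one trace with an infinite prefix cache."""
    trie: dict = {}  # nested dicts, one level per block; scoped to this trace only
    hit_tokens = 0
    total_tokens = 0
    for hash_ids in requests:
        node = trie
        matched = 0
        still_matching = True
        for h in hash_ids:
            # Count blocks only while the prefix is contiguous from the start.
            if still_matching and h in node:
                matched += 1
            else:
                still_matching = False
            # Insert the block so later requests in this trace can reuse it.
            node = node.setdefault(h, {})
        hit_tokens += matched * block_size
        total_tokens += len(hash_ids) * block_size
    return hit_tokens, total_tokens


def dataset_hit_rate(traces, block_size: int = BLOCK_SIZE) -> float:
    """Token-weighted hit rate over all traces; no cross-trace sharing."""
    hit = total = 0
    for requests in traces:
        h, t = trace_hit_tokens(requests, block_size)
        hit += h
        total += t
    return hit / total if total else 0.0
```

Because the trie is rebuilt for every trace, identical hash_ids in two different traces never count as hits, which matches the hash_id_scope=local assumption.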

Observations

The older cc-traces-weka-042026 dataset is highly cache-friendly. Its prompts mostly grow monotonically across turns, so most requests reuse nearly the full previous context.

The newer cc-traces-weka-no-subagents-051226 dataset is more realistic but much harder for prefix caching. It includes longer traces and more context compaction/reset behavior.
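To illustrate why compaction or reset hurts prefix caching, here is a toy comparison on synthetic block IDs (not taken from either dataset). It assumes a compaction keeps a short shared head and rewrites everything after it, so only the head can match earlier requests. The helper `block_hit_rate` is hypothetical and works in blocks, which is equivalent to tokens when every block has the same size.

```python
def block_hit_rate(requests):
    """Prefix-cache hit rate (in blocks) for one trace with an infinite cache."""
    seen = set()  # every block prefix sent so far in this trace
    hit = total = 0
    for blocks in requests:
        matched = 0
        for i in range(1, len(blocks) + 1):
            if tuple(blocks[:i]) not in seen:
                break
            matched = i
        # Register all prefixes of this request for later requests.
        for i in range(1, len(blocks) + 1):
            seen.add(tuple(blocks[:i]))
        hit += matched
        total += len(blocks)
    return hit / total


# Synthetic trace whose prompt grows monotonically: each turn extends the last.
growing = [list(range(n)) for n in (10, 20, 30, 40)]

# Synthetic trace with a compaction after turn 2: the first two blocks survive,
# the rest of the context is rewritten into new blocks (IDs 100 and up).
compacted = [
    list(range(10)),
    list(range(20)),
    [0, 1] + [100 + i for i in range(8)],
    [0, 1] + [100 + i for i in range(18)],
]

print(block_hit_rate(growing))    # 0.6
print(block_hit_rate(compacted))  # 22 / 60, about 0.367
```

With more turns the monotonically growing trace approaches a hit rate of 1, while every compaction forces most of the rewritten context to be recomputed.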

Key point: even before applying any context-length filter, the full theoretical token hit rate differs substantially: ~96.8% vs ~85.8%. That is already an ~11 percentage point drop caused by dataset distribution alone.

After applying MAX_CONTEXT=131072, the gap becomes much larger:

  • The older dataset stays at ~96.0-96.2%.
  • The newer dataset drops to ~65.6-69.9%.

This indicates that the newer dataset contains a much larger fraction of long-context traces whose cache-hit distribution changes significantly when the max-context constraint is applied.

Why This Is a Problem

When comparing benchmark runs across dataset versions, a lower KV cache hit rate may be incorrectly attributed to the serving engine, scheduler, offload implementation, or cache eviction policy.

In reality, part of the difference comes from the dataset distribution itself, visible even at the full-dataset theoretical level, and another large part comes from the interaction with MAX_CONTEXT.

This makes it hard to answer questions like:

  • Did the cache implementation regress?
  • Did the scheduler/offload policy reduce hit rate?
  • Or did the dataset switch change the expected theoretical hit-rate ceiling?

Suggested Improvements

It would be helpful if the benchmark tooling reported dataset-level theoretical cache-hit ceilings before replay starts, for example:

  • Full theoretical prefix-cache token hit rate.
  • Effective theoretical prefix-cache token hit rate after MAX_CONTEXT.
  • Number of requests/traces excluded or truncated by MAX_CONTEXT.
  • Per-request hit-rate distribution, not only aggregate token-weighted hit rate.
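A pre-replay report along these lines might look like the sketch below. It is only an illustration under stated assumptions: traces are lists of requests, each request is its list of hash_ids, token counts are approximated as blocks times 64, and requests over the limit are dropped entirely (neither sent nor cached). The actual replay scripts may truncate instead, so the MAX_CONTEXT handling here is a placeholder. The name `ceiling_report` and the returned keys are hypothetical.

```python
from statistics import median

BLOCK_SIZE = 64        # tokens per hash block (from the methodology)
MAX_CONTEXT = 131072   # replay-time context limit (from the issue)


def ceiling_report(traces, max_context=None, block_size=BLOCK_SIZE):
    """Summarize the theoretical prefix-cache ceiling of a dataset."""
    hit = total = excluded = 0
    per_request = []  # hit rate of each request that is actually sent
    for requests in traces:
        trie = {}  # per-trace cache, no cross-trace sharing
        for hash_ids in requests:
            tokens = len(hash_ids) * block_size
            if max_context is not None and tokens > max_context:
                # Assumption: over-limit requests are dropped, not truncated.
                excluded += 1
                continue
            node, matched, matching = trie, 0, True
            for h in hash_ids:
                matching = matching and h in node
                matched += matching  # True counts as 1
                node = node.setdefault(h, {})
            hit += matched * block_size
            total += tokens
            if tokens:
                per_request.append(matched * block_size / tokens)
    return {
        "token_hit_rate": hit / total if total else 0.0,
        "excluded_requests": excluded,
        "median_request_hit_rate": median(per_request) if per_request else 0.0,
        "zero_hit_requests": sum(1 for r in per_request if r == 0.0),
    }
```

Calling `ceiling_report(traces)` and `ceiling_report(traces, max_context=MAX_CONTEXT)` side by side would give the full and the effective ceiling for the same dataset, plus the number of requests the limit removes.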

This would make benchmark comparisons across dataset versions much easier to interpret.
