
Feature/attn decode dynamic sched #46

Open
shaochangxu wants to merge 2 commits into Tencent:main from shaochangxu:feature/attn_decode_dynamic_sched



@shaochangxu
Contributor

During the decode phase of online inference, KV cache lengths can differ significantly across requests in the same batch. Existing attention operators partition work based on a static grid dimension, so either long requests become bottlenecks or short requests generate many empty tasks, which leads to severe CTA load imbalance.

We implement a dynamic task scheduling scheme: the KV caches of all requests are first divided into uniform tiles of 64 tokens. A greedy bin-packing algorithm then distributes these tiles evenly across all CTAs, so that every CTA receives a nearly equal workload, removing the long-tail effect caused by uneven KV cache lengths.
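The scheduling idea above can be illustrated with a small host-side sketch. This is not the code in this PR: the names `plan_decode_tiles` and `WorkItem` are hypothetical, and the greedy fill shown here (walk the requests in order and top up each CTA to its quota) is an assumed, simplified stand-in for the bin-packing step. Only the 64-token tile size comes from the description.

```python
from typing import List, NamedTuple

TILE_TOKENS = 64  # tile granule stated in the PR description


class WorkItem(NamedTuple):
    """Hypothetical work descriptor: a run of consecutive KV tiles of one request."""
    request: int     # index of the request in the batch
    tile_start: int  # first KV tile covered by this item
    tile_count: int  # number of consecutive tiles


def plan_decode_tiles(kv_lens: List[int], num_ctas: int) -> List[List[WorkItem]]:
    """Assign KV tiles to CTAs so that per-CTA tile counts differ by at most one."""
    if num_ctas <= 0:
        raise ValueError("num_ctas must be positive")
    # Ceil-divide each KV length into 64-token tiles.
    tiles = [(n + TILE_TOKENS - 1) // TILE_TOKENS for n in kv_lens]
    # Every CTA gets `base` tiles; the first `extra` CTAs get one more.
    base, extra = divmod(sum(tiles), num_ctas)
    plan: List[List[WorkItem]] = [[] for _ in range(num_ctas)]
    cta = 0
    room = base + (1 if cta < extra else 0)
    for req, remaining in enumerate(tiles):
        start = 0
        while remaining > 0:
            # Move on to the next CTA once the current quota is used up.
            while room == 0:
                cta += 1
                room = base + (1 if cta < extra else 0)
            take = min(remaining, room)
            plan[cta].append(WorkItem(req, start, take))
            start += take
            remaining -= take
            room -= take
    return plan


if __name__ == "__main__":
    plan = plan_decode_tiles([4096, 100, 64, 700], num_ctas=4)
    for cta_id, items in enumerate(plan):
        print(cta_id, sum(w.tile_count for w in items), items)
```

For KV lengths of 4096, 100, 64, and 700 tokens on 4 CTAs, the batch holds 78 tiles and the sketch produces per-CTA loads of 20, 20, 19, and 19 tiles, whereas a static one-request-per-CTA split would leave one CTA with 64 tiles and another with 1. In this sketch a long request is split across several CTAs, so a real kernel would also have to merge the partial attention results of that request; that step is outside the sketch.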
[Screenshot: Clipboard_Screenshot_1779962183]
