Feature/attn decode dynamic sched by shaochangxu · Pull Request #46 · Tencent/hpc-ops

shaochangxu · 2026-05-28T10:48:09Z

In online inference, the decode phase suffers from large differences in KV cache length across requests in the same batch. Existing attention operators partition tasks over a static grid dimension, so long requests become bottlenecks or short requests generate many empty tasks, leading to severe CTA load imbalance. We implement a dynamic task scheduling scheme: the KV caches of all requests are first divided into uniform tiles (64 tokens each). A greedy bin-packing algorithm then distributes these tiles evenly across all CTAs, making the workload of each CTA nearly equal and fundamentally eliminating the long-tail effect.
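The scheduling idea above can be illustrated with a minimal host-side sketch. This is not the code from this PR: the function name `schedule_tiles`, its parameters, and the use of a min-heap to pick the least-loaded CTA are assumptions made for illustration. Only the 64-token tile size and the goal of even per-CTA load come from the description.

```python
import heapq
from typing import List, Tuple

TILE_TOKENS = 64  # tile size stated in the PR description


def schedule_tiles(
    kv_lens: List[int], num_ctas: int, tile_tokens: int = TILE_TOKENS
) -> List[List[Tuple[int, int]]]:
    """Hypothetical helper: return, per CTA, a list of (request, tile) work items.

    Greedy rule (an assumption, not necessarily the PR's exact policy):
    each tile goes to the CTA that currently holds the fewest tiles.
    """
    if num_ctas <= 0:
        raise ValueError("num_ctas must be positive")

    assignments: List[List[Tuple[int, int]]] = [[] for _ in range(num_ctas)]
    # Min-heap of (tiles_assigned, cta_index); ties resolve to the lower index.
    heap = [(0, cta) for cta in range(num_ctas)]
    heapq.heapify(heap)

    for req, kv_len in enumerate(kv_lens):
        # Ceiling division: a partially filled last tile still counts as one tile.
        num_tiles = (kv_len + tile_tokens - 1) // tile_tokens
        for tile in range(num_tiles):
            load, cta = heapq.heappop(heap)
            assignments[cta].append((req, tile))
            heapq.heappush(heap, (load + 1, cta))

    return assignments


if __name__ == "__main__":
    # One long request next to three short ones, spread over 8 CTAs.
    plan = schedule_tiles([4096, 128, 64, 1000], num_ctas=8)
    print([len(items) for items in plan])
```

Because every tile has the same size, this greedy rule keeps the per-CTA tile counts within one of each other. In the example, the four requests produce 64 + 2 + 1 + 16 = 83 tiles, so each of the 8 CTAs receives 10 or 11 tiles, whereas a hypothetical one-CTA-per-request mapping would leave a single CTA with 64 tiles. When the tiles of one request land on several CTAs, the partial attention results have to be merged afterwards; that step is outside this sketch.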

chaseshao added 2 commits May 28, 2026 16:41

support dynamic schedule attention decode and remove unused code

ca2b71c

support dynamic schedule attention decode and remove unused code

348b38f

shaochangxu wants to merge 2 commits into Tencent:main from shaochangxu:feature/attn_decode_dynamic_sched
