[Feat] QWen-1M context support[2/2]: Update block sparse attention backend#5949
Conversation
XAttention seems to provide better performance and lower overhead than MInference. Are you guys also looking into using something like XAttention?
Please resolve conflicts. Currently there are some restrictions for the DCA backend: 1. CUDA graph, 2. radix cache, 3. DP attention. We need to disable CUDA graph and radix cache automatically, and add an assertion that DP attention is disabled.
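For reference, a minimal sketch of what such a guard could look like during server-argument post-processing. The flag names mirror sglang's existing `--disable-cuda-graph`, `--disable-radix-cache`, and `--enable-dp-attention` options, but the helper function and the backend-name string are hypothetical, not this PR's actual code:

```python
def adjust_for_dual_chunk_attention(server_args):
    """Force-disable features the DCA backend cannot support yet.

    Hypothetical helper: attribute names follow sglang's ServerArgs flags,
    but this is a sketch of the requested behavior, not the PR's code.
    """
    if server_args.attention_backend != "dual_chunk_flash_attn":
        return
    if not server_args.disable_cuda_graph:
        print("DCA backend does not support CUDA graph; disabling it.")
        server_args.disable_cuda_graph = True
    if not server_args.disable_radix_cache:
        print("DCA backend does not support radix cache; disabling it.")
        server_args.disable_radix_cache = True
    # DP attention cannot be transparently turned off, so fail fast instead.
    assert not server_args.enable_dp_attention, (
        "The DCA backend is incompatible with --enable-dp-attention; "
        "please remove the flag."
    )
```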
@renjie0, this is a great proposal. However, while planning to integrate XAttention into sglang, I discovered that MInference already supports it (see https://github.com/microsoft/MInference/blob/main/minference/modules/xattention.py). Even so, I believe it's still worthwhile, so I'll proceed.
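For context, XAttention's core idea is to score each (query-block, key-block) tile of the attention map by the sum of its strided antidiagonal entries, then keep only the highest-scoring tiles. The toy sketch below illustrates just the scoring criterion; it materializes the full attention matrix for clarity, which the actual kernels (e.g., MInference's `xattention.py`) avoid, and all names here are illustrative:

```python
import torch

def antidiagonal_block_scores(q, k, block_size=64, stride=8):
    """Score attention blocks by strided antidiagonal sums (XAttention-style).

    q, k: [seq_len, head_dim] for a single head; seq_len is assumed to be
    divisible by block_size. Illustrative sketch only.
    """
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # [S, S]
    n = q.shape[0] // block_size
    blocks = attn.reshape(n, block_size, n, block_size).permute(0, 2, 1, 3)
    # Keep every `stride`-th antidiagonal (entries with (i + j) % stride == 0)
    # inside each block and sum them as that block's importance score.
    idx = torch.arange(block_size)
    mask = (idx[:, None] + idx[None, :]) % stride == 0
    return (blocks * mask).sum(dim=(-2, -1))  # [n_q_blocks, n_k_blocks]
```

Blocks would then be selected in descending score order until a cumulative-probability threshold is reached, and attention is computed only over the selected blocks, which is where the low overhead comes from.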
@FlamingoPg MInference is best suited for extremely long inputs. Per our benchmark, with Qwen2.5, performance gains only begin once the input exceeds 200k tokens, and those gains become pronounced past 600k tokens. Do we have any benchmark results for Qwen3-1M with this backend?
@yuan-luo To be honest, the current MInference backend is still at a very preliminary stage. It supports neither CUDA graph nor prefix cache. Perhaps other methods could achieve better efficiency.
@FlamingoPg Got it. Thanks for the information.
Motivation
Stacked PR:
[1/2]: #5847 (comment)
TODO: support CUDA graphs for the block sparse attention backend
Modifications
Checklist