Support nextn for flashinfer mla attention backend#4218
zhyncs merged 3 commits into sgl-project:main
Usage
I just ran an experiment on 16 x A800 GPUs, using block-wise INT8 with nextn for flashinfer (this PR and #3911) and with torch compile enabled.
Benchmark
Input-256-Output-256 (bs1)
Input-256-Output-256 (bs16)
Another Config
With
Benchmark
Input-256-Output-256 (bs1)
Input-256-Output-256 (bs16)
merrymercy left a comment
Add a test case like this and assert the acceptance length:
sglang/test/srt/test_mla_deepseek_v3.py, lines 106 to 109 in 79a321a
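As a rough illustration of the kind of check being requested, the sketch below asserts a lower bound on the average acceptance length. It is not the referenced test from `test_mla_deepseek_v3.py`. It assumes that acceptance length means the average number of tokens committed per target verification step, and the helper names, sample counts, and threshold are all hypothetical.

```python
from typing import Sequence


def average_accept_length(accepted_per_step: Sequence[int]) -> float:
    """Average number of tokens accepted per verification step (assumed definition)."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return sum(accepted_per_step) / len(accepted_per_step)


def check_accept_length(accepted_per_step: Sequence[int], min_length: float) -> float:
    """Fail if speculative decoding accepts fewer tokens per step than expected."""
    avg = average_accept_length(accepted_per_step)
    assert avg > min_length, f"average accept length {avg:.2f} <= {min_length}"
    return avg


# Placeholder counts for illustration only, not measurements from this PR.
sample_steps = [2, 3, 1, 2, 2]
# Hypothetical threshold; with no speculation every step accepts exactly 1 token.
check_accept_length(sample_steps, min_length=1.0)
```

A threshold above 1.0 is the natural floor in this sketch, because a step that accepts only the target model's own token gives no speedup.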
I noticed that the changes applied in #4217 will reject this argument combination (see sglang/python/sglang/srt/server_args.py, line 289 in df84ab2). This comment is just a reminder to prevent others from encountering the same confusion I experienced.
BTW maybe the I met the error below when set it to 3: Because
@junliu-mde Currently you can try
@lambert0312 could you share your dependencies? I am running into the following problem:
Try upgrading sgl-kernel to the latest version, 0.0.4.
Already on 0.0.4.
@xihuai18 I used this PR, with a modification I mentioned earlier: PR #3911 (MTP with INT8 support).
Thanks, I will try.
How about the accuracy in #3911?
Wait, I'll run one. @xihuai18
Usage (Flashinfer Only)
Accuracy
GSM8K
MMLU
Usage (Flashinfer + NextN)
Accuracy
GSM8K
MMLU
Motivation
Make nextn compatible with the flashinfer mla attention backend. Currently, topk can only be set to 1 because the flashinfer MLA wrapper lacks custom mask support.
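One way to see why the missing custom mask limits topk to 1: with a single candidate per step the draft tokens form a chain, whose attention pattern is an ordinary causal mask, whereas keeping several candidates per step produces a tree in which sibling tokens must not attend to each other. The following is a minimal sketch of that idea only. It is not sglang or flashinfer code, and the function names and example trees are hypothetical.

```python
from typing import List, Optional


def draft_attention_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when draft token i may attend to draft token j.

    parents[i] is the index of token i's parent in the draft tree, or None
    when the token hangs directly off the already verified prefix. A token
    may attend only to itself and to its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:  # walk up from token i to the root
            mask[i][j] = True
            j = parents[j]
    return mask


def is_causal(mask: List[List[bool]]) -> bool:
    """True when the mask is exactly the lower-triangular causal mask."""
    n = len(mask)
    return all(mask[i][j] == (j <= i) for i in range(n) for j in range(n))


# topk = 1: one candidate per step, so the draft is a chain 0 -> 1 -> 2 -> 3.
chain = draft_attention_mask([None, 0, 1, 2])
# topk = 2 (hypothetical): tokens 0 and 1 are siblings; 2 and 3 are children of 0.
tree = draft_attention_mask([None, None, 0, 0])

print(is_causal(chain))  # True: a plain causal mask is enough
print(is_causal(tree))   # False: siblings need a custom mask
```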
Modifications
- `FlashInferMLAMultiStepDraftBackend` for the draft model when using flashinfer mla and eagle together.
- `FlashInferMLABackend` so that draft extend and target verify batches can be handled.
Usage
The constraints of parameters:
- `speculative-eagle-topk` should be set to 1
- `speculative-num-draft-tokens` should be a power of 2
Accuracy
GSM8K
MMLU
Benchmark
The benchmarks are run on 8 x H200 GPUs. The metric is total throughput (tokens/sec). Each benchmark is run five times, and the average is reported.
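For clarity, the averaging described above can be sketched as follows. This assumes that total throughput counts both input and output tokens over the wall-clock time of a run; the function names are hypothetical and the run values are placeholders, not results from this PR.

```python
from statistics import mean
from typing import Iterable, Tuple

# (input_tokens, output_tokens, elapsed_seconds) for one benchmark run
Run = Tuple[int, int, float]


def total_throughput(input_tokens: int, output_tokens: int, elapsed_seconds: float) -> float:
    """Total tokens processed per second for one run (assumed definition)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (input_tokens + output_tokens) / elapsed_seconds


def average_throughput(runs: Iterable[Run]) -> float:
    """Mean of the per-run total throughput values."""
    return mean(total_throughput(*run) for run in runs)


# Five placeholder runs, for illustration only.
placeholder_runs = [
    (4000, 200, 2.0),
    (4000, 200, 2.1),
    (4000, 200, 1.9),
    (4000, 200, 2.0),
    (4000, 200, 2.0),
]
print(f"{average_throughput(placeholder_runs):.1f} tokens/sec")
```

Averaging the per-run throughput, rather than dividing summed tokens by summed time, weights every run equally regardless of how long it took.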
Launch
Input-4000-Output-200
Input-128-Output-128
Single prompt
Checklist