[perf] Optimize w4afp8 kernel on deepseek-v3-0324 #12921
BBuf merged 7 commits into sgl-project:main from
Conversation
@yuhyao could you please help review this one? Thanks.
Code Review
This pull request introduces significant performance optimizations for the w4afp8 kernel, particularly for the deepseek-v3-0324 model. The changes include fine-tuning GEMM tiling configurations and replacing slow data preparation logic with highly efficient, parallelized versions using CUB. The use of vectorized memory access and parallel prefix sums are excellent improvements. The code is also made more maintainable by reducing duplication. Overall, these are solid enhancements that deliver the benchmarked performance gains.
When this PR is ready for review, ping me.
Force-pushed from 6cb5b59 to 5ebd9b6
@FlamingoPg hello, could you help trigger CI again? I don't see any w4afp8-related errors in the failed cases, thanks.
Force-pushed from fbf9b89 to c6462f7
@FlamingoPg hello, could you help trigger it again? Thanks. The failing case is not related to my changes.
Sorry for the late reply. I will help review this PR today.
Hello, could you help trigger CI again? All the failed cases are unrelated to my changes, thanks.
@Bruce-x-1997 Hi, please check the comments.
OK, thanks for your comment, I will fix it ASAP.
Signed-off-by: bruce.xu <bruce.x@gmicloud.ai>
Force-pushed from c6462f7 to 7915d93
@yuhyao hello, I cannot see any comments from you.
Thanks for adding the information to the description.
@FlamingoPg hello, could you help trigger CI again?
@FlamingoPg could you help trigger CI again? Thanks.
@FlamingoPg hello, could you help trigger CI again? Thanks.
You can try re-running the failed CI jobs.
Hello, how can I rerun the failed CI jobs @AniZpZ? I don't see any button to rerun my failed jobs, could you tell me?
/tag-and-rerun-ci

1 similar comment

/tag-and-rerun-ci

/tag-and-rerun-ci
```cpp
void compute_expert_offsets_w4a8(
    cudaStream_t stream, const int32_t* problem_sizes1, int32_t* expert_offsets, int n, int stride = 1, int off = 0) {
#define compute_expert_offsets_w4a8_call(BLOCK_SIZE) \
```
Suggested change:

```diff
-#define compute_expert_offsets_w4a8_call(BLOCK_SIZE) \
+#define compute_expert_offsets_w4a8_call(BLOCK_SIZE) ...
+...
+#undef compute_expert_offsets_w4a8_call
```
Use `#undef` to clean up the macro definition inside the function.
/tag-and-rerun-ci

/tag-and-rerun-ci

/tag-and-rerun-ci
Merging with CI passed. https://github.com/sgl-project/sglang/actions/runs/20301300452/job/58407409045?pr=12921
Signed-off-by: bruce.xu <bruce.x@gmicloud.ai>
Motivation
We run w4afp8 deepseek-v3-0324 online, and we found its performance is not good enough when the decode batch size is < 32.
Modifications

- Fine-grained GEMM tiling config.
- Based on https://github.com/sgl-project/sglang/pull/10027/files, use CUDA int4 vectorized memory access to reduce memory-access pressure.
Accuracy Tests
deepseek-v3-0324 w4afp8
Benchmarking and Profiling

Prefill

(before/after profile screenshots elided) ~10% improvement

Decode

(before/after profile screenshots elided) ~5% improvement
bench_serving end-to-end results

Case 1: (before/after screenshots elided)

Case 2: (before/after screenshots elided)

Launch service command: (elided)
Checklist