fix per token cuda kernel hidden dim cannot divide by 16 #8543

Merged
hebiao064 merged 10 commits into main from bhe/fix_per_token_hidden_16 on Aug 1, 2025

Conversation

@hebiao064 (Collaborator) commented Jul 29, 2025

File Changes

  • /workspace/sglang/sgl-kernel/benchmark/bench_per_token_quant_fp8.py
  • /workspace/sglang/sgl-kernel/csrc/gemm/per_token_quant_fp8.cu

All other compilation-related file changes will be reverted; I made them only to accelerate my development process.

Motivation

First of all, I updated the benchmark to include torch's quant implementation, which surfaced that neither the vllm nor the sglang quant kernel is quite as accurate as torch. I still consider this acceptable, since vllm and sglang deliver near-identical quantization results.

Secondly, I modified the kernel to allow hidden dims like 1368, which previously failed due to #8460; now this is solved.
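
For context, a pure-PyTorch per-token FP8 quantization reference can be written roughly as below. This is a minimal sketch assuming e4m3 FP8 and one scale per token row; the reference actually added to bench_per_token_quant_fp8.py may differ in details such as the epsilon and dtype handling.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def per_token_quant_fp8_ref(x: torch.Tensor):
    # x: [num_tokens, hidden_dim] in fp16/bf16/fp32.
    x_f32 = x.to(torch.float32)
    # One scale per token, derived from the row-wise absolute maximum.
    amax = x_f32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10)
    scale = amax / FP8_MAX
    q = (x_f32 / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale.squeeze(-1)
```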

E2E Test

TP 4:

Accuracy: 0.940
Invalid: 0.000
Latency: 17.850 s
Output throughput: 1098.818 token/s

TP 8 (before this PR, TP 8 would fail):
Need to use USE_VLLM_CUTLASS_W8A8_FP8_KERNEL, since our cutlass fp8 kernel doesn't support this case and raises:
RuntimeError: mat_a must be multiple of 16 bytes for memory alignment

Accuracy: 0.945
Invalid: 0.000
Latency: 15.905 s
Output throughput: 1248.437 token/s
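
The comparison blocks below report the maximum absolute difference in scales and outputs between the three implementations, plus an allclose check at the stated tolerances. A minimal sketch of how such a comparison can be computed (the helper name and reporting format are illustrative, not the benchmark's actual code):

```python
import torch

def report_diff(name_a, q_a, s_a, name_b, q_b, s_b, rtol=1e-3, atol=1e-5):
    # Compare per-token scales and quantized outputs (cast to float) of two
    # implementations and report max-abs differences plus an allclose check.
    scale_diff = (s_a.float() - s_b.float()).abs().max().item()
    out_diff = (q_a.float() - q_b.float()).abs().max().item()
    match = torch.allclose(q_a.float(), q_b.float(), rtol=rtol, atol=atol)
    print(f"  {name_a} vs {name_b}: scale {scale_diff:.8f}, "
          f"output {out_diff:.8f}, match={'✅' if match else '❌'}")
```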

Before this PR:

Hidden Dim 1368: RuntimeError: Hidden dimension must be divisible by 16, but got 1368

=== Comparison for hidden_dim=2048 ===
Scale differences:
  Torch vs VLLM:   0.00000127
  Torch vs SGLang: 0.00000049
  VLLM vs SGLang:  0.00000092
Output differences:
  Torch vs VLLM:   0.10605328
  Torch vs SGLang: 0.10605328
  VLLM vs SGLang:  0.00000000
Matches (rtol=0.001, atol=1e-05):
  Torch vs VLLM:   ❌
  Torch vs SGLang: ❌
  VLLM vs SGLang:  ✅

=== Comparison for hidden_dim=4096 ===
Scale differences:
  Torch vs VLLM:   0.00000080
  Torch vs SGLang: 0.00000052
  VLLM vs SGLang:  0.00000031
Output differences:
  Torch vs VLLM:   0.11732540
  Torch vs SGLang: 0.11732540
  VLLM vs SGLang:  0.00000000
Matches (rtol=0.001, atol=1e-05):
  Torch vs VLLM:   ❌
  Torch vs SGLang: ❌
  VLLM vs SGLang:  ✅

============================================================
Starting performance benchmark...
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len  hidden_dim  Torch Reference         VLLM   SGL Kernel
0         16.0     64.0      2048.0        49.376000    15.776001    15.776001
1         16.0     64.0      4096.0        67.648001    20.864001    19.264000
2         16.0    128.0      2048.0        62.080000    21.952000    20.576000
3         16.0    128.0      4096.0        95.551997    32.800000    27.807999
4         16.0    256.0      2048.0        90.559997    34.752000    28.448001
5         16.0    256.0      4096.0       182.239994    58.176000    49.120001
6         16.0    512.0      2048.0       173.536003    63.327998    50.783999
7         16.0    512.0      4096.0       316.960007   104.128003   109.632000
8         16.0   1024.0      2048.0       323.103994   114.239998    99.296004
9         16.0   1024.0      4096.0       599.776030   195.999995   210.784003
10        16.0   2048.0      2048.0       606.271982   215.616003   189.903997
11        16.0   2048.0      4096.0      1152.944028   377.472013   404.320002
12        16.0   4096.0      2048.0      1166.960001   416.608006   376.704007
13        16.0   4096.0      4096.0      2266.863942   746.799976   797.439992
14        32.0     64.0      2048.0        61.951999    21.952000    19.552000
15        32.0     64.0      4096.0        95.232002    32.768000    28.704001
16        32.0    128.0      2048.0        89.631997    34.655999    26.880000
17        32.0    128.0      4096.0       181.456000    58.208000    49.984001
18        32.0    256.0      2048.0       174.000002    63.183997    49.632002
19        32.0    256.0      4096.0       316.383988   103.840001   109.600000
20        32.0    512.0      2048.0       322.576001   113.920003    99.296004
21        32.0    512.0      4096.0       598.847985   195.519999   210.304007
22        32.0   1024.0      2048.0       604.704022   215.839997   189.344004
23        32.0   1024.0      4096.0      1153.232038   377.568007   406.623989
24        32.0   2048.0      2048.0      1167.199969   416.575998   378.176004
25        32.0   2048.0      4096.0      2267.087936   745.280027   797.536016
26        32.0   4096.0      2048.0      2296.047926   823.328018   756.735981
27        32.0   4096.0      4096.0      4488.671780  1476.575971  1577.455997
28        64.0     64.0      2048.0        89.711998    34.784000    27.392000
29        64.0     64.0      4096.0       182.287998    58.143999    48.767999
30        64.0    128.0      2048.0       174.112007    63.327998    50.655998
31        64.0    128.0      4096.0       315.647990   103.840001   109.600000
32        64.0    256.0      2048.0       322.495997   113.920003    99.264003
33        64.0    256.0      4096.0       598.752022   195.519999   210.304007
34        64.0    512.0      2048.0       604.896009   215.519994   188.960001
35        64.0    512.0      4096.0      1153.232038   377.535999   404.864013
36        64.0   1024.0      2048.0      1167.583942   416.927993   375.616014
37        64.0   1024.0      4096.0      2267.311931   746.815979   797.695994
38        64.0   2048.0      2048.0      2296.800017   824.687988   756.352007
39        64.0   2048.0      4096.0      4489.823818  1477.920055  1577.536047
40        64.0   4096.0      2048.0      4549.664021  1634.064019  1504.271984
41        64.0   4096.0      4096.0      8928.223610  2939.167976  3138.047934
42       128.0     64.0      2048.0       173.184007    63.167997    49.408000
43       128.0     64.0      4096.0       315.647990   103.904001   108.704001
44       128.0    128.0      2048.0       322.528005   113.920003   100.383997
45       128.0    128.0      4096.0       598.768026   195.552006   211.776003
46       128.0    256.0      2048.0       605.632007   215.552002   191.264004
47       128.0    256.0      4096.0      1154.207945   377.696007   405.008003
48       128.0    512.0      2048.0      1167.423964   416.864008   376.255989
49       128.0    512.0      4096.0      2267.567992   746.847987   797.504008
50       128.0   1024.0      2048.0      2296.687961   824.959993   756.640017
51       128.0   1024.0      4096.0      4490.496159  1477.823973  1577.520013
52       128.0   2048.0      2048.0      4549.248219  1633.967996  1503.872037
53       128.0   2048.0      4096.0      8930.080414  2940.608025  3139.199972
54       128.0   4096.0      2048.0      9051.616192  3251.071930  2991.216063
55       128.0   4096.0      4096.0     17895.647049  5863.103867  6262.495995

After this PR:

=== Comparison for hidden_dim=1368 ===
Scale differences:
  Torch vs VLLM:   0.00000178
  Torch vs SGLang: 0.00000048
  VLLM vs SGLang:  0.00000151
Output differences:
  Torch vs VLLM:   0.09933983
  Torch vs SGLang: 0.09933983
  VLLM vs SGLang:  0.00000000
Matches (rtol=0.01, atol=1e-05):
  Torch vs VLLM:   ❌
  Torch vs SGLang: ❌
  VLLM vs SGLang:  ✅

=== Comparison for hidden_dim=2048 ===
Scale differences:
  Torch vs VLLM:   0.00000128
  Torch vs SGLang: 0.00000049
  VLLM vs SGLang:  0.00000093
Output differences:
  Torch vs VLLM:   0.10555448
  Torch vs SGLang: 0.10555448
  VLLM vs SGLang:  0.00000000
Matches (rtol=0.001, atol=1e-05):
  Torch vs VLLM:   ❌
  Torch vs SGLang: ❌
  VLLM vs SGLang:  ✅

=== Comparison for hidden_dim=4096 ===
Scale differences:
  Torch vs VLLM:   0.00000079
  Torch vs SGLang: 0.00000052
  VLLM vs SGLang:  0.00000029
Output differences:
  Torch vs VLLM:   0.11747192
  Torch vs SGLang: 0.11747192
  VLLM vs SGLang:  0.00000000
Matches (rtol=0.001, atol=1e-05):
  Torch vs VLLM:   ❌
  Torch vs SGLang: ❌
  VLLM vs SGLang:  ✅

============================================================
Starting performance benchmark...
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len  hidden_dim  Torch Reference         VLLM   SGL Kernel
0         16.0     64.0      1368.0        44.383999    15.744001    15.328000
1         16.0     64.0      2048.0        51.456001    16.096000    14.976000
2         16.0     64.0      4096.0        68.544000    20.992000    20.191999
3         16.0    128.0      1368.0        53.440001    21.120001    19.296000
4         16.0    128.0      2048.0        63.135996    22.016000    19.471999
5         16.0    128.0      4096.0        97.888000    33.216000    29.023999
6         16.0    256.0      1368.0        69.983996    32.832000    35.872001
7         16.0    256.0      2048.0        91.328003    35.135999    28.640000
8         16.0    256.0      4096.0       183.487996    58.336001    50.112002
9         16.0    512.0      1368.0       119.680002    60.192000    63.135996
10        16.0    512.0      2048.0       175.392002    63.840002    50.880000
11        16.0    512.0      4096.0       317.216009   104.383998   109.792002
12        16.0   1024.0      1368.0       230.816007   127.967998   116.159998
13        16.0   1024.0      2048.0       323.103994   114.464000   100.639999
14        16.0   1024.0      4096.0       599.712014   195.391998   210.848004
15        16.0   2048.0      1368.0       422.656000   240.927994   214.880005
16        16.0   2048.0      2048.0       605.632007   214.944005   189.824000
17        16.0   2048.0      4096.0      1154.160023   378.352001   406.208009
18        16.0   4096.0      1368.0       804.159999   467.359990   410.640001
19        16.0   4096.0      2048.0      1169.535995   417.535990   378.048003
20        16.0   4096.0      4096.0      2268.192053   747.135997   797.472000
21        32.0     64.0      1368.0        53.408001    21.120001    17.952001
22        32.0     64.0      2048.0        62.912002    21.984000    20.479999
23        32.0     64.0      4096.0        97.631998    33.119999    29.247999
24        32.0    128.0      1368.0        69.920003    32.736000    34.784000
25        32.0    128.0      2048.0        91.072001    35.039999    28.896000
26        32.0    128.0      4096.0       182.239994    58.336001    50.528001
27        32.0    256.0      1368.0       118.752003    59.904002    62.752001
28        32.0    256.0      2048.0       174.255997    63.712001    50.175998
29        32.0    256.0      4096.0       316.783994   104.064003   109.760001
30        32.0    512.0      1368.0       229.984000   127.616003   116.351999
31        32.0    512.0      2048.0       322.111994   114.143997   100.383997
32        32.0    512.0      4096.0       598.111987   195.424005   210.559994
33        32.0   1024.0      1368.0       423.359990   240.704000   214.368001
34        32.0   1024.0      2048.0       604.896009   215.072006   190.144002
35        32.0   1024.0      4096.0      1155.247986   378.288001   404.960006
36        32.0   2048.0      1368.0       803.680003   467.456013   410.912007
37        32.0   2048.0      2048.0      1168.640018   417.584002   377.983987
38        32.0   2048.0      4096.0      2268.015981   747.135997   797.407985
39        32.0   4096.0      1368.0      1564.447999   921.216011   802.448004
40        32.0   4096.0      2048.0      2297.312021   823.808014   756.479979
41        32.0   4096.0      4096.0      4490.816116  1476.063967  1575.871944
42        64.0     64.0      1368.0        69.311999    32.800000    35.039999
43        64.0     64.0      2048.0        91.855999    35.039999    28.640000
44        64.0     64.0      4096.0       183.264002    58.304001    50.560001
45        64.0    128.0      1368.0       119.023997    59.904002    61.696000
46        64.0    128.0      2048.0       174.303994    63.584000    49.984001
47        64.0    128.0      4096.0       317.375988   104.064003   109.952003
48        64.0    256.0      1368.0       229.791999   127.680004   116.031997
49        64.0    256.0      2048.0       322.704002   114.239998    99.679999
50        64.0    256.0      4096.0       599.135995   195.360005   210.623994
51        64.0    512.0      1368.0       423.424006   240.927994   214.688003
52        64.0    512.0      2048.0       605.184019   214.975998   189.488001
53        64.0    512.0      4096.0      1154.495955   378.271997   405.936003
54        64.0   1024.0      1368.0       804.416001   467.296004   410.463989
55        64.0   1024.0      2048.0      1169.407964   417.376012   378.143996
56        64.0   1024.0      4096.0      2267.567992   745.664001   797.215998
57        64.0   2048.0      1368.0      1564.880013   921.184003   803.551972
58        64.0   2048.0      2048.0      2296.671987   824.576020   756.336004
59        64.0   2048.0      4096.0      4490.367889  1477.280021  1576.000035
60        64.0   4096.0      1368.0      3079.776049  1826.272011  1586.272001
61        64.0   4096.0      2048.0      4547.872066  1632.048011  1503.136039
62        64.0   4096.0      4096.0      8930.239677  2939.935923  3138.751984
63       128.0     64.0      1368.0       119.808003    59.935998    61.983999
64       128.0     64.0      2048.0       175.136000    63.648000    49.984001
65       128.0     64.0      4096.0       317.472011   104.032002   109.984003
66       128.0    128.0      1368.0       229.984000   127.680004   116.047997
67       128.0    128.0      2048.0       322.704002   114.207998   100.383997
68       128.0    128.0      4096.0       598.367989   195.360005   210.848004
69       128.0    256.0      1368.0       422.656000   240.672007   214.688003
70       128.0    256.0      2048.0       605.664015   215.136006   190.047994
71       128.0    256.0      4096.0      1154.495955   378.095999   405.984014
72       128.0    512.0      1368.0       804.048002   467.103988   410.111994
73       128.0    512.0      2048.0      1169.216037   417.311996   377.279997
74       128.0    512.0      4096.0      2267.488003   746.976018   797.248006
75       128.0   1024.0      1368.0      1565.472007   921.248019   803.248018
76       128.0   1024.0      2048.0      2296.815991   823.935986   755.807996
77       128.0   1024.0      4096.0      4489.888191  1476.960003  1575.984001
78       128.0   2048.0      1368.0      3080.512047  1826.416016  1584.064007
79       128.0   2048.0      2048.0      4548.096180  1632.528007  1503.167987
80       128.0   2048.0      4096.0      8930.784225  2940.479994  3138.720036
81       128.0   4096.0      1368.0      6115.839958  3640.095949  3151.936054
82       128.0   4096.0      2048.0      9051.023960  3252.608061  2991.535902
83       128.0   4096.0      4096.0     17893.791199  5863.535881  6265.535831

To be added

Modifications

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (Contributor, Bot) left a comment

Summary of Changes

Hello @hebiao064, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to enhance the per_token_quant_fp8 CUDA kernel by extending its compatibility to hidden dimensions divisible by 8, rather than being strictly limited to those divisible by 16. It introduces a new 8-element vectorized kernel path and dynamically selects the appropriate kernel based on the input tensor's dimensions. To ensure correctness and facilitate development, a pure PyTorch reference implementation has been added to the benchmarking suite, allowing for direct verification against the optimized CUDA kernels. The build system has also been temporarily streamlined to focus exclusively on this kernel, indicating an ongoing development effort.
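
As a rough illustration of the dispatch described above (Python pseudocode only; the actual selection lives in per_token_quant_fp8.cu, and its exact vector widths and fallback behavior may differ):

```python
def select_vectorization(hidden_dim: int) -> int:
    # Illustrative only: pick the widest vectorized path the hidden
    # dimension allows -- the original 16-element path when possible,
    # otherwise the 8-element path this PR introduces.
    if hidden_dim % 16 == 0:
        return 16
    if hidden_dim % 8 == 0:
        return 8
    # Anything else remains outside the kernel's contract.
    raise ValueError(f"Unsupported hidden_dim: {hidden_dim}")
```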

Highlights

  • Expanded per_token_quant_fp8 Support: The per_token_quant_fp8 CUDA kernel now supports input tensors where the hidden dimension is divisible by 8, not just 16. This is achieved by introducing a new 8-element vectorized kernel path and dynamically selecting the appropriate kernel based on the hidden_dim.
  • Enhanced Benchmarking and Verification: A pure PyTorch reference implementation for per-token FP8 quantization has been added to the benchmark script. This allows for direct correctness comparison and more comprehensive performance analysis against the VLLM and SGLang kernel implementations across various hidden_dim values.
  • Streamlined Development Environment: The build configuration (CMakeLists.txt) and Python bindings (common_extension.cc, __init__.py) have been temporarily stripped down to focus solely on the per_token_quant_fp8 kernel. This facilitates isolated development, debugging, and benchmarking of this specific feature.

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This PR adds support for hidden_dim values that are multiples of 8 in the per_token_quant_fp8 kernel. The changes include updating the CUDA kernel, modifying build files, and extending the benchmark. Critical issues related to commented-out code in build files and Python packages need to be resolved. A potential correctness issue in the benchmark reference implementation and opportunities to improve the CUDA code's portability and maintainability were also identified.

Review comment threads:
  • sgl-kernel/CMakeLists.txt (Outdated)
  • sgl-kernel/CMakeLists.txt (Outdated)
  • sgl-kernel/csrc/common_extension.cc
  • sgl-kernel/python/sgl_kernel/__init__.py (Outdated)
  • sgl-kernel/benchmark/bench_per_token_quant_fp8.py
  • sgl-kernel/csrc/gemm/per_token_quant_fp8.cu
  • sgl-kernel/csrc/gemm/per_token_quant_fp8.cu
@hebiao064 hebiao064 changed the title from "[Not Ready for Review] fix per token hidden 16" to "[Not Ready for Review] fix per token cuda kernel hidden dim cannot divide by 16" on Jul 29, 2025
@hebiao064 hebiao064 marked this pull request as ready for review July 29, 2025 21:47
@hebiao064 hebiao064 changed the title from "[Not Ready for Review] fix per token cuda kernel hidden dim cannot divide by 16" to "[Not Ready for merge] fix per token cuda kernel hidden dim cannot divide by 16" on Jul 29, 2025
@BBuf (Collaborator) commented Jul 30, 2025

Clean up the code and also add/modify the sgl-kernel test test_per_token_quant_fp8.py accordingly.
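
A hedged sketch of what such a test could look like; the sgl_per_token_quant_fp8 import, its return convention, and the tolerances below are assumptions for illustration, and the real test file may be structured differently:

```python
import pytest
import torch

# Assumed binding and return convention; the actual sgl_kernel API may instead
# take pre-allocated output/scale tensors.
from sgl_kernel import sgl_per_token_quant_fp8


def torch_ref(x: torch.Tensor):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    amax = x.float().abs().amax(dim=-1, keepdim=True).clamp(min=1e-10)
    scale = amax / fp8_max
    q = (x.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return q, scale.squeeze(-1)


@pytest.mark.parametrize("hidden_dim", [1368, 2048, 4096])  # 1368 % 16 != 0
def test_per_token_quant_fp8_hidden_dim(hidden_dim):
    x = torch.randn(32, hidden_dim, dtype=torch.float16, device="cuda")
    q_ref, s_ref = torch_ref(x)
    q_ker, s_ker = sgl_per_token_quant_fp8(x)  # assumed signature
    torch.testing.assert_close(s_ker.float(), s_ref.float(), rtol=1e-3, atol=1e-5)
    # Output differences of ~0.1 vs the torch reference were observed in the
    # benchmark above, so the output comparison uses a loose absolute tolerance.
    torch.testing.assert_close(q_ker.float(), q_ref.float(), rtol=0.0, atol=0.2)
```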

@gemini-code-assist (Contributor, Bot)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hebiao064 hebiao064 changed the title from "[Not Ready for merge] fix per token cuda kernel hidden dim cannot divide by 16" to "fix per token cuda kernel hidden dim cannot divide by 16" on Jul 31, 2025
@BBuf (Collaborator) left a comment

LGTM.

@hebiao064 (Collaborator, Author) commented:

Need to use USE_VLLM_CUTLASS_W8A8_FP8_KERNEL, since our cutlass fp8 kernel doesn't support this case and raises:
RuntimeError: mat_a must be multiple of 16 bytes for memory alignment
cc @BBuf

@hebiao064 hebiao064 added the ready-to-merge The PR is ready to merge after the CI is green. label Aug 1, 2025
@hebiao064 hebiao064 merged commit db7343c into main Aug 1, 2025
50 of 55 checks passed
@hebiao064 hebiao064 deleted the bhe/fix_per_token_hidden_16 branch August 1, 2025 16:27
@hebiao064 (Collaborator, Author) commented:

USE_VLLM_CUTLASS_W8A8_FP8_KERNEL

Fixed by #9093

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker' of git@code.alipay.com:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <zhangkaihong.zkh@antgroup.com>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete uselese code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch cause 250us slower (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add verify logit_bias to avoid crash because of IndexError  (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compability (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simply MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enchance green context stream creation robust with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." into the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgade router version to 0.1.6 (sgl-project#8209)
* Remve router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Uodate Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformer to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." into the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config form env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." into the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transfomers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS sgl-project#8598
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since its no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC]Update sgl-kernel README (sgl-project#8665)
* fix per token cuda kernel hidden dim cannot divide by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647) (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code clean up and and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)
