fix radix cache match #7
Merged
merrymercy merged 1 commit into main on Jan 15, 2024
Conversation
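The PR itself carries no description, so as context for the title: radix cache matching is how sglang finds the longest already-cached token prefix of an incoming request so its KV cache can be reused. Below is a minimal sketch of that lookup; the node layout and function names are illustrative, not sglang's actual RadixCache API, and the diff itself is not shown in this thread.

```python
from typing import Dict, List

class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}  # first token id of an edge -> child
        self.key: List[int] = []               # token ids stored along the edge

def match_prefix(root: Node, tokens: List[int]) -> List[int]:
    """Return the longest prefix of `tokens` present in the radix tree."""
    matched: List[int] = []
    node = root
    while tokens:
        child = node.children.get(tokens[0])
        if child is None:
            break
        # Walk the edge token by token and stop at the first mismatch.
        i = 0
        while i < len(child.key) and i < len(tokens) and child.key[i] == tokens[i]:
            i += 1
        matched.extend(tokens[:i])
        if i < len(child.key):
            break  # partial edge match: cannot descend any further
        tokens = tokens[i:]
        node = child
    return matched
```

The partial-edge case (the inner `break`) is the subtle part of walks like this; the returned prefix length decides how much of the KV cache a request can skip recomputing.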
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025.
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on Mar 11, 2025:

…TP size (sgl-project#7)
* support the case where num_attention_heads can't be divided evenly by tp_size
* refactor
* move cpu specific logic to cpu_utils.py
* only set padded weights to zero
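The bullets above describe rounding the attention-head count up so it divides evenly across tensor-parallel ranks, with the pad region zeroed. A minimal sketch of that idea; `pad_heads_for_tp` is a hypothetical helper name, and per the message the real logic lives in cpu_utils.py:

```python
import torch

def pad_heads_for_tp(weight: torch.Tensor, num_heads: int, head_dim: int,
                     tp_size: int) -> torch.Tensor:
    """Zero-pad a (num_heads * head_dim, in_features) projection weight so
    the head count becomes a multiple of tp_size."""
    if num_heads % tp_size == 0:
        return weight  # already divides evenly
    padded_heads = -(-num_heads // tp_size) * tp_size  # ceil to a multiple
    padded = weight.new_zeros(padded_heads * head_dim, weight.shape[1])
    padded[: num_heads * head_dim] = weight  # the extra rows stay zero
    return padded
```

Zeroing (rather than leaving the pad uninitialized) matches the last bullet: padded heads then contribute nothing to the attention output on the rank that owns them.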
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on Mar 14, 2025: …TP size (sgl-project#7) (same change as above).
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on Mar 14, 2025: …TP size (sgl-project#7) (same change as above).
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on Mar 14, 2025: …TP size (sgl-project#7) (same change as above).
This was referenced Apr 16, 2025
yanbing-j pushed a commit to yanbing-j/sglang that referenced this pull request on May 12, 2025: …TP size (sgl-project#7) (same change as above).
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on May 28, 2025: …TP size (sgl-project#7) (same change as above).
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on May 29, 2025: …TP size (sgl-project#7) (same change as above).
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request on Jun 3, 2025: …TP size (sgl-project#7) (same change as above).
nithinsubbiah pushed a commit to nithinsubbiah/sglang that referenced this pull request on Nov 21, 2025. The squashed log:

* Signed-off-by: Stanley Winata <stanley.winata@amd.com>
* [Wave] Add wave extend attention kernel (Signed-off-by: Harsh Menon <harsh@nod-labs.com>)
* [Wave] Adding logit_cap and layer scaling to API. Also add support for the wave backend to the model runner, and use Triton decode kernels for now.
* [Wave] Run chunked prefill for perf comparison on Wave test. Need to rename the non-chunked/regular prefill version because otherwise rpd will treat it as the same kernel. (Signed-off-by: Stanley Winata <stanley.winata@amd.com>)
* [Wave] Cache the function that loads the wave kernel. Also maintain a global kernel hash to avoid recomputing the hash on every call.
* [Wave] Don't specify block size and enable buffer ops
* [Wave] Enable wave runtime and update scheduling API
* [Wave] Update API to use wave_compile & WaveCompileOptions
* [Wave] Update wave backend and extend attention to latest
* [Wave] Add speculative decode kernel (Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>)
* cache kernels using lru_cache
* Update WaveBackend to use Wave Decode (sgl-project#6) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* Revert "Update WaveBackend to use Wave Decode (sgl-project#6)" (sgl-project#7). This reverts commit eac4599.
* Wave Backend decode (sgl-project#8): align shapes; fix (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* Wave backend fixes (sgl-project#10) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* More fixes to Wave decode (sgl-project#12) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* is_causal (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* Enable the grok in3 model (sgl-project#14)
* Set unique cache dir for each worker (sgl-project#16)
* update kernel (sgl-project#18) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* updated spec decode test as per wave (Signed-off-by: xintin <gaurav.verma@amd.com>)
* fix extend (sgl-project#23) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* Refactor paged decode intermediate arrays shapes (sgl-project#24) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* remove dyn symbols (sgl-project#26) (Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>)
* cleanup shapes (sgl-project#27). Some fields were removed from `paged_decode_attention_shape`.
* Remove `mha` param from Wave decode attention kernel (sgl-project#28). Depends on iree-org/iree-turbine#1039 (Signed-off-by: Paul Zhang <paul.zhang@amd.com>)
* nfc: fix problems reported by linting
* update references from iree.turbine to wave_lang
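Two entries in this log ("Cache the function that loads the wave kernel" and "cache kernels using lru_cache") describe memoizing kernel compilation per specialization. A minimal, runnable sketch of the pattern; `build_kernel_source` and the returned handle are stand-ins, not Wave's real API:

```python
import functools
import hashlib

def build_kernel_source(shape_key: tuple, is_causal: bool) -> str:
    # Stand-in for Wave's kernel construction; the real code emits IR here.
    return f"kernel<{shape_key}, causal={is_causal}>"

@functools.lru_cache(maxsize=None)
def get_compiled_kernel(shape_key: tuple, is_causal: bool) -> str:
    """Return a compiled-kernel handle, compiling once per specialization.

    functools.lru_cache keys on the arguments, so the expensive compile and
    the content hash run only on a cache miss; later calls with the same
    (shape_key, is_causal) hit the cache, matching the log's note about not
    recomputing the kernel hash on every call.
    """
    source = build_kernel_source(shape_key, is_causal)
    digest = hashlib.sha256(source.encode()).hexdigest()  # stable content hash
    return f"compiled:{digest[:12]}"  # stand-in for the real compile step

# Same specialization -> one compile, one hash:
k1 = get_compiled_kernel((8192, 16, 64), True)
k2 = get_compiled_kernel((8192, 16, 64), True)
assert k1 is k2  # served from the cache
```

The "Set unique cache dir for each worker" entry addresses the on-disk side of the same problem: per-worker directories keep concurrent workers from clobbering each other's compiled artifacts.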
apinge pushed a commit to apinge/sglang that referenced this pull request on Nov 26, 2025: [FIX] fix fuse share expert in EP.
yhyang201 pushed a commit that referenced this pull request on Dec 13, 2025.
triple-mu pushed a commit to triple-mu/sglang that referenced this pull request on Jan 1, 2026. The squashed log:

# This is the 1st commit message: rebase
# This is the commit message sgl-project#2: remove duplicated code
# This is the commit message sgl-project#3: add type hints
# This is the commit message sgl-project#4: add clear cache for benchmark alignment
# This is the commit message sgl-project#5: remove unused arg
# This is the commit message sgl-project#6: clear cache once
# This is the commit message sgl-project#7: simplified VAE cache logic for qwenimage and wan
# This is the commit message sgl-project#8: remove duplicated code
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request on Feb 12, 2026: …hunk (Support graph chunk).
Estrella-xx pushed a commit to Estrella-xx/sglang that referenced this pull request on Mar 10, 2026:

* fix layernorm forward_npu for ascend with fsdp
* fix ascend sampling backend
* fsdp support ascend sampling backend
* fix RMSNorm for fsdp
* fix sampler for fsdp
0-693 added a commit to 0-693/sglang that referenced this pull request on Mar 16, 2026.
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request on Mar 19, 2026.
apinge pushed a commit to apinge/sglang that referenced this pull request on Mar 31, 2026:

* Add SGLang MI355X CI
* Rename original workflow as sglang_benchmark_workflow_mi350x.yaml
* Change workflow run order
* Revert "Change workflow run order" (this reverts commit 2342d91)
* Update workflow name and model directory for MI355x
* Fix MI355 workflow
* Fix echo message

Signed-off-by: Xiake Sun <xiake.sun@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request on Apr 3, 2026: MoE support along with related weight_loader fix.
wisclmy0611 pushed a commit that referenced this pull request on Apr 7, 2026: Updated the SGLang Mintlify documentation guide to include project-specific details, writing standards, and best practices for documentation.
michaelzhang-ai added a commit that referenced this pull request on Apr 11, 2026:

Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch) benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (cherry-picked from PR #22543 by ColinZ22):
- Add packed_modules_mapping to DeepseekV2ForCausalLM for Quark exclude-layer name resolution (gate_up_proj -> [gate_proj, up_proj])
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and nightly-test-amd-rocm720.yml

Verified: GSM8K accuracy 0.93+ on MI35x (run #7 passed)
https://github.com/sgl-project/sglang/actions/runs/24268460251

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
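The first engine fix adds a mapping so Quark's exclude-layer rules, which name the checkpoint's unpacked modules, can be resolved against the runtime's fused module. A minimal sketch of what that attribute looks like; the class body is abbreviated, and only the attribute name and mapping come from the message:

```python
import torch.nn as nn

class DeepseekV2ForCausalLM(nn.Module):
    # Quark exclude-layer rules name gate_proj and up_proj separately, while
    # the runtime fuses them into gate_up_proj; this mapping lets the
    # quantizer translate the fused name back to the unpacked checkpoint ones.
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```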
michaelzhang-ai added a commit that referenced this pull request on Apr 11, 2026 (same GLM-5-MXFP4 change as above).
michaelzhang-ai added a commit that referenced this pull request on Apr 11, 2026 (same GLM-5-MXFP4 change as above).
michaelzhang-ai added a commit that referenced this pull request on Apr 13, 2026 (same GLM-5-MXFP4 change as above).
michaelzhang-ai added a commit that referenced this pull request on Apr 13, 2026 (same GLM-5-MXFP4 change as above).
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request on Apr 28, 2026:

…router gemm prefill

Stack of 4 kernel-level optimizations on top of commit 811f975. Each was microbench-validated before integration.

Pro-Base c=8 OSL=1024 rrr=0.8 (chi2762, TP=8):
* Original baseline: 91.31 tok/s, TPOT 84.15 ms, TTFT 822 ms
* After 811f975: 104.57 tok/s, TPOT 73.41 ms, TTFT 825 ms
* After this commit: 108.44 tok/s, TPOT 70.88 ms, TTFT 700 ms (+18.8% / -15.8% / -14.8% vs original)

Per-GPU at c=8: 6.51 → 13.56 tok/s/GPU vs the prior published baseline = 2.08x.

1. apply_rotary_emb_triton multi-token-per-CTA (deepseek_v4_rope.py)
   - The original kernel grid was (M, n_heads, ceil(rope/2/BLOCK_SIZE)); at prefill M=8192 with n_heads=16 this produces 131,072 CTAs of 32 elements each (~256 bytes of work per CTA). Profiling measured 9.8% efficiency vs the HBM bound.
   - The new kernel takes BLOCK_M=64 tokens per CTA, with grid (ceil(M/64), n_heads). At Pro prefill: 2,048 CTAs of 4096 elements each. Microbench: 61.92 → 29.99 us (2.07x at M=8192). Neutral at decode (M<=64), 2.07x at prefill, +12% at intermediate M=4096.
   - Correctness validated: cos_sim 1.000000 across BLOCK_M ∈ {16, 32, 64, 128, 256}.

2. hc_pre fused Triton kernel (deepseek_v4.py:_hc_pre_fused_kernel)
   - The eager Python `_hc_pre_torch_impl` runs 4 separate kernels: `x.flatten(1).float()` (bf16→fp32 copy) + `square().mean()` (mul + reduce) + `F.linear(x_flat, hc_fn)` (thin-N=4 hipBLASLt) + `* rsqrt` (broadcast mul).
   - The fused Triton kernel does it in a single pass: the K-loop loads x once, writes the bf16→fp32 cast to x_flat_out, accumulates sum_sq and hc_fn @ x_flat simultaneously, then applies rsqrt at the end.
   - Microbench at M=8192: 316.33 → 111.70 us (2.83x). cos_sim 1.000000.
   - Shape-guarded (HC_MULT * HC_DIM == HIDDEN, hc_fn.shape == (HC_MULT, HIDDEN)); falls back to the torch impl otherwise.

3. hc_post fused Triton kernel (deepseek_v4.py:_hc_post_fused_kernel)
   - The eager Python `_hc_post_torch_impl` materializes a (M, HC_MULT, HC_MULT, HIDDEN) fp32 intermediate before the sum; that's 3.75 GB at Pro M=8192, allocated and freed per call.
   - The fused Triton kernel keeps the per-(m, d) accumulator in registers; for each output column hc_out it computes `post[:, hc_out] * x` plus a sum over hc_in of `comb[m, hc_in, hc_out] * residual[m, hc_in, :]`.
   - **Microbench at M=8192: 5444 → 236 us (23.02x). 3.6% → 82.4% HBM efficiency.**
   - Correctness: cos_sim 1.000001, max_diff 0.0625 (within bf16 noise).
   - This is the dominant lift in this commit; the e2e TTFT saving of -125 ms at c=8 maps to ~5 s of prefill saved across the 80-prompt bench.

4. Router gemm prefill config override (rocm_linear_utils.py)
   - At prefill (M=8192 > 256), aiter_dsv3_router_gemm goes through the non-atomic gemm_a16w16. aiter's default config (BLOCK_M=256, BLOCK_N=256) over-tiles N=384; a microbench-tuned override (BLOCK_M=128, BLOCK_N=128, GROUP_M=4) for the (M=8192, N=384, K=7168) shape: 171.79 → 71.37 us (2.41x).
   - Shape-guarded to N==384, K==7168 so it doesn't touch other call sites.
   - Complements the existing decode-side router gemm override (committed in 811f975); together they cover both the decode and prefill router calls.

Per-module efficiency at prefill (M=8192): all four DSv4 model variants share the same worst-efficiency op (`hc_post` at 3.6%) because the eager Python op materializes a giant intermediate. Fixing it ships to all variants, since the fused kernel is architecture-agnostic. See dsv4-4variant-scan.md for the full matrix and dsv4-module-efficiency.md for per-op roofline analysis. Untouched modules (already near roofline, or the fix is multi-day): rmsnorm 63%, qkv_lora_a 65%, attn_proj_a8w8 42%.

Top remaining open targets: the 19 long GPU-idle gaps from dsv4-bottleneck-systematic.md (likely scheduler-level) and the per-layer elementwise tail fusion (item sgl-project#7 in the optimization plan, multi-week work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
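Item 1's win comes purely from the launch geometry. A minimal sketch of the two grids, using the shapes quoted in the message (rope_dim=64 is inferred from the 32-elements-per-CTA figure; names are illustrative and the kernel bodies are omitted):

```python
import triton

def rotary_grids(M: int = 8192, n_heads: int = 16, rope_dim: int = 64,
                 BLOCK_SIZE: int = 32, BLOCK_M: int = 64):
    # Old: one CTA per (token, head, chunk); at M=8192, n_heads=16 this is
    # 8192 * 16 * 1 = 131,072 CTAs of ~32 elements each.
    grid_old = (M, n_heads, triton.cdiv(rope_dim // 2, BLOCK_SIZE))
    # New: each CTA covers BLOCK_M=64 tokens of one head, so
    # ceil(8192 / 64) * 16 = 2,048 CTAs with 64x more work apiece.
    grid_new = (triton.cdiv(M, BLOCK_M), n_heads)
    return grid_old, grid_new
```

Fewer, fatter CTAs amortize launch and scheduling overhead, which is why the change is a 2.07x win at prefill yet neutral at decode, where M is already small.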