
fix radix cache match #7

Merged
merrymercy merged 1 commit into main from ls-fix on Jan 15, 2024

Conversation

@hnyls2002
Collaborator

No description provided.

merrymercy merged commit 01ca82d into main on Jan 15, 2024
merrymercy deleted the ls-fix branch on January 15, 2024 at 17:42
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 11, 2025
…TP size (sgl-project#7)

* support the case where num_attention_heads can't be divided evenly by tp_size

* refactor

* move CPU-specific logic to cpu_utils.py

* only set padded weights to zero
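
A minimal sketch of the padding approach described above, assuming the attention heads are padded up to the next multiple of the TP size and the extra rows are zero-filled; the function name and shapes here are illustrative, not the actual sglang implementation.

```python
import torch

def pad_heads_for_tp(weight: torch.Tensor, num_heads: int, head_dim: int, tp_size: int) -> torch.Tensor:
    """Round the head count up to a multiple of tp_size and zero-fill the padded head rows."""
    padded_heads = -(-num_heads // tp_size) * tp_size  # ceiling division
    if padded_heads == num_heads:
        return weight
    pad_rows = (padded_heads - num_heads) * head_dim
    pad = weight.new_zeros(pad_rows, weight.shape[1])  # only the padded weights are set to zero
    return torch.cat([weight, pad], dim=0)

# e.g. 14 heads with tp_size=8 are padded to 16; the 2 extra heads carry all-zero weights
w = torch.randn(14 * 64, 1024)
print(pad_heads_for_tp(w, num_heads=14, head_dim=64, tp_size=8).shape)  # torch.Size([1024, 1024])
```
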
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
yanbing-j pushed a commit to yanbing-j/sglang that referenced this pull request May 12, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 29, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Jun 3, 2025
nithinsubbiah pushed a commit to nithinsubbiah/sglang that referenced this pull request Nov 21, 2025
Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Add wave extend attention kernel

Signed-off-by: Harsh Menon <harsh@nod-labs.com>

[Wave] Adding logit_cap and layer scaling to API

Also add support for the wave backend to the model
runner, and use Triton decode kernels for now.

[Wave] Run chunked prefill for perf comparison on Wave test

Need to rename the non-chunked/regular prefill version because otherwise
rpd will treat it as the same kernel

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Cache the function that loads the wave kernel

Also maintain a global kernel hash to avoid
recomputing the hash on every call.
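
A minimal sketch of this caching idea; get_kernel and _compile below are illustrative stand-ins for the real wave kernel loader, and keying on (source, shapes) is an assumption for illustration.

```python
from functools import lru_cache
import hashlib

def _compile(source: str, shapes: tuple) -> str:
    # Stand-in for the real wave compile call; derives a stable id from the inputs.
    return "kernel-" + hashlib.sha256((source + repr(shapes)).encode()).hexdigest()[:8]

@lru_cache(maxsize=None)
def get_kernel(source: str, shapes: tuple) -> str:
    # lru_cache keys on (source, shapes), so the hash/compile work runs once per
    # unique configuration instead of on every forward call.
    return _compile(source, shapes)

k1 = get_kernel("extend_attention", (8192, 16, 64))
k2 = get_kernel("extend_attention", (8192, 16, 64))  # cache hit, no recompilation
assert k1 == k2
```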

[Wave] Don't specify block size and enable buffer ops

[Wave] Enable wave runtime and update scheduling API

[Wave] Update API to use wave_compile & WaveCompileOptions

[Wave] Update wave backend and extend attention to latest

[Wave] Add speculative decode kernel

Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>

cache kernels using lru_cache

Update WaveBackend to use Wave Decode (sgl-project#6)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Revert "Update WaveBackend to use Wave Decode  (sgl-project#6)" (sgl-project#7)

This reverts commit eac4599.

Wave Backend decode (sgl-project#8)

* align shapes

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Wave backend fixes (sgl-project#10)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

More fixes to Wave decode (sgl-project#12)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

is_causal

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Enable the grok in3 model (sgl-project#14)

Set unique cache dir for each worker (sgl-project#16)

update kernel (sgl-project#18)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

updated spec decode test as per wave

Signed-off-by: xintin <gaurav.verma@amd.com>

fix extend (sgl-project#23)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Refactor paged decode intermediate arrays shapes (sgl-project#24)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

remove dyn symbols (sgl-project#26)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

cleanup shapes (sgl-project#27)

Some fields were removed from `paged_decode_attention_shape`.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (sgl-project#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
apinge pushed a commit to apinge/sglang that referenced this pull request Nov 26, 2025
[FIX] fix fused shared expert in EP
yhyang201 pushed a commit that referenced this pull request Dec 13, 2025
triple-mu pushed a commit to triple-mu/sglang that referenced this pull request Jan 1, 2026
# This is the 1st commit message:

rebase

# This is the commit message #2:

remove duplicated code

# This is the commit message #3:

add type hints

# This is the commit message #4:

add clear cache for benchmark alignment

# This is the commit message #5:

remove unused arg

# This is the commit message #6:

clear cache once

# This is the commit message #7:

simplified VAE cache logic for qwenimage and wan

# This is the commit message #8:

remove duplicated code
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request Feb 12, 2026
alisonshao mentioned this pull request Mar 1, 2026
Estrella-xx pushed a commit to Estrella-xx/sglang that referenced this pull request Mar 10, 2026
* fix layernorm forward_npu for ascend with fsdp

* fix ascend sampling backend

* fsdp support ascend sampling backend

* fix RMSNorm for fsdp

* fix sampler for fsdp
0-693 added a commit to 0-693/sglang that referenced this pull request Mar 16, 2026
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request Mar 19, 2026
apinge pushed a commit to apinge/sglang that referenced this pull request Mar 31, 2026
* Add SGLang MI355X CI

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Rename original workflow as sglang_benchmark_workflow_mi350x.yaml

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Change workflow run order

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Revert "Change workflow run order"

This reverts commit 2342d91.

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Update workflow name and model directory for MI355x

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Fix MI355 workflow

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Fix echo message

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

---------

Signed-off-by: Xiake Sun <xiake.sun@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request Apr 3, 2026
MoE support along with related weight_loader fix
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
Updated the SGLang Mintlify documentation guide to include project-specific details, writing standards, and best practices for documentation.
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (cherry-picked from PR #22543 by ColinZ22):
- Add packed_modules_mapping to DeepseekV2ForCausalLM for Quark
  exclude-layer name resolution (gate_up_proj -> [gate_proj, up_proj])
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Verified: GSM8K accuracy 0.93+ on MI35x (run #7 passed)
https://github.com/sgl-project/sglang/actions/runs/24268460251

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
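
A minimal sketch of the kind of mapping the first engine fix describes; the class body and base class here are placeholders rather than the real sglang model definition.

```python
import torch.nn as nn

class DeepseekV2ForCausalLM(nn.Module):  # placeholder class body, for illustration only
    # Maps the fused projection name to the original (unfused) names so that Quark
    # exclude-layer patterns written against gate_proj/up_proj still resolve when
    # the model uses the fused gate_up_proj module.
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```
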
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request Apr 28, 2026
…router gemm prefill

Stack of 4 kernel-level optimizations on top of commit 811f975. Each was
microbench-validated before integration:

  Pro-Base c=8 OSL=1024 rrr=0.8 (chi2762, TP=8):
    Original baseline:   91.31 tok/s, TPOT 84.15 ms, TTFT 822 ms
    After 811f975:    104.57 tok/s, TPOT 73.41 ms, TTFT 825 ms
    After this commit: 108.44 tok/s, TPOT 70.88 ms, TTFT 700 ms
                       (+18.8% / -15.8% / -14.8% vs original)

Per-GPU at c=8: 6.51 → 13.56 tok/s/GPU vs prior published baseline = 2.08x.

1. apply_rotary_emb_triton multi-token-per-CTA (deepseek_v4_rope.py)
   - Original kernel grid was (M, n_heads, ceil(rope/2/BLOCK_SIZE)) — at prefill
     M=8192 / n_heads=16 produces 131,072 CTAs of 32 elements each (~256 bytes
     of work per CTA). Profile measured 9.8% efficiency vs HBM bound.
   - New kernel takes BLOCK_M=64 tokens per CTA. Grid (ceil(M/64), n_heads).
     At Pro prefill: 2,048 CTAs of 4096 elements each. Microbench:
     61.92 → 29.99 us (2.07x at M=8192). Neutral at decode (M<=64), 2.07x at
     prefill, +12% at intermediate M=4096.
   - Correctness validated cos_sim 1.000000 across BLOCK_M ∈ {16,32,64,128,256}.

2. hc_pre fused Triton kernel (deepseek_v4.py:_hc_pre_fused_kernel)
   - Eager Python `_hc_pre_torch_impl` runs 4 separate kernels:
     `x.flatten(1).float()` (bf16→fp32 copy) + `square().mean()` (mul + reduce) +
     `F.linear(x_flat, hc_fn)` (thin-N=4 hipBLASLt) + `* rsqrt` (broadcast mul).
   - Triton fused kernel does it in a single pass: K-loop loads x once, writes
     bf16→fp32 cast to x_flat_out, accumulates sum_sq AND hc_fn @ x_flat
     simultaneously, then applies rsqrt at the end.
   - Microbench at M=8192: 316.33 → 111.70 us (2.83x). cos_sim 1.000000.
   - Shape-guarded (HC_MULT * HC_DIM == HIDDEN, hc_fn.shape == (HC_MULT, HIDDEN)).
     Falls back to torch impl otherwise.

3. hc_post fused Triton kernel (deepseek_v4.py:_hc_post_fused_kernel)
   - Eager Python `_hc_post_torch_impl` materializes a
     `(M, HC_MULT, HC_MULT, HIDDEN)` fp32 intermediate before sum — that's
     3.75 GB at Pro M=8192, allocated and freed per call.
   - Triton fused kernel keeps the per-(m, d) accumulator in registers; for
     each output column hc_out, computes `post[:, hc_out] * x` plus a sum
     over hc_in of `comb[m, hc_in, hc_out] * residual[m, hc_in, :]`.
   - **Microbench at M=8192: 5444 → 236 us (23.02x). 3.6% → 82.4% HBM eff.**
   - Correctness cos_sim 1.000001 max_diff 0.0625 (within bf16 noise).
   - This is the dominant lift in this commit; e2e TTFT savings of -125 ms at
     c=8 maps to ~5 sec of prefill saved across the 80-prompt bench.

4. Router gemm prefill config override (rocm_linear_utils.py)
   - At PREFILL (M=8192 > 256), aiter_dsv3_router_gemm goes through the
     non-atomic gemm_a16w16. aiter's default config (BLOCK_M=256, BLOCK_N=256)
     over-tiles N=384; microbench-tuned override (BLOCK_M=128, BLOCK_N=128,
     GROUP_M=4) for the (M=8192, N=384, K=7168) shape: 171.79 → 71.37 us (2.41x).
   - Shape-guarded N==384 K==7168 so it doesn't touch other call sites.
   - Complements the existing decode-side router gemm override (committed in
     811f975) — together they cover both decode and prefill router calls.
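
A minimal sketch of the launch-grid change from item 1 above, using the token and head counts quoted in this commit message; rope_dim=64 and BLOCK_SIZE=32 are assumptions chosen to match the per-CTA element counts given there.

```python
import math

M, n_heads, rope_dim, BLOCK_SIZE, BLOCK_M = 8192, 16, 64, 32, 64  # Pro prefill shapes

# Original launch: one CTA per (token, head, rope-chunk) triple.
old_grid = (M, n_heads, math.ceil((rope_dim // 2) / BLOCK_SIZE))
# New launch: BLOCK_M tokens per CTA, one CTA row per head.
new_grid = (math.ceil(M / BLOCK_M), n_heads)

print(math.prod(old_grid))  # 131072 CTAs, ~32 elements of work each
print(math.prod(new_grid))  # 2048 CTAs, ~4096 elements each
```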

Per-module efficiency at prefill (M=8192) — all four DSv4 model variants share
the same worst-efficiency op (`hc_post` at 3.6%) because the eager Python op
materializes a giant intermediate. Fixing it ships to all variants since the
fused kernel is architecture-agnostic. See dsv4-4variant-scan.md for the full
matrix and dsv4-module-efficiency.md for per-op roofline analysis.

Untouched modules (already near roofline or fix is multi-day): rmsnorm 63%,
qkv_lora_a 65%, attn_proj_a8w8 42%. Top remaining open targets: the 19 long
GPU-idle gaps from dsv4-bottleneck-systematic.md (likely scheduler-level)
and the per-layer elementwise tail fusion (item sgl-project#7 in the optimization
plan, multi-week work).
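
A minimal PyTorch sketch of the hc_post reformulation from item 3 above; tensor names and sizes are illustrative, and the shipped version is a Triton fusion rather than an einsum.

```python
import torch

M, HC, HIDDEN = 64, 4, 512  # small illustrative sizes
x = torch.randn(M, HIDDEN)
post = torch.randn(M, HC)
comb = torch.randn(M, HC, HC)          # comb[m, hc_in, hc_out]
residual = torch.randn(M, HC, HIDDEN)  # residual[m, hc_in, :]

# Eager-style: broadcast to an (M, HC, HC, HIDDEN) intermediate, then reduce over hc_in.
eager = post.unsqueeze(-1) * x.unsqueeze(1) + (comb.unsqueeze(-1) * residual.unsqueeze(2)).sum(dim=1)

# Fused-style: keep the reduction over hc_in implicit, never materializing the intermediate.
fused = post.unsqueeze(-1) * x.unsqueeze(1) + torch.einsum("mio,mid->mod", comb, residual)

assert torch.allclose(eager, fused, atol=1e-5)
```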

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>