
fix radix cache match #7

Merged
merrymercy merged 1 commit into main from ls-fix on Jan 15, 2024

Conversation

@hnyls2002
Collaborator

No description provided.

merrymercy merged commit 01ca82d into main on Jan 15, 2024
merrymercy deleted the ls-fix branch on January 15, 2024 at 17:42
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 11, 2025
…TP size (sgl-project#7)

* support the case where num_attention_heads can't be divided evenly by tp_size

* refactor

* move CPU-specific logic to cpu_utils.py

* only set padded weights to zero
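
A minimal sketch of the padding approach described above, assuming the attention heads are padded up to the next multiple of the TP size and the extra rows are zero-filled; the function name and shapes here are illustrative, not the actual sglang implementation.

```python
import torch

def pad_heads_for_tp(weight: torch.Tensor, num_heads: int, head_dim: int, tp_size: int) -> torch.Tensor:
    """Round the head count up to a multiple of tp_size and zero-fill the padded head rows."""
    padded_heads = -(-num_heads // tp_size) * tp_size  # ceiling division
    if padded_heads == num_heads:
        return weight
    pad_rows = (padded_heads - num_heads) * head_dim
    pad = weight.new_zeros(pad_rows, weight.shape[1])  # only the padded weights are set to zero
    return torch.cat([weight, pad], dim=0)

# e.g. 14 heads with tp_size=8 are padded to 16; the 2 extra heads carry all-zero weights
w = torch.randn(14 * 64, 1024)
print(pad_heads_for_tp(w, num_heads=14, head_dim=64, tp_size=8).shape)  # torch.Size([1024, 1024])
```
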
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Mar 14, 2025
yanbing-j pushed a commit to yanbing-j/sglang that referenced this pull request May 12, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request May 29, 2025
chunyuan-w added a commit to chunyuan-w/sglang that referenced this pull request Jun 3, 2025
nithinsubbiah pushed a commit to nithinsubbiah/sglang that referenced this pull request Nov 21, 2025
Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Add wave extend attention kernel

Signed-off-by: Harsh Menon <harsh@nod-labs.com>

[Wave] Adding logit_cap and layer scaling to API

Also add support for the wave backend to the model
runner, and use Triton decode kernels for now.

[Wave] Run chunked prefill for perf comparison on Wave test

Need to rename the non-chunked/regular prefill version because otherwise
rpd will treat it as the same kernel

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Cache the function that loads the wave kernel

Also maintain a global kernel hash to avoid
recomputing the hash on every call.
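
A minimal sketch of this caching idea; get_kernel and _compile below are illustrative stand-ins for the real wave kernel loader, and keying on (source, shapes) is an assumption for illustration.

```python
from functools import lru_cache
import hashlib

def _compile(source: str, shapes: tuple) -> str:
    # Stand-in for the real wave compile call; derives a stable id from the inputs.
    return "kernel-" + hashlib.sha256((source + repr(shapes)).encode()).hexdigest()[:8]

@lru_cache(maxsize=None)
def get_kernel(source: str, shapes: tuple) -> str:
    # lru_cache keys on (source, shapes), so the hash/compile work runs once per
    # unique configuration instead of on every forward call.
    return _compile(source, shapes)

k1 = get_kernel("extend_attention", (8192, 16, 64))
k2 = get_kernel("extend_attention", (8192, 16, 64))  # cache hit, no recompilation
assert k1 == k2
```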

[Wave] Don't specify block size and enable buffer ops

[Wave] Enable wave runtime and update scheduling API

[Wave] Update API to use wave_compile & WaveCompileOptions

[Wave] Update wave backend and extend attention to latest

[Wave] Add speculative decode kernel

Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>

cache kernels using lru_cache

Update WaveBackend to use Wave Decode (sgl-project#6)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Revert "Update WaveBackend to use Wave Decode  (sgl-project#6)" (sgl-project#7)

This reverts commit eac4599.

Wave Backend decode (sgl-project#8)

* align shapes

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Wave backend fixes (sgl-project#10)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

More fixes to Wave decode (sgl-project#12)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

is_causal

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Enable the grok in3 model (sgl-project#14)

Set unique cache dir for each worker (sgl-project#16)

update kernel (sgl-project#18)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

updated spec decode test as per wave

Signed-off-by: xintin <gaurav.verma@amd.com>

fix extend (sgl-project#23)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Refactor paged decode intermediate arrays shapes (sgl-project#24)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

remove dyn symbols (sgl-project#26)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

cleanup shapes (sgl-project#27)

Some fields were removed from `paged_decode_attention_shape`.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (sgl-project#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
apinge pushed a commit to apinge/sglang that referenced this pull request Nov 26, 2025
[FIX] fix fused shared expert in EP
yhyang201 pushed a commit that referenced this pull request Dec 13, 2025
triple-mu pushed a commit to triple-mu/sglang that referenced this pull request Jan 1, 2026
# This is the 1st commit message:

rebase

# This is the commit message #2:

remove duplicated code

# This is the commit message #3:

add type hints

# This is the commit message #4:

add clear cache for benchmark alignment

# This is the commit message #5:

remove unused arg

# This is the commit message #6:

clear cache once

# This is the commit message #7:

simplified VAE cache logic for qwenimage and wan

# This is the commit message #8:

remove duplicated code
tpoisonooo pushed a commit to tpoisonooo/sglang that referenced this pull request Feb 12, 2026
alisonshao mentioned this pull request Mar 1, 2026
Estrella-xx pushed a commit to Estrella-xx/sglang that referenced this pull request Mar 10, 2026
* fix layernorm forward_npu for ascend with fsdp

* fix ascend sampling backend

* fsdp support ascend sampling backend

* fix RMSNorm for fsdp

* fix sampler for fsdp
0-693 added a commit to 0-693/sglang that referenced this pull request Mar 16, 2026
lawrence-harmonic added a commit to lawrence-harmonic/sglang that referenced this pull request Mar 19, 2026
apinge pushed a commit to apinge/sglang that referenced this pull request Mar 31, 2026
* Add SGLang MI355X CI

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Rename original workflow as sglang_benchmark_workflow_mi350x.yaml

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Change workflow run order

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Revert "Change workflow run order"

This reverts commit 2342d91.

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Update workflow name and model directory for MI355x

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Fix MI355 workflow

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

* Fix echo message

Signed-off-by: Xiake Sun <xiake.sun@amd.com>

---------

Signed-off-by: Xiake Sun <xiake.sun@amd.com>
mmangkad pushed a commit to mmangkad-dev/sglang that referenced this pull request Apr 3, 2026
MoE support along with related weight_loader fix
wisclmy0611 pushed a commit that referenced this pull request Apr 7, 2026
Updated the SGLang Mintlify documentation guide to include project-specific details, writing standards, and best practices for documentation.
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
Add nightly CI tests for amd/GLM-5-MXFP4 (Quark MXFP4 quantized) on
MI35x GPUs with accuracy (GSM8K) and performance (bench_one_batch)
benchmarks, plus engine fixes to enable Quark MXFP4 on GlmMoeDsaForCausalLM.

Engine fixes (cherry-picked from PR #22543 by ColinZ22):
- Add packed_modules_mapping to DeepseekV2ForCausalLM for Quark
  exclude-layer name resolution (gate_up_proj -> [gate_proj, up_proj])
- Guard quark_post_load_weights to only run on DeepseekV3ForCausalLM

Test files:
- test/registered/amd/accuracy/mi35x/test_glm5_mxfp4_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_glm5_mxfp4_perf_mi35x.py

Workflow: combined accuracy+perf jobs in nightly-test-amd.yml and
nightly-test-amd-rocm720.yml

Verified: GSM8K accuracy 0.93+ on MI35x (run #7 passed)
https://github.com/sgl-project/sglang/actions/runs/24268460251

Co-authored-by: ColinZ22 <ColinZ22@users.noreply.github.com>
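
A minimal sketch of the kind of mapping the first engine fix describes; the class body and base class here are placeholders rather than the real sglang model definition.

```python
import torch.nn as nn

class DeepseekV2ForCausalLM(nn.Module):  # placeholder class body, for illustration only
    # Maps the fused projection name to the original (unfused) names so that Quark
    # exclude-layer patterns written against gate_proj/up_proj still resolve when
    # the model uses the fused gate_up_proj module.
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```
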
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 11, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
michaelzhang-ai added a commit that referenced this pull request Apr 13, 2026
JohnQinAMD added a commit to JohnQinAMD/sglang-amd that referenced this pull request Apr 28, 2026
…router gemm prefill

Stack of 4 kernel-level optimizations on top of commit 811f975. Each was
microbench-validated before integration:

  Pro-Base c=8 OSL=1024 rrr=0.8 (chi2762, TP=8):
    Original baseline:   91.31 tok/s, TPOT 84.15 ms, TTFT 822 ms
    After 811f975:    104.57 tok/s, TPOT 73.41 ms, TTFT 825 ms
    After this commit: 108.44 tok/s, TPOT 70.88 ms, TTFT 700 ms
                       (+18.8% / -15.8% / -14.8% vs original)

Per-GPU at c=8: 6.51 → 13.56 tok/s/GPU vs prior published baseline = 2.08x.

1. apply_rotary_emb_triton multi-token-per-CTA (deepseek_v4_rope.py)
   - Original kernel grid was (M, n_heads, ceil(rope/2/BLOCK_SIZE)) — at prefill
     M=8192 / n_heads=16 produces 131,072 CTAs of 32 elements each (~256 bytes
     of work per CTA). Profile measured 9.8% efficiency vs HBM bound.
   - New kernel takes BLOCK_M=64 tokens per CTA. Grid (ceil(M/64), n_heads).
     At Pro prefill: 2,048 CTAs of 4096 elements each. Microbench:
     61.92 → 29.99 us (2.07x at M=8192). Neutral at decode (M<=64), 2.07x at
     prefill, +12% at intermediate M=4096.
   - Correctness validated cos_sim 1.000000 across BLOCK_M ∈ {16,32,64,128,256}.

2. hc_pre fused Triton kernel (deepseek_v4.py:_hc_pre_fused_kernel)
   - Eager Python `_hc_pre_torch_impl` runs 4 separate kernels:
     `x.flatten(1).float()` (bf16→fp32 copy) + `square().mean()` (mul + reduce) +
     `F.linear(x_flat, hc_fn)` (thin-N=4 hipBLASLt) + `* rsqrt` (broadcast mul).
   - Triton fused kernel does it in a single pass: K-loop loads x once, writes
     bf16→fp32 cast to x_flat_out, accumulates sum_sq AND hc_fn @ x_flat
     simultaneously, then applies rsqrt at the end.
   - Microbench at M=8192: 316.33 → 111.70 us (2.83x). cos_sim 1.000000.
   - Shape-guarded (HC_MULT * HC_DIM == HIDDEN, hc_fn.shape == (HC_MULT, HIDDEN)).
     Falls back to torch impl otherwise.

3. hc_post fused Triton kernel (deepseek_v4.py:_hc_post_fused_kernel)
   - Eager Python `_hc_post_torch_impl` materializes a
     `(M, HC_MULT, HC_MULT, HIDDEN)` fp32 intermediate before sum — that's
     3.75 GB at Pro M=8192, allocated and freed per call.
   - Triton fused kernel keeps the per-(m, d) accumulator in registers; for
     each output column hc_out, computes `post[:, hc_out] * x` plus a sum
     over hc_in of `comb[m, hc_in, hc_out] * residual[m, hc_in, :]`.
   - **Microbench at M=8192: 5444 → 236 us (23.02x). 3.6% → 82.4% HBM eff.**
   - Correctness cos_sim 1.000001 max_diff 0.0625 (within bf16 noise).
   - This is the dominant lift in this commit; e2e TTFT savings of -125 ms at
     c=8 maps to ~5 sec of prefill saved across the 80-prompt bench.

4. Router gemm prefill config override (rocm_linear_utils.py)
   - At PREFILL (M=8192 > 256), aiter_dsv3_router_gemm goes through the
     non-atomic gemm_a16w16. aiter's default config (BLOCK_M=256, BLOCK_N=256)
     over-tiles N=384; microbench-tuned override (BLOCK_M=128, BLOCK_N=128,
     GROUP_M=4) for the (M=8192, N=384, K=7168) shape: 171.79 → 71.37 us (2.41x).
   - Shape-guarded N==384 K==7168 so it doesn't touch other call sites.
   - Complements the existing decode-side router gemm override (committed in
     811f975) — together they cover both decode and prefill router calls.
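
A minimal sketch of the launch-grid change from item 1 above, using the token and head counts quoted in this commit message; rope_dim=64 and BLOCK_SIZE=32 are assumptions chosen to match the per-CTA element counts given there.

```python
import math

M, n_heads, rope_dim, BLOCK_SIZE, BLOCK_M = 8192, 16, 64, 32, 64  # Pro prefill shapes

# Original launch: one CTA per (token, head, rope-chunk) triple.
old_grid = (M, n_heads, math.ceil((rope_dim // 2) / BLOCK_SIZE))
# New launch: BLOCK_M tokens per CTA, one CTA row per head.
new_grid = (math.ceil(M / BLOCK_M), n_heads)

print(math.prod(old_grid))  # 131072 CTAs, ~32 elements of work each
print(math.prod(new_grid))  # 2048 CTAs, ~4096 elements each
```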

Per-module efficiency at prefill (M=8192) — all four DSv4 model variants share
the same worst-efficiency op (`hc_post` at 3.6%) because the eager Python op
materializes a giant intermediate. Fixing it ships to all variants since the
fused kernel is architecture-agnostic. See dsv4-4variant-scan.md for the full
matrix and dsv4-module-efficiency.md for per-op roofline analysis.

Untouched modules (already near roofline or fix is multi-day): rmsnorm 63%,
qkv_lora_a 65%, attn_proj_a8w8 42%. Top remaining open targets: the 19 long
GPU-idle gaps from dsv4-bottleneck-systematic.md (likely scheduler-level)
and the per-layer elementwise tail fusion (item sgl-project#7 in the optimization
plan, multi-week work).
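
A minimal PyTorch sketch of the hc_post reformulation from item 3 above; tensor names and sizes are illustrative, and the shipped version is a Triton fusion rather than an einsum.

```python
import torch

M, HC, HIDDEN = 64, 4, 512  # small illustrative sizes
x = torch.randn(M, HIDDEN)
post = torch.randn(M, HC)
comb = torch.randn(M, HC, HC)          # comb[m, hc_in, hc_out]
residual = torch.randn(M, HC, HIDDEN)  # residual[m, hc_in, :]

# Eager-style: broadcast to an (M, HC, HC, HIDDEN) intermediate, then reduce over hc_in.
eager = post.unsqueeze(-1) * x.unsqueeze(1) + (comb.unsqueeze(-1) * residual.unsqueeze(2)).sum(dim=1)

# Fused-style: keep the reduction over hc_in implicit, never materializing the intermediate.
fused = post.unsqueeze(-1) * x.unsqueeze(1) + torch.einsum("mio,mid->mod", comb, residual)

assert torch.allclose(eager, fused, atol=1e-5)
```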

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>