
chore: bump sgl-kernel version to 0.4.1.post1 #23720

Merged
Kangyan-Zhou merged 2 commits into main from bot/bump-kernel-version-0.4.1.post1-640e on Apr 26, 2026

Conversation

@sglang-bot
Member

Summary

This PR bumps the sgl-kernel version to 0.4.1.post1 across all relevant files.

Files Updated

  • sgl-kernel/pyproject.toml
  • sgl-kernel/pyproject_cpu.toml
  • sgl-kernel/pyproject_musa.toml
  • sgl-kernel/pyproject_rocm.toml
  • sgl-kernel/python/sgl_kernel/version.py

🤖 Generated with GitHub Actions

This commit updates the sgl-kernel version across all relevant files:
- sgl-kernel/pyproject.toml
- sgl-kernel/pyproject_cpu.toml
- sgl-kernel/pyproject_musa.toml
- sgl-kernel/pyproject_rocm.toml
- sgl-kernel/python/sgl_kernel/version.py

🤖 Generated with GitHub Actions

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the version of sglang-kernel to 0.4.1.post1 across multiple configuration files and the version definition. The review feedback highlights the need to update the version pin in the main python/pyproject.toml file for consistency and suggests correcting a GPU classifier in the ROCm configuration file to properly identify the hardware.

Comment thread: sgl-kernel/pyproject.toml

  [project]
  name = "sglang-kernel"
- version = "0.4.1"
+ version = "0.4.1.post1"
Contributor


Severity: high

The version bump to 0.4.1.post1 should also be reflected in the main python/pyproject.toml file, which currently pins sglang-kernel==0.4.1 at line 62. Updating this dependency is necessary for the main package to utilize the new kernel version.
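
A hedged sketch of the suggested follow-up; the exact dependency line in python/pyproject.toml is not visible in this thread, so the surrounding spelling is an assumption:

- "sglang-kernel==0.4.1",
+ "sglang-kernel==0.4.1.post1",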

Comment thread: sgl-kernel/pyproject_rocm.toml

  [project]
  name = "sglang-kernel"
- version = "0.4.1"
+ version = "0.4.1.post1"
Contributor


Severity: medium

While updating the version, consider also correcting the Environment :: GPU :: NVIDIA CUDA classifier on line 20 of this file. Since this is the ROCm-specific configuration, it should likely refer to ROCm or AMD instead of NVIDIA CUDA.
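
A hedged sketch of that classifier fix; as far as I know the trove classifier list has no ROCm-specific entry, so the generic GPU classifier is used here as a stand-in:

- "Environment :: GPU :: NVIDIA CUDA",
+ "Environment :: GPU",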

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-h100

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-b200-small

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200-small to run independently (skipping dependencies). View workflow run

The Nemotron-3-Nano stage-b CI tests are failing on main, not due to this sgl-kernel bump. Disable them in the registry until the underlying issue is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou force-pushed the bot/bump-kernel-version-0.4.1.post1-640e branch from d8dc252 to 0c723b1 on April 25, 2026 19:02
@Kangyan-Zhou merged commit 7141735 into main on Apr 26, 2026
245 of 289 checks passed
@Kangyan-Zhou deleted the bot/bump-kernel-version-0.4.1.post1-640e branch on April 26, 2026 00:13
Kangyan-Zhou added a commit that referenced this pull request Apr 26, 2026
…n test

The Phase-3 renormalize block in `grouped_topk_single_group_kernel` called
`warp_sum_f32` (which uses `__shfl_xor_sync(0xffffffff, ...)`) from inside
`if (lane_id < topk)`. With `topk` < 32 (e.g. nemotron-3-nano: topk=6), only
lanes 0..topk-1 reached the intrinsic, but the mask 0xffffffff named all 32
lanes. CUDA spec: every lane named in the mask must execute the intrinsic
at the same site, otherwise the result is undefined.

Empirically the UB returned values from the absent lanes' registers,
producing wrong renormalized weights — 2 of 6 weights per token were
unnormalized (~1.5x too large). The wrong values were tolerated in eager
inference, but under piecewise CUDA graph replay they cascaded into a
downstream OOB that surfaced as IMA at `piecewise_cuda_graph_runner.py:794`
on `TestNvidiaNemotron3Nano30BFP8.test_lm_eval`.

Fix: move the warp_sum out of the divergent `if`, have all 32 lanes
participate, with inactive lanes contributing the additive identity (0).
Output writes remain gated by `if (lane_id < topk)`.
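
For illustration, a minimal CUDA sketch of the divergence hazard and the fix. The names warp_sum_f32, lane_id, topk, and the weight w come from the commit message above; everything else is assumed and is not the actual sgl-kernel source:

__device__ float warp_sum_f32(float v) {
    // Butterfly reduction. Every lane named in the 0xffffffff mask must
    // reach each __shfl_xor_sync at the same call site, per the CUDA spec,
    // otherwise the result is undefined.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// Buggy shape (sketch): with topk < 32, lanes topk..31 never reach the
// intrinsic, yet the full mask names all 32 lanes -> undefined behavior.
//     if (lane_id < topk) {
//         w /= warp_sum_f32(w);
//     }

// Fixed shape (sketch): all 32 lanes call the reduction, with inactive
// lanes contributing the additive identity (0.0f); only the output write
// stays gated behind the lane check.
__device__ void renormalize_sketch(int lane_id, int topk, float &w) {
    float sum = warp_sum_f32(lane_id < topk ? w : 0.0f);
    if (lane_id < topk)
        w /= sum;
}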

Validated:
- Unit sweep across E in {16..512}, K in {1..8}, N in {1..128}: matches
  reference biased_grouped_topk_impl with max diff < 1e-7.
- 2x H200 e2e: TestNvidiaNemotron3Nano30BFP8.test_lm_eval passes
  (gsm8k strict=0.839, flexible=0.542, both within rtol=0.08).
- Buggy kernel + eager (no graphs) also passes — confirming the kernel
  itself doesn't fault, only the cascade-under-graph-replay does.

This is the surgical alternative to #23758, which reverts the entire
#23533 (~4000 lines). The model code, tool/reasoning parsers, and tuned
MoE configs from #23533 are not part of the bug.

Also re-enables `test_nvidia_nemotron_3_nano` (the stop-gap disable was
added in #23720 when this IMA started showing up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

amd, dependencies, mthreads, run-ci, sgl-kernel

3 participants