
chore: bump sgl-kernel version to 0.4.1.post1 #23720

Merged
Kangyan-Zhou merged 2 commits into main from bot/bump-kernel-version-0.4.1.post1-640e on Apr 26, 2026

Conversation

@sglang-bot
Member

Summary

This PR bumps the sgl-kernel version to 0.4.1.post1 across all relevant files.

Files Updated

  • sgl-kernel/pyproject.toml
  • sgl-kernel/pyproject_cpu.toml
  • sgl-kernel/pyproject_musa.toml
  • sgl-kernel/pyproject_rocm.toml
  • sgl-kernel/python/sgl_kernel/version.py

🤖 Generated with GitHub Actions

This commit updates the sgl-kernel version across all relevant files:
- sgl-kernel/pyproject.toml
- sgl-kernel/pyproject_cpu.toml
- sgl-kernel/pyproject_musa.toml
- sgl-kernel/pyproject_rocm.toml
- sgl-kernel/python/sgl_kernel/version.py

🤖 Generated with GitHub Actions

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the version of sglang-kernel to 0.4.1.post1 across multiple configuration files and the version definition. The review feedback highlights the need to update the version pin in the main python/pyproject.toml file for consistency and suggests correcting a GPU classifier in the ROCm configuration file to properly identify the hardware.

Comment thread: sgl-kernel/pyproject.toml

  [project]
  name = "sglang-kernel"
- version = "0.4.1"
+ version = "0.4.1.post1"
Contributor


Severity: high

The version bump to 0.4.1.post1 should also be reflected in the main python/pyproject.toml file, which currently pins sglang-kernel==0.4.1 at line 62. Updating this dependency is necessary for the main package to utilize the new kernel version.
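
A hedged sketch of the suggested follow-up; the exact dependency line in python/pyproject.toml is not visible in this thread, so the surrounding spelling is an assumption:

- "sglang-kernel==0.4.1",
+ "sglang-kernel==0.4.1.post1",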

Comment thread: sgl-kernel/pyproject_rocm.toml

  [project]
  name = "sglang-kernel"
- version = "0.4.1"
+ version = "0.4.1.post1"
Contributor


Severity: medium

While updating the version, consider also correcting the Environment :: GPU :: NVIDIA CUDA classifier on line 20 of this file. Since this is the ROCm-specific configuration, it should likely refer to ROCm or AMD instead of NVIDIA CUDA.
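
A hedged sketch of that classifier fix; as far as I know the trove classifier list has no ROCm-specific entry, so the generic GPU classifier is used here as a stand-in:

- "Environment :: GPU :: NVIDIA CUDA",
+ "Environment :: GPU",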

@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-h100

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage stage-c-test-4-gpu-b200-small

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200-small to run independently (skipping dependencies). View workflow run

The Nemotron-3-Nano stage-b CI tests are failing on main, not due to this sgl-kernel bump. Disable them in the registry until the underlying issue is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou force-pushed the bot/bump-kernel-version-0.4.1.post1-640e branch from d8dc252 to 0c723b1 on April 25, 2026 19:02
@Kangyan-Zhou merged commit 7141735 into main on Apr 26, 2026
245 of 289 checks passed
@Kangyan-Zhou deleted the bot/bump-kernel-version-0.4.1.post1-640e branch on April 26, 2026 00:13
Kangyan-Zhou added a commit that referenced this pull request Apr 26, 2026
…n test

The Phase-3 renormalize block in `grouped_topk_single_group_kernel` called
`warp_sum_f32` (which uses `__shfl_xor_sync(0xffffffff, ...)`) from inside
`if (lane_id < topk)`. With `topk` < 32 (e.g. nemotron-3-nano: topk=6), only
lanes 0..topk-1 reached the intrinsic, but the mask 0xffffffff named all 32
lanes. CUDA spec: every lane named in the mask must execute the intrinsic
at the same site, otherwise the result is undefined.

Empirically the UB returned values from the absent lanes' registers,
producing wrong renormalized weights — 2 of 6 weights per token were
unnormalized (~1.5x too large). The wrong values were tolerated in eager
inference, but under piecewise CUDA graph replay they cascaded into a
downstream OOB that surfaced as IMA at `piecewise_cuda_graph_runner.py:794`
on `TestNvidiaNemotron3Nano30BFP8.test_lm_eval`.

Fix: move the warp_sum out of the divergent `if`, have all 32 lanes
participate, with inactive lanes contributing the additive identity (0).
Output writes remain gated by `if (lane_id < topk)`.
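
For illustration, a minimal CUDA sketch of the divergence hazard and the fix. The names warp_sum_f32, lane_id, topk, and the weight w come from the commit message above; everything else is assumed and is not the actual sgl-kernel source:

__device__ float warp_sum_f32(float v) {
    // Butterfly reduction. Every lane named in the 0xffffffff mask must
    // reach each __shfl_xor_sync at the same call site, per the CUDA spec,
    // otherwise the result is undefined.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// Buggy shape (sketch): with topk < 32, lanes topk..31 never reach the
// intrinsic, yet the full mask names all 32 lanes -> undefined behavior.
//     if (lane_id < topk) {
//         w /= warp_sum_f32(w);
//     }

// Fixed shape (sketch): all 32 lanes call the reduction, with inactive
// lanes contributing the additive identity (0.0f); only the output write
// stays gated behind the lane check.
__device__ void renormalize_sketch(int lane_id, int topk, float &w) {
    float sum = warp_sum_f32(lane_id < topk ? w : 0.0f);
    if (lane_id < topk)
        w /= sum;
}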

Validated:
- Unit sweep across E in {16..512}, K in {1..8}, N in {1..128}: matches
  reference biased_grouped_topk_impl with max diff < 1e-7.
- 2x H200 e2e: TestNvidiaNemotron3Nano30BFP8.test_lm_eval passes
  (gsm8k strict=0.839, flexible=0.542, both within rtol=0.08).
- Buggy kernel + eager (no graphs) also passes — confirming the kernel
  itself doesn't fault, only the cascade-under-graph-replay does.

This is the surgical alternative to #23758, which reverts the entire
#23533 (~4000 lines). The model code, tool/reasoning parsers, and tuned
MoE configs from #23533 are not part of the bug.

Also re-enables `test_nvidia_nemotron_3_nano` (the stop-gap disable was
added in #23720 when this IMA started showing up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
Co-authored-by: Kangyan Zhou <kangyan.zhou@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

amd, dependencies, mthreads, run-ci, sgl-kernel

3 participants