Revert #23533 (Hy3 preview) + re-enable test_nvidia_nemotron_3_nano #23758
Closed
alisonshao wants to merge 2 commits into main from
Conversation
This reverts commit 6d03861.
Contributor
Code Review
This pull request removes the implementation and support for the Hunyuan-V3 (HYV3) model architecture, including its specialized MoE routing kernels, function call detectors, reasoning parsers, and model configurations. Additionally, it re-enables a test for the Nemotron-3-Nano model that was previously disabled. I have no feedback to provide as there are no review comments to assess.
Collaborator
Author
/rerun-test test_nvidia_nemotron_3_nano.py
Contributor
✅
Kangyan-Zhou added a commit that referenced this pull request on Apr 26, 2026

…n test
The Phase-3 renormalize block in `grouped_topk_single_group_kernel` called
`warp_sum_f32` (which uses `__shfl_xor_sync(0xffffffff, ...)`) from inside
`if (lane_id < topk)`. With `topk` < 32 (e.g. nemotron-3-nano: topk=6), only
lanes 0..topk-1 reached the intrinsic, but the mask 0xffffffff named all 32
lanes. Per the CUDA spec, every lane named in the mask must execute the
intrinsic at the same call site; otherwise the result is undefined.
Empirically the UB returned values from the absent lanes' registers,
producing wrong renormalized weights — 2 of 6 weights per token were
unnormalized (~1.5x too large). The wrong values were tolerated in eager
inference, but under piecewise CUDA graph replay they cascaded into a
downstream out-of-bounds access that surfaced as an illegal memory access
(IMA) at `piecewise_cuda_graph_runner.py:794`
on `TestNvidiaNemotron3Nano30BFP8.test_lm_eval`.
Fix: move the warp_sum out of the divergent `if` so that all 32 lanes
participate, with inactive lanes contributing the additive identity (0).
Output writes remain gated by `if (lane_id < topk)`.
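The buggy and fixed participation patterns can be simulated in plain Python (a sketch, not the kernel itself): `warp_sum_all` models the fix, where all 32 lanes run the XOR-shuffle butterfly and the padding lanes contribute 0; `warp_sum_divergent` models the bug, where lanes ≥ topk skip the intrinsic and active lanes read whatever happens to sit in the absent lanes' registers (one concrete outcome of the undefined behavior).

```python
WARP_SIZE = 32

def warp_sum_all(regs):
    # Fixed pattern: every lane executes each __shfl_xor_sync step, so
    # after offsets 16, 8, 4, 2, 1 every lane holds the full warp sum.
    v = list(regs)
    for off in (16, 8, 4, 2, 1):
        v = [v[i] + v[i ^ off] for i in range(WARP_SIZE)]
    return v

def warp_sum_divergent(regs, topk):
    # Buggy pattern: only lanes < topk execute the shuffle, yet they
    # still read lanes >= topk, whose registers were never prepared
    # for the reduction.
    v = list(regs)
    for off in (16, 8, 4, 2, 1):
        nxt = list(v)
        for i in range(topk):
            nxt[i] = v[i] + v[i ^ off]
        v = nxt
    return v

topk = 6                                    # as in nemotron-3-nano
weights = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]  # illustrative routing weights
garbage = [7.0] * (WARP_SIZE - topk)        # stale register contents

# Fixed kernel: inactive lanes contribute the additive identity (0).
fixed = warp_sum_all(weights + [0.0] * (WARP_SIZE - topk))
assert abs(fixed[0] - 1.0) < 1e-9

# Buggy kernel: the stale values leak into the active lanes' sums.
buggy = warp_sum_divergent(weights + garbage, topk)
assert buggy[0] != fixed[0]
```

Every lane of `fixed` holds the same total, which is why gating only the output writes (not the reduction itself) is sufficient.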
Validated:
- Unit sweep across E in {16..512}, K in {1..8}, N in {1..128}: matches
reference biased_grouped_topk_impl with max diff < 1e-7.
- 2x H200 e2e: TestNvidiaNemotron3Nano30BFP8.test_lm_eval passes
(gsm8k strict=0.839, flexible=0.542, both within rtol=0.08).
- Buggy kernel + eager (no graphs) also passes — confirming the kernel
itself doesn't fault, only the cascade-under-graph-replay does.
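For context, the renormalize semantics the sweep checks against can be sketched as follows. This is a simplified stand-in, not the actual `biased_grouped_topk_impl` (which also applies scoring and bias terms); it shows only the top-k selection plus renormalize step the bug corrupted.

```python
def topk_renormalize(scores, topk):
    # Pick the topk highest scores, then divide each selected weight by
    # their sum so the routing weights for a token total exactly 1.
    idx = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:topk]
    picked = [scores[i] for i in idx]
    total = sum(picked)
    return idx, [w / total for w in picked]

idx, weights = topk_renormalize([0.9, 0.1, 0.6, 0.8, 0.3, 0.7, 0.2, 0.5], topk=6)
assert len(weights) == 6
assert abs(sum(weights) - 1.0) < 1e-12
# The bug left some of the selected weights unnormalized (~1.5x too
# large), so the per-token sum exceeded 1 for affected tokens.
```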
This is the surgical alternative to #23758, which reverts the entire
#23533 (~4000 lines). The model code, tool/reasoning parsers, and tuned
MoE configs from #23533 are not part of the bug.
Also re-enables `test_nvidia_nemotron_3_nano` (the stop-gap disable was
added in #23720 when this IMA started showing up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts #23533 and re-enables `test_nvidia_nemotron_3_nano`, which #23720 disabled as a stop-gap when the scheduled `pr-test` started failing.

Bisected the failure (`Fatal Python error: Aborted` from `piecewise_cuda_graph_runner.py:794` during FP8 nemotron decode, surfaced as `Triton Error [CUDA]: an illegal memory access` in `_static_quant_fp8`) on a 2x H200 against `TestNvidiaNemotron3Nano30BFP8.test_lm_eval`. First-bad commit is 6d0386147 (#23533); the parent (6344b546c) ran gsm8k=0.850 cleanly. Failure example: https://github.com/sgl-project/sglang/actions/runs/24936337295/job/73022450777.

#23533 added a new `grouped_topk_single_group_kernel` and wired it in for any single-group MoE with ≤512 experts and topk≤8 (`python/sglang/srt/layers/moe/topk.py`). Nemotron-3-Nano-A3B falls into that gate. The kernel corrupts CUDA state, and the next sync point, `_static_quant_fp8` in the FP8 path, surfaces the illegal access. The reason #23533's own CI was green is that its branch predated #22218 (Breakable Piecewise Cuda Graph): each PR works alone; the combination on main does not.

Conflicts during revert:
- `python/sglang/srt/models/hunyuan_v3.py`: deleted (#23732, "Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731)", added a one-line guard here; the file only existed because of #23533, "support Hy3 preview").
- `docs/basic_usage/hy3_preview.md`: left as an orphan (pre-commit blocks deletions in legacy `docs/`).
- `topk.py`, `server_args.py`: auto-merged.

Reland #23533 once the new kernel is audited against the breakable-PCG runner.
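The dispatch gate described above can be sketched as a simple predicate. This is a hypothetical reconstruction for illustration; the real check lives in `python/sglang/srt/layers/moe/topk.py`, and the function name and expert counts below are assumptions, not the actual code.

```python
def uses_single_group_kernel(num_expert_group: int, num_experts: int, topk: int) -> bool:
    # Hypothetical reconstruction of the gate: the specialized kernel is
    # selected for single-group MoE routing with at most 512 experts and
    # topk <= 8.
    return num_expert_group == 1 and num_experts <= 512 and topk <= 8

# With topk=6 and single-group routing, Nemotron-3-Nano-A3B lands inside
# the gate for any expert count up to 512 (the count here is illustrative).
assert uses_single_group_kernel(1, 128, 6)
assert not uses_single_group_kernel(1, 1024, 6)  # too many experts
assert not uses_single_group_kernel(4, 128, 6)   # multi-group routing
```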
Test plan
- `stage-b-test-2-gpu-large (2)` runs the re-enabled test and reports gsm8k≈0.85.
- (`hunyuan_v3`, `hunyuan_v3_nextn`, `grouped_topk`, `hunyuan_detector`).