[v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom #935
yueming-yuan merged 38 commits into main
Conversation
Code Review
This pull request addresses NCCL 'out of memory' errors related to NVLS multicast handles by introducing the --sglang-enforce-disable-flashinfer-allreduce-fusion flag in documentation and E2E tests. Feedback suggests adding the specific SGLang version (v0.5.10) to the documentation for clarity and ensuring a trailing newline is present in the markdown file.
14. **NCCL error: `Failed to bind NVLink SHARP (NVLS) Multicast memory ... CUDA error 2 'out of memory'` in colocate mode.**
This happens when SGLang's flashinfer allreduce fusion allocates NVLS (NVLink SHARP) multicast handles on the NVSwitch, leaving insufficient handles for Megatron's training NCCL. NVSwitch has a hardware-limited number of multicast handle slots. In colocate mode, SGLang and Megatron share the same GPUs, and SGLang's NVLS handles are not released during sleep.
The explanation mentions that this issue happens when SGLang's flashinfer allreduce fusion allocates NVLS handles. It would be beneficial to specify that this behavior was introduced in SGLang v0.5.10 (switching from IPC to NVLS), as users on older versions might not encounter this or might find the flag missing.
Fix: add `--sglang-enforce-disable-flashinfer-allreduce-fusion` to your training command. This disables flashinfer's NVLS-based allreduce fusion in SGLang, reserving NVLS handles for Megatron training.
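For illustration, a minimal sketch of how such a flag can be wired through to SGLang. The argparse plumbing below is hypothetical; only the flag name and the `enable_flashinfer_allreduce_fusion` setting are taken from this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--sglang-enforce-disable-flashinfer-allreduce-fusion",
    action="store_true",
    help="Disable flashinfer's NVLS-based allreduce fusion in SGLang, "
         "reserving NVSwitch multicast handles for Megatron training.",
)
args = parser.parse_args()

# Hypothetical wiring: forward the choice to the SGLang engine launch.
# (The PR also disables piecewise cuda graph; that plumbing is omitted here.)
sglang_engine_kwargs = {}
if args.sglang_enforce_disable_flashinfer_allreduce_fusion:
    sglang_engine_kwargs["enable_flashinfer_allreduce_fusion"] = False
```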
This issue has been observed on H100 with Qwen3-30B-A3B in colocate mode.
There is a missing newline at the end of the file. Adding a trailing newline is a standard practice for text files to ensure compatibility with various tools and to follow POSIX standards.
- Update qa.md docs
- Add disable_piecewise_cuda_graph argument
- Adjust rollout worker cuda graph config
Remove models broken by transformers v5 tokenizer unification (DeepSeek-V3, step3, glm-4-9b-chat) and track them in a TOOL_CALL_KNOWN_FAILURES list with root cause comments. Add new passing models: Qwen3.5, Qwen3-Coder-Next, GLM-4.7-Flash, Kimi-K2.5, MiniMax-M2.5, Nemotron-3-Super. Clean up debug helpers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
transformers >=5.0 changed apply_chat_template(tokenize=True) to return BatchEncoding instead of list[int]. Pass return_dict=False to all 6 call sites in mask_utils.py to ensure list[int] on both v4 and v5. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
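The fix in this commit is easy to demonstrate in isolation. A minimal sketch, using an illustrative model name; the key point is `return_dict=False`:

```python
from transformers import AutoTokenizer

# Any chat model works here; the name is illustrative.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "hello"}]

# transformers <5.0 returns list[int] from this call; >=5.0 returns a
# BatchEncoding. return_dict=False pins the result to list[int] on both.
token_ids = tok.apply_chat_template(messages, tokenize=True, return_dict=False)
assert isinstance(token_ids, list) and all(isinstance(t, int) for t in token_ids)
```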
Move Step-3.5-Flash from known failures into active tool-call test models, and clarify comments for remaining transformers v5 tokenizer/template incompatibilities. Made-with: Cursor
- Revert CI image back to radixark/miles:dev
- Revert SGLANG_PR default back to sglang-miles
- Revert SGLANG_BRANCH back to sglang-miles
- Revert Megatron-Bridge back to merged-megatron-0.16.0rc0-miles
…region clusters (#10)

* Revert "[BUGFIX] [P2PRDMA] Add rollout post-processing after P2PRDMA weight updates" (radixark#882)
* [Fix] fix ci (radixark#894)
* Avoid threading for ray getting object (radixark#886)
* Add explicit errors for unsupported Megatron profiles (radixark#887)
* Add nvfp4 quantizer files (radixark#907)
* Bump flash-linear-attention version to 0.4.2 (radixark#892)
* [BUGFIX] Invoke "post_process_quantization" by default after weight updating (radixark#890) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* Add heartbeat and id to session server (radixark#866)
* fix: adding thin glm5 image to docker build + latest tag sync (radixark#871)
* Add consistent hashing routing policy for rollout (radixark#891) Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
* [example] add retool v2 example with multi-turn framework interfaces (radixark#654) Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Expose rollout-batch-size, n-samples-per-prompt, global-batch-size as CLI args in swe-agent-v2 (radixark#954) Co-authored-by: Shi Dong <shi.dong@radixark.ai>
* chore: remove obsolete swe-agent server.py and run-qwen3.sh (radixark#952) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add weight staleness control for fully async rollout (radixark#958)
* Fix/pause generation mode (radixark#924) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* [v0.5.10][1] Bump sglang to v0.5.10 (radixark#898)
* [v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 (radixark#926) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][3] Fix processor return_tensors duplicate kwarg for transformers >=5.0 (radixark#927) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][4] Fix _no_split_modules set not subscriptable in transformers >=5.0 (radixark#931)
* [v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom (radixark#935)
* [v0.5.10][6][FSDP] fix outdated weight update logic in FSDP (radixark#948) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [v0.5.10][7][FSDP] move FSDP to experimental and disable by default (radixark#961)
* Add skiplist and more robust calculation on val (radixark#965)
* [fix] tiny fix debug rollout only in weight version check (radixark#967)
* feat: real cp support with relayout fix for qwen3.5 train/rollout mismatch (radixark#885)
* [AMD] Upgrade to sglv0.5.10 (radixark#973)
* switch model to actor (radixark#756)
* [fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype (radixark#975) Co-authored-by: yueming-yuan <yym022502@gmail.com>
* fix: populate prefix_cache_info in OpenAI/session rollout path (radixark#960)
* Remove prepare_harbor_tasks.py; use harbor-private adapters (radixark#982)
* [fix] Skip flush_cache in in_place mode and add fully async example (radixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* GLM47 full cmd for async and sync reasoning (radixark#986) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: handle non-tool appended messages in TITO incremental tokenization (radixark#949)
  Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
* [docker] Add sgl-model-gateway install and download .tar.gz assets (radixark#895)
* [ci] fix hf rate limit error by caching tokenizer loading (radixark#1014) Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
* Use load_generate_function in legacy sglang_rollout path (radixark#1016)
* Update CODEOWNERS to add new reviewers (radixark#1021)
* Support moe lora for gpt-oss (radixark#798) Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
* [fix] restore expert_bias to fp32 before bridge weight export (radixark#811)
* [chore] drop legacy transformers upgrade pin for glm47-flash and qwen35 (radixark#1018)
* [fix] Enforce param dtype before wrap ddp (radixark#992) Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
* [upgrade] update Megatron-Bridge source and LoRA CI to megatron e2e tests and (radixark#1023)
* [CI] Drop --use-miles-router from R3 tests and add r3 comparison test between sgl & miles router (radixark#1015)
* wandb: raise init_timeout, add retry wrapper, fix shared-mode init for cross-region clusters

  In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud US-West) with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout`, tearing down the whole run with a silent handshake abort. Observed on RL360 job 1564420, which has forced `WANDB_MODE=offline` as a global default ever since (see https://github.com/LLM360/RL360/issues/87).

  The issue's original diagnosis assumed a local primary↔secondary socket handshake race. That's not how shared mode works: per wandb's own feature PR (wandb/wandb#6882), each writer spawns an independent wandb-core that talks to the cloud directly, and aggregation happens server-side by run_id. No local socket exists. The failure mode is pure network latency, not a local readiness race.

  Changes
  -------
  - Bump `init_timeout` to 300s for primary and secondary Settings, configurable via the `WANDB_INIT_TIMEOUT_SECS` env var.
  - Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError: 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
  - Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`, enabling per-rank console-log filtering in the wandb UI.
  - Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

  Followups this change enables
  -----------------------------
  - `WANDB_MODE=offline` can be removed from scale.yaml's extra_env default once a pilot run confirms online mode boots cleanly.
  - The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2 account becomes obsolete (no more offline-only default).
  - Near-realtime wandb dashboards replace the ~2-minute-lag offline sync; per-rank system metrics via x_label filtering.
---------

Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
Co-authored-by: JensenFire <xinji1@microsoft.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: Douglas Yang <douglasyang88@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Shi-Dong <Shi-Dong@users.noreply.github.com>
Co-authored-by: Shi Dong <shi.dong@radixark.ai>
Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com>
Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yisheng Gong <yishenggong9437@gmail.com>
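The retry wrapper described in the wandb commit above is straightforward to sketch. A minimal version, assuming the env-var names and defaults quoted in the commit message; the repo's actual `_wandb_init_with_retry` may differ in structure:

```python
import os
import time

import wandb

def _wandb_init_with_retry(**init_kwargs):
    """Bounded exponential-backoff wrapper around wandb.init (sketch)."""
    attempts = int(os.environ.get("WANDB_INIT_RETRY_ATTEMPTS", "3"))
    backoff = float(os.environ.get("WANDB_INIT_RETRY_BACKOFF_SECS", "5"))
    timeout = float(os.environ.get("WANDB_INIT_TIMEOUT_SECS", "300"))
    # Raise the handshake timeout well above the SDK's 90s default so a
    # single slow cross-region round-trip no longer aborts the run.
    settings = wandb.Settings(init_timeout=timeout)
    for attempt in range(attempts):
        try:
            return wandb.init(settings=settings, **init_kwargs)
        except (wandb.errors.CommError, wandb.errors.UsageError):
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))  # 5 -> 10 -> 20s by default
```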
Summary
- Fixes the NCCL `Failed to bind NVLink SHARP (NVLS) Multicast memory` error in colocate mode
- Adds `--sglang-enforce-disable-flashinfer-allreduce-fusion` to affected test scripts

Root cause
`flashinfer.comm.create_allreduce_fusion_workspace(backend="trtllm")` in v0.5.10 uses NVLS multicast (vs IPC in v0.5.9). These handles persist through SGLang sleep and are not released for Megatron. DeepEP tests pass because they set `enable_flashinfer_allreduce_fusion=False`.
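Given that root cause, the guard is simple to picture. A minimal sketch; the function name comes from this PR's description, but its full signature and the surrounding rollout-worker structure are assumptions:

```python
import flashinfer.comm

def maybe_create_fusion_workspace(enable_fusion: bool, **workspace_kwargs):
    """Skip the NVLS-backed workspace when fusion is disabled (sketch)."""
    if not enable_fusion:
        # No NVLS multicast handles are allocated, leaving the NVSwitch
        # slots free for Megatron's training NCCL communicators.
        return None
    # v0.5.10 path: the trtllm backend allocates NVLS multicast handles
    # that persist through SGLang sleep (v0.5.9 used IPC buffers instead).
    return flashinfer.comm.create_allreduce_fusion_workspace(
        backend="trtllm", **workspace_kwargs
    )
```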
Test plan

- `test_qwen3_30B_A3B_r3.py` (non-DeepEP) passes
- `test_qwen3_30B_A3B.py` (bf16, bridge) passes