Add section in docs showing how to fix training hang#5852
Conversation
There was a problem hiding this comment.
👋 Isaac Lab Review Bot
Thanks for adding this troubleshooting guide, @StafaH!
Review Summary:
✅ Technical accuracy — The explanation of NCCL P2P transport behavior with CUDA_VISIBLE_DEVICES is correct. The workaround (NCCL_P2P_DISABLE=1) is the standard fix for this known issue.
✅ RST formatting — Proper use of section anchors, code blocks, and note directives. The heading level (^^^) correctly nests under the Troubleshooting section.
✅ Placement — Good location in multi_gpu.rst alongside other GPU troubleshooting content, right before the Multi-Node section.
✅ Changelog — Fragment follows the correct format in changelog.d/.
Minor observation:
The note at the end appropriately warns about the bandwidth tradeoff — helpful for users to understand when not to apply this workaround universally.
LGTM! 🚀
Update (cc2bd58): Nice simplification! The changes consolidate the CUDA_VISIBLE_DEVICES hang workaround directly into the existing NCCL troubleshooting flow rather than a separate subsection. The detailed "Why this is only needed" explanation has been trimmed, and the bandwidth tradeoff warning is now merged into a single unified note. This makes the doc more concise while preserving the essential information. Still LGTM! ✅
Update (dbb3a90): Base branch sync only — no changes to the PR's documentation files. Review status unchanged. ✅
Greptile SummaryThis PR adds documentation to the multi-GPU troubleshooting section explaining that restricting visible GPUs via
Confidence Score: 5/5Documentation-only change with accurate, well-scoped troubleshooting guidance — no code is modified. Both changed files are documentation. The new NCCL_P2P_DISABLE=1 workaround is technically correct (it disables P2P transport and routes through host memory), the performance caveat is accurately communicated in the note, and the changelog fragment matches what was added. No logic, APIs, or runtime behaviour is touched. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Start Multi-GPU Training] --> B{CUDA_VISIBLE_DEVICES\nset to a subset?}
B -- No --> C[Training proceeds normally]
B -- Yes --> D{Training hangs\nwith no error?}
D -- No --> C
D -- Yes --> E[Set NCCL_P2P_DISABLE=1]
E --> F[Relaunch distributed\ntraining command]
F --> G[Training uses host/shared\nmemory instead of P2P]
G --> H[Hang resolved\n⚠ Reduced bandwidth]
style E fill:#f9c74f,stroke:#f8961e
style H fill:#90be6d,stroke:#43aa8b
Reviews (2): Last reviewed commit: "Merge branch 'develop' into mh/multigpu_..." | Re-trigger Greptile |
Description
Update to docs to highlight how to solve a potential multigpu training issue (not a bug but a known issue)
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there