Add section in docs showing how to fix training hang by StafaH · Pull Request #5852 · isaac-sim/IsaacLab

StafaH · 2026-05-29T00:48:43Z

Description

Updates the docs to highlight how to resolve a potential multi-GPU training hang (a known issue, not a bug).

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

isaaclab-review-bot

👋 Isaac Lab Review Bot

Thanks for adding this troubleshooting guide, @StafaH!

Review Summary:

✅ Technical accuracy — The explanation of NCCL P2P transport behavior with CUDA_VISIBLE_DEVICES is correct. The workaround (NCCL_P2P_DISABLE=1) is the standard fix for this known issue.

✅ RST formatting — Proper use of section anchors, code blocks, and note directives. The heading level (^^^) correctly nests under the Troubleshooting section.

✅ Placement — Good location in multi_gpu.rst alongside other GPU troubleshooting content, right before the Multi-Node section.

✅ Changelog — Fragment follows the correct format in changelog.d/.

Minor observation:
The note at the end appropriately warns about the bandwidth tradeoff — helpful for users to understand when not to apply this workaround universally.

LGTM! 🚀

Update (cc2bd58): Nice simplification! The changes consolidate the CUDA_VISIBLE_DEVICES hang workaround directly into the existing NCCL troubleshooting flow rather than a separate subsection. The detailed "Why this is only needed" explanation has been trimmed, and the bandwidth tradeoff warning is now merged into a single unified note. This makes the doc more concise while preserving the essential information. Still LGTM! ✅

Update (dbb3a90): Base branch sync only — no changes to the PR's documentation files. Review status unchanged. ✅

greptile-apps · 2026-05-29T00:50:17Z

Greptile Summary

This PR adds documentation to the multi-GPU troubleshooting section explaining that restricting visible GPUs via CUDA_VISIBLE_DEVICES can cause a silent training hang, and that NCCL_P2P_DISABLE=1 resolves it. A changelog fragment is also included.

Extends docs/source/features/multi_gpu.rst with a new paragraph, code block, and updated note covering the CUDA_VISIBLE_DEVICES-induced hang and the NCCL_P2P_DISABLE=1 workaround with its bandwidth trade-off.
Adds source/isaaclab/changelog.d/mh-multigpu-nccl-p2p-docs.rst as the corresponding changelog entry.
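As a rough illustration of the workaround summarized above, the following Python sketch builds an environment that restricts the visible GPUs and sets NCCL_P2P_DISABLE=1 before launching a child process. The helper name `build_launch_env` and the stand-in child command are assumptions made for this example; they are not part of Isaac Lab or of the documentation added in this PR.

```python
import os
import subprocess
import sys


def build_launch_env(visible_devices, base_env=None):
    """Return a copy of the environment with the P2P workaround applied.

    `visible_devices` is a list of GPU indices, e.g. [0, 1]. This helper is
    hypothetical and only illustrates which variables are involved.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Restrict the process to a subset of the machine's GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible_devices)
    # Disable NCCL peer-to-peer transport so traffic goes through host memory.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = build_launch_env([0, 1])
    # Stand-in for the real distributed training command.
    cmd = [sys.executable, "-c", "import os; print(os.environ['NCCL_P2P_DISABLE'])"]
    subprocess.run(cmd, env=env, check=True)
```

In practice the stand-in command would be replaced by the actual distributed training launch. Because disabling P2P reduces inter-GPU bandwidth, the variable is probably best set only for runs that actually hang.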

Confidence Score: 5/5

Documentation-only change with accurate, well-scoped troubleshooting guidance — no code is modified.

Both changed files are documentation. The new NCCL_P2P_DISABLE=1 workaround is technically correct (it disables P2P transport and routes through host memory), the performance caveat is accurately communicated in the note, and the changelog fragment matches what was added. No logic, APIs, or runtime behaviour is touched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/source/features/multi_gpu.rst | Adds a new paragraph and code block to the NCCL troubleshooting section documenting the CUDA_VISIBLE_DEVICES subset hang and the NCCL_P2P_DISABLE=1 workaround; also expands the closing note with a performance caveat. |
| source/isaaclab/changelog.d/mh-multigpu-nccl-p2p-docs.rst | New changelog fragment accurately describing the added troubleshooting documentation for the NCCL P2P hang. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start Multi-GPU Training] --> B{CUDA_VISIBLE_DEVICES\nset to a subset?}
    B -- No --> C[Training proceeds normally]
    B -- Yes --> D{Training hangs\nwith no error?}
    D -- No --> C
    D -- Yes --> E[Set NCCL_P2P_DISABLE=1]
    E --> F[Relaunch distributed\ntraining command]
    F --> G[Training uses host/shared\nmemory instead of P2P]
    G --> H[Hang resolved\n⚠ Reduced bandwidth]

    style E fill:#f9c74f,stroke:#f8961e
    style H fill:#90be6d,stroke:#43aa8b
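The decision path in the flowchart can be sketched as a small Python predicate. This is only an illustration under stated assumptions: the function name, its parameters, and the idea of passing the total GPU count explicitly are hypothetical and do not come from the PR.

```python
def should_disable_p2p(visible_devices, total_gpus, hang_observed):
    """Mirror the flowchart: apply the workaround only for a hanging subset run.

    visible_devices: list of GPU indices from CUDA_VISIBLE_DEVICES, or None
                     when the variable is unset (all GPUs visible).
    total_gpus:      number of GPUs physically present on the machine.
    hang_observed:   True if training stalls with no error message.
    """
    # Node B: is CUDA_VISIBLE_DEVICES set to a strict subset of the GPUs?
    is_subset = visible_devices is not None and len(set(visible_devices)) < total_gpus
    # Node D: only act if the silent hang actually occurs.
    return bool(is_subset and hang_observed)
```

Keeping the hang check in the condition reflects the bandwidth trade-off: runs that proceed normally keep P2P transport enabled.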

Reviews (2): Last reviewed commit: "Merge branch 'develop' into mh/multigpu_..."

# Description

Cherry-pick bug fixes from develop:

- #5838
- #5852
- #5869

---------

Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
Co-authored-by: Mustafa H <34825877+StafaH@users.noreply.github.com>
Co-authored-by: Frank Lai NV <frlai@nvidia.com>

Add section in docs showing how to fix training hang

dd6ffd3

StafaH requested review from Mayankm96, jtigue-bdai and kellyguo11 as code owners May 29, 2026 00:48

github-actions Bot added the documentation (Improvements or additions to documentation) and isaac-lab (Related to Isaac Lab team) labels May 29, 2026

isaaclab-review-bot Bot reviewed May 29, 2026


StafaH added 2 commits May 28, 2026 18:32

Update docs

edd486f

Precommit

cc2bd58

kellyguo11 approved these changes May 29, 2026


StafaH closed this May 29, 2026

StafaH deleted the mh/multigpu_docs_faq branch May 29, 2026 20:13

StafaH restored the mh/multigpu_docs_faq branch May 29, 2026 20:13

StafaH reopened this May 29, 2026

Merge branch 'develop' into mh/multigpu_docs_faq

dbb3a90

StafaH merged commit 73c4f34 into isaac-sim:develop May 30, 2026
60 of 61 checks passed

StafaH deleted the mh/multigpu_docs_faq branch May 30, 2026 00:16

kellyguo11 mentioned this pull request May 30, 2026

Cherry-picks fixes for SKRL, multi-GPU docs, LEAPP imports #5876

Merged

StafaH merged 4 commits into isaac-sim:develop from StafaH:mh/multigpu_docs_faq

2 participants
