Add section in docs showing how to fix training hang #5852

Merged
StafaH merged 4 commits into isaac-sim:develop from StafaH:mh/multigpu_docs_faq
May 30, 2026
Conversation

@StafaH StafaH commented May 29, 2026

Contributor

Description

Updates the docs to show how to resolve a potential multi-GPU training hang (a known issue, not a bug).

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added the documentation (Improvements or additions to documentation) and isaac-lab (Related to Isaac Lab team) labels May 29, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

👋 Isaac Lab Review Bot

Thanks for adding this troubleshooting guide, @StafaH!

Review Summary:

Technical accuracy — The explanation of NCCL P2P transport behavior with CUDA_VISIBLE_DEVICES is correct. The workaround (NCCL_P2P_DISABLE=1) is the standard fix for this known issue.

RST formatting — Proper use of section anchors, code blocks, and note directives. The heading level (^^^) correctly nests under the Troubleshooting section.

Placement — Good location in multi_gpu.rst alongside other GPU troubleshooting content, right before the Multi-Node section.

Changelog — Fragment follows the correct format in changelog.d/.

Minor observation:
The note at the end appropriately warns about the bandwidth tradeoff — helpful for users to understand when not to apply this workaround universally.

LGTM! 🚀


Update (cc2bd58): Nice simplification! The changes consolidate the CUDA_VISIBLE_DEVICES hang workaround directly into the existing NCCL troubleshooting flow rather than a separate subsection. The detailed "Why this is only needed" explanation has been trimmed, and the bandwidth tradeoff warning is now merged into a single unified note. This makes the doc more concise while preserving the essential information. Still LGTM! ✅


Update (dbb3a90): Base branch sync only — no changes to the PR's documentation files. Review status unchanged. ✅

@greptile-apps

greptile-apps Bot commented May 29, 2026

Contributor

Greptile Summary

This PR adds documentation to the multi-GPU troubleshooting section explaining that restricting visible GPUs via CUDA_VISIBLE_DEVICES can cause a silent training hang, and that NCCL_P2P_DISABLE=1 resolves it. A changelog fragment is also included.

  • Extends docs/source/features/multi_gpu.rst with a new paragraph, code block, and updated note covering the CUDA_VISIBLE_DEVICES-induced hang and the NCCL_P2P_DISABLE=1 workaround with its bandwidth trade-off.
  • Adds source/isaaclab/changelog.d/mh-multigpu-nccl-p2p-docs.rst as the corresponding changelog entry.
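
As a rough illustration of the workaround summarized above, the following Python sketch builds a process environment that restricts the visible GPUs and sets `NCCL_P2P_DISABLE=1`. The helper name `build_launch_env` and the device list `"0,1"` are hypothetical and are not taken from the PR; only the two environment variable names come from the documentation change.

```python
import os
from typing import Mapping, Optional


def build_launch_env(
    visible_devices: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment with the hang workaround applied.

    `build_launch_env` is a hypothetical helper used only for illustration.
    """
    # Copy so that the caller's environment (or os.environ) is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Restrict the GPUs that the training processes are allowed to see.
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    # Disable NCCL peer-to-peer transport; expect reduced bandwidth.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


env = build_launch_env("0,1", base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"], env["NCCL_P2P_DISABLE"])
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` together with whatever distributed training command is normally used. Because of the bandwidth trade-off noted in the docs, it is probably best applied only when the hang is actually observed.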

Confidence Score: 5/5

Documentation-only change with accurate, well-scoped troubleshooting guidance — no code is modified.

Both changed files are documentation. The new NCCL_P2P_DISABLE=1 workaround is technically correct (it disables P2P transport and routes through host memory), the performance caveat is accurately communicated in the note, and the changelog fragment matches what was added. No logic, APIs, or runtime behaviour is touched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/source/features/multi_gpu.rst | Adds a new paragraph and code block to the NCCL troubleshooting section documenting the CUDA_VISIBLE_DEVICES subset hang and the NCCL_P2P_DISABLE=1 workaround; also expands the closing note with a performance caveat. |
| source/isaaclab/changelog.d/mh-multigpu-nccl-p2p-docs.rst | New changelog fragment accurately describing the added troubleshooting documentation for the NCCL P2P hang. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start Multi-GPU Training] --> B{CUDA_VISIBLE_DEVICES\nset to a subset?}
    B -- No --> C[Training proceeds normally]
    B -- Yes --> D{Training hangs\nwith no error?}
    D -- No --> C
    D -- Yes --> E[Set NCCL_P2P_DISABLE=1]
    E --> F[Relaunch distributed\ntraining command]
    F --> G[Training uses host/shared\nmemory instead of P2P]
    G --> H[Hang resolved\n⚠ Reduced bandwidth]

    style E fill:#f9c74f,stroke:#f8961e
    style H fill:#90be6d,stroke:#43aa8b
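
The decision flow in the chart can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, `total_gpus` is supplied by the caller rather than detected, and `CUDA_VISIBLE_DEVICES` is treated as a plain comma-separated list of device entries.

```python
from typing import Optional


def is_gpu_subset(visible_devices: Optional[str], total_gpus: int) -> bool:
    """True when CUDA_VISIBLE_DEVICES exposes fewer GPUs than the machine has."""
    if visible_devices is None:
        # Variable unset: every GPU is visible, so this is not a subset.
        return False
    entries = [e.strip() for e in visible_devices.split(",") if e.strip()]
    return len(set(entries)) < total_gpus


def recommend_p2p_disable(
    visible_devices: Optional[str],
    total_gpus: int,
    hangs_silently: bool,
) -> bool:
    """Mirror the chart: apply the workaround only for a subset that hangs."""
    return is_gpu_subset(visible_devices, total_gpus) and hangs_silently


# Two of four GPUs visible and training hangs with no error: apply the workaround.
print(recommend_p2p_disable("0,1", total_gpus=4, hangs_silently=True))
# All four GPUs visible: the chart's "No" branch, so no workaround is suggested.
print(recommend_p2p_disable("0,1,2,3", total_gpus=4, hangs_silently=True))
```

Keeping the hang check as an explicit input reflects the chart's second decision node: the workaround costs bandwidth, so it is suggested only after a silent hang has been seen.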

Reviews (2): Last reviewed commit: "Merge branch 'develop' into mh/multigpu_..."

@StafaH StafaH closed this May 29, 2026
@StafaH StafaH deleted the mh/multigpu_docs_faq branch May 29, 2026 20:13
@StafaH StafaH restored the mh/multigpu_docs_faq branch May 29, 2026 20:13
@StafaH StafaH reopened this May 29, 2026
@StafaH StafaH merged commit 73c4f34 into isaac-sim:develop May 30, 2026
60 of 61 checks passed
@StafaH StafaH deleted the mh/multigpu_docs_faq branch May 30, 2026 00:16
kellyguo11 added a commit that referenced this pull request May 30, 2026
# Description

Cherry pick bug fixes from develop:

- #5838 
- #5852 
- #5869

---------

Co-authored-by: Antoine RICHARD <antoiner@nvidia.com>
Co-authored-by: Mustafa H <34825877+StafaH@users.noreply.github.com>
Co-authored-by: Frank Lai NV <frlai@nvidia.com>

2 participants