fix(pt): ensure proper cleanup of distributed process group #4622
Conversation
PR Overview
This PR addresses a warning during normal DDP training exit by ensuring that resources allocated for the distributed process group are properly released.
- Added a conditional check and a call to dist.destroy_process_group() to perform cleanup (see the sketch after this list)
- Ensures that resources are released in accordance with PyTorch's requirements
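A minimal sketch of that guard, assuming it sits at the end of training in deepmd/pt/train/training.py (the exact placement in the PR may differ):

```python
import torch.distributed as dist

# Tear down the process group only when torch.distributed is available and a
# group was actually initialized (single-process runs skip this entirely).
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()
```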
Reviewed Changes
| File | Description |
|---|---|
| deepmd/pt/train/training.py | Added destructor logic for distributed process group cleanup |
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
deepmd/pt/train/training.py:1034
- Ensure that calling dist.destroy_process_group() within log_loss_valid is the intended cleanup action, as this function may be invoked multiple times. If the cleanup should occur only once at training completion, consider moving this logic accordingly.
if dist.is_available() and dist.is_initialized():
📝 Walkthrough: The changes relocate the Distributed Data Parallel (DDP) initialization code from the get_trainer function to the train function and add a matching process group cleanup call once training completes.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Trainer
participant TrainFunc as train()
participant DDP as Distributed Group
User->>Trainer: Call get_trainer()
Note right of Trainer: No DDP initialization here.
Trainer->>TrainFunc: Invoke training process
TrainFunc->>DDP: Initialize DDP (dist.init_process_group)
TrainFunc->>TrainFunc: Execute training operations
TrainFunc->>DDP: Cleanup DDP (dist.destroy_process_group)
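A rough Python sketch of that flow, with a hypothetical trainer.run() standing in for the actual training loop (illustrative only, not the deepmd-kit API):

```python
import os
import torch.distributed as dist

def train(trainer):
    # DDP is set up inside train(), not in get_trainer(), so the process
    # group only exists while training actually runs.
    use_ddp = dist.is_available() and int(os.environ.get("WORLD_SIZE", "1")) > 1
    if use_ddp:
        dist.init_process_group(backend="nccl")  # rendezvous via torchrun env vars

    trainer.run()  # hypothetical training loop

    # Release the process group once training finishes, as required by PyTorch.
    if use_ddp and dist.is_initialized():
        dist.destroy_process_group()
```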
wanghan-iapcm
left a comment
I do not think this is the best place to finalize the processes.
The best way would be to explicitly pair init with destroy; then it seems better to place it at this line:
https://github.com/caic99/deepmd-kit/blob/19d0df3fa0cae2460ba7897c8133dc01e653e9bf/deepmd/pt/entrypoints/main.py#L363
Note: I do not think init_process_group is implemented in an elegant way.
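One way to realize that pairing, sketched with a hypothetical caller rather than the actual code at the linked line, is to wrap the training call in try/finally so the destroy always mirrors the init:

```python
import os
import torch.distributed as dist

def do_train(config):
    ...  # placeholder for the actual training entry point

def run_training(config):
    # Hypothetical caller that owns both init and destroy.
    use_ddp = int(os.environ.get("WORLD_SIZE", "1")) > 1
    if use_ddp:
        dist.init_process_group(backend="nccl")
    try:
        do_train(config)
    finally:
        # The scope that called init also calls destroy, so cleanup runs
        # exactly once even if training raises.
        if use_ddp and dist.is_initialized():
            dist.destroy_process_group()
```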
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/pt/entrypoints/main.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test C++ (false)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test C++ (true)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (c-cpp)
🔇 Additional comments (1)
deepmd/pt/entrypoints/main.py (1)
335-340: Good refactoring of DDP initialization logic
Moving the DDP initialization from get_trainer to the train function ensures that the process group is only initialized when actually needed for training. This is a cleaner approach that follows the principle of performing setup operations at the appropriate point in the execution flow.
PR Overview
This PR fixes the process group cleanup issue for distributed training to eliminate resource leakage warnings. The changes move the process group initialization from get_trainer to train and add a corresponding destroy_process_group call after training completes.
Reviewed Changes
| File | Description |
|---|---|
| deepmd/pt/entrypoints/main.py | Removed DDP initialization from get_trainer and added process group initialization and cleanup in train |
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## devel #4622 +/- ##
==========================================
- Coverage 84.58% 84.58% -0.01%
==========================================
Files 680 680
Lines 64547 64547
Branches 3539 3539
==========================================
- Hits 54600 54599 -1
Misses 8806 8806
- Partials 1141 1142 +1
☔ View full report in Codecov by Sentry.
There is a warning when DDP training exits normally:
PyTorch requires cleaning up these resources after training to ensure a graceful exit.
This PR adds the missing destructor to eliminate the warning.
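A minimal standalone sketch of the same init/destroy pairing, using the gloo backend so it can run in a single process without GPUs (independent of deepmd-kit, for illustration only):

```python
import os
import torch.distributed as dist

# Single-process rendezvous: the default env:// init method reads these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", world_size=1, rank=0)
print("initialized:", dist.is_initialized())  # True

# The cleanup this PR adds: release the group before the interpreter exits.
dist.destroy_process_group()
print("initialized:", dist.is_initialized())  # False
```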