refactor(training): Average training loss for smoother and more representative logging #4850
Conversation
for more information, see https://pre-commit.ci
📝 Walkthrough

The code introduces new mechanisms in the training process to accumulate and average training losses over each display interval, for both single-task and multi-task scenarios. New instance variables are added to track the accumulated losses and per-task step counts.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)
1041-1050: Initialize accumulators with explicit dict comprehension for clarity. The initialization logic is correct. Consider using a dict comprehension for better readability:

```diff
- self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+ self.step_count_per_task = {key: 0 for key in self.model_keys}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/pt/train/training.py (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
🔇 Additional comments (5)
deepmd/pt/train/training.py (5)
153-158: LGTM! Instance variables properly initialized for loss accumulation. The new instance variables are appropriately initialized to track loss accumulation and step counts for both single-task and multi-task scenarios.
817-842: LGTM! Loss accumulation logic correctly implemented. The implementation properly accumulates losses during training steps for both single-task and multi-task scenarios. The filtering of "l2_" items ensures that regularization terms are excluded from the averaged metrics, which is appropriate for training loss reporting.
850-870: LGTM! Averaging logic correctly computes mean losses. The implementation properly computes average losses by dividing accumulated values by the appropriate step counts. For multi-task training, using task-specific step counts ensures accurate averages even when tasks are sampled with different frequencies.
928-933: LGTM! Efficient multi-task loss reporting using accumulated averages. The change to use `log_loss_train` for all tasks eliminates redundant loss recomputation and ensures consistent averaged loss reporting across all tasks. This is more efficient than the previous approach.
955-968: LGTM! Comprehensive reset logic for next accumulation interval. The reset logic properly clears all accumulated values and step counts after displaying results, ensuring clean accumulation for the next interval. Keeping the dictionary structure intact (resetting values to 0 rather than deleting keys) is a good practice for performance.
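Taken together, the accumulate, average, and reset steps these comments describe can be sketched as a standalone helper. This is a simplified illustration only; the class and method names here are hypothetical and not the actual deepmd-kit implementation:

```python
class LossAccumulator:
    """Accumulate per-task training losses and report interval averages."""

    def __init__(self, model_keys):
        self.model_keys = list(model_keys)
        # One running sum dict and one step counter per task.
        self.loss_sums = {key: {} for key in self.model_keys}
        self.step_count_per_task = {key: 0 for key in self.model_keys}

    def accumulate(self, task_key, loss_dict):
        """Add one training step's loss terms to the running sums."""
        self.step_count_per_task[task_key] += 1
        sums = self.loss_sums[task_key]
        for name, value in loss_dict.items():
            # Exclude "l2_" regularization terms, as in the reviewed change.
            if name.startswith("l2_"):
                continue
            sums[name] = sums.get(name, 0.0) + value

    def average(self, task_key):
        """Mean of each accumulated term over this task's step count."""
        n = self.step_count_per_task[task_key]
        if n == 0:
            return {}
        return {name: total / n for name, total in self.loss_sums[task_key].items()}

    def reset(self):
        """Clear sums and counters for the next display interval."""
        for key in self.model_keys:
            self.loss_sums[key] = {}
            self.step_count_per_task[key] = 0
```

Per-task counters matter for the multi-task case: tasks sampled at different frequencies divide by their own step counts, not the global one.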
njzjz
left a comment
Although it seems useful, this is a breaking change that may alter the default behavior. It would be better to give users an option.
…the default value is false, which displays just like before
Actionable comments posted: 1
🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)
1060-1068: LGTM! Proper initialization of accumulators. The initialization logic correctly sets up the required data structures for loss accumulation. The use of `dict.fromkeys(self.model_keys, 0)` for step counters is fine since integers are immutable, though it might be clearer to use a dict comprehension for consistency:

```diff
- self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+ self.step_count_per_task = {key: 0 for key in self.model_keys}
```
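The reviewer's remark that `dict.fromkeys` is safe here because integers are immutable can be demonstrated directly; the pitfall only appears with a mutable default value:

```python
keys = ["task_a", "task_b"]

# Immutable default: rebinding one key's value never affects the others.
counts = dict.fromkeys(keys, 0)
counts["task_a"] += 1
assert counts == {"task_a": 1, "task_b": 0}

# Mutable default: the SAME list object is shared by every key.
shared = dict.fromkeys(keys, [])
shared["task_a"].append(1)
assert shared["task_b"] == [1]  # surprising aliasing

# A dict comprehension creates a fresh object per key, avoiding the pitfall.
separate = {k: [] for k in keys}
separate["task_a"].append(1)
assert separate["task_b"] == []
```

So for integer step counters both spellings behave identically; the comprehension is purely a readability preference.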
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deepmd/pt/train/training.py (5 hunks)
deepmd/utils/argcheck.py (2 hunks)
🔇 Additional comments (6)
deepmd/utils/argcheck.py (2)
3140-3140: LGTM! Well-documented configuration parameter. The addition of `doc_disp_avg` provides clear documentation for the new feature, explaining its purpose for displaying averaged training loss over display intervals.
3217-3219: LGTM! Proper argument configuration. The `disp_avg` argument is correctly configured with:
- Appropriate boolean type
- Optional parameter with sensible default (`False`)
- Clear documentation reference
- Consistent naming convention

This follows the established patterns in the codebase for configuration arguments.
deepmd/pt/train/training.py (4)
143-143: LGTM! The initialization of the `disp_avg` parameter follows the established pattern and maintains backward compatibility with the default value of `False`.
812-838: LGTM! Well-structured loss accumulation logic. The implementation correctly handles both single-task and multi-task scenarios with proper lazy initialization and step counting. The exclusion of L2 regularization terms from accumulation is appropriate for display purposes.
932-950: LGTM! Note the performance trade-off. The implementation correctly handles different logging strategies for multi-task training. When `disp_avg=True`, it efficiently uses accumulated averages for all tasks, avoiding expensive additional forward passes. When `disp_avg=False`, it maintains the original behavior of computing instantaneous losses for all tasks.
972-985: LGTM! Comprehensive accumulator reset logic. The reset logic properly handles both single-task and multi-task scenarios, ensuring that accumulators and step counters are reset after each display interval. This is essential for correct averaging behavior.
I added a parameter `disp_avg`; the default value is false, which displays just like before.
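Assuming the option lands in the usual `training` section of the input script alongside `disp_freq`, enabling it would look roughly like this (a sketch: only the `disp_avg` key is new here, and the surrounding values are placeholders):

```json
{
  "training": {
    "numb_steps": 1000000,
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "disp_avg": true
  }
}
```

With `disp_avg: true`, each line written to `lcurve.out` reports losses averaged over the preceding 100 steps instead of the single batch at the display step.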
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@           Coverage Diff            @@
##            devel   #4850     +/-  ##
========================================
  Coverage   84.71%  84.72%
========================================
  Files         699     699
  Lines       68147   68131     -16
  Branches     3541    3541
========================================
- Hits        57733   57721     -12
+ Misses       9279    9276      -3
+ Partials     1135    1134      -1
```

☔ View full report in Codecov by Sentry.
Co-authored-by: Jinzhe Zeng <njzjz@qq.com> Signed-off-by: LI TIANCHENG <137472077+OutisLi@users.noreply.github.com>
813bbc8
This pull request modifies the training loop to improve the quality and readability of the reported training loss.
Summary of Changes
Previously, the training loss and associated metrics (e.g., rmse_e_trn) reported in lcurve.out and the console log at each disp_freq step represented the instantaneous value from that single training batch. This could be quite noisy and subject to high variance depending on the specific batch sampled.
This PR introduces an accumulator for the training loss. The key changes are:
During each training step, the loss values are accumulated.
When a display step is reached, the accumulated values are averaged over the number of steps in that interval.
This averaged loss is then reported in the log and lcurve.out.
The accumulators are reset for the next interval.
The validation logic remains unchanged, continuing to provide a periodic snapshot of model performance, which is the standard and efficient approach.
Significance and Benefits
Reporting the averaged training loss provides a much smoother and more representative training curve. The benefits include:
Reduced Noise: Eliminates high-frequency fluctuations, making it easier to see the true learning trend.
Improved Readability: Plotted learning curves from lcurve.out are cleaner and more interpretable.
Better Comparability: Simplifies the comparison of model performance across different training runs, as the impact of single-batch anomalies is minimized.
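The noise-reduction claim can be quantified: averaging N independent per-batch losses shrinks the standard deviation of the reported value by roughly a factor of sqrt(N). A small stdlib-only simulation on a hypothetical loss stream (not actual training data) illustrates the effect:

```python
import random
import statistics

random.seed(0)
N = 100  # display interval, analogous to disp_freq

# Simulated noisy per-batch losses fluctuating around a true value of 1.0.
losses = [random.gauss(1.0, 0.5) for _ in range(100_000)]

# Instantaneous reporting: the single batch seen at each display step.
instantaneous = losses[::N]

# Averaged reporting: mean over each display interval.
averaged = [statistics.fmean(losses[i:i + N]) for i in range(0, len(losses), N)]

sd_inst = statistics.stdev(instantaneous)
sd_avg = statistics.stdev(averaged)
print(f"instantaneous sd: {sd_inst:.3f}, averaged sd: {sd_avg:.3f}")
assert sd_avg < sd_inst / 3  # expect roughly a sqrt(100) = 10x reduction
```

Both series have the same number of points per display step, so the learning curve keeps its resolution; only the batch-to-batch scatter is suppressed.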
A Note on Formatting
Please note that due to automatic code formatters (e.g., black, isort), some minor, purely stylistic changes may appear in the diff that are not directly related to the core logic.
Summary by CodeRabbit

New Features
- Training loss values displayed during training are now averaged over the display interval, providing more stable and representative loss metrics for both single-task and multi-task modes.
- Added an option to enable or disable averaging of training loss display via a new configuration setting.

Improvements
- Enhanced training loss reporting for improved monitoring and analysis during model training.