
Conversation

Collaborator

@OutisLi OutisLi commented Jul 28, 2025

This pull request modifies the training loop to improve the quality and readability of the reported training loss.

Summary of Changes

Previously, the training loss and associated metrics (e.g., rmse_e_trn) reported in lcurve.out and the console log at each disp_freq step represented the instantaneous value from that single training batch. This could be quite noisy and subject to high variance depending on the specific batch sampled.

This PR introduces an accumulator for the training loss. The key changes are:

  • During each training step, the loss values are accumulated.
  • When a display step is reached, the accumulated values are averaged over the number of steps in that interval.
  • The averaged loss is then reported in the console log and lcurve.out.
  • The accumulators are reset for the next interval.
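
For illustration, a minimal sketch of this accumulate / average / reset pattern is shown below. The class and names are hypothetical; the actual implementation lives in deepmd/pt/train/training.py as Trainer instance variables and differs in detail.

```python
# Minimal sketch of the accumulate / average / reset pattern described above.
# Class and attribute names are hypothetical, for illustration only.
class LossAccumulator:
    """Running sums of per-batch loss terms over one display interval."""

    def __init__(self) -> None:
        self.sums: dict[str, float] = {}
        self.steps: int = 0

    def add(self, batch_loss: dict[str, float]) -> None:
        """Accumulate one training step's loss terms."""
        for key, val in batch_loss.items():
            self.sums[key] = self.sums.get(key, 0.0) + float(val)
        self.steps += 1

    def average_and_reset(self) -> dict[str, float]:
        """Return interval-averaged losses and clear the accumulators."""
        averaged = {key: val / self.steps for key, val in self.sums.items()}
        self.sums.clear()
        self.steps = 0
        return averaged
```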

The validation logic remains unchanged, continuing to provide a periodic snapshot of model performance, which is the standard and efficient approach.

Significance and Benefits

Reporting the averaged training loss provides a much smoother and more representative training curve. The benefits include:

  • Reduced Noise: Smooths out high-frequency, single-batch fluctuations, making it easier to see the true learning trend.
  • Improved Readability: Learning curves plotted from lcurve.out are cleaner and more interpretable.
  • Better Comparability: Simplifies comparison of model performance across different training runs, since the impact of single-batch anomalies is minimized.

A Note on Formatting

Please note that due to automatic code formatters (e.g., black, isort), some minor, purely stylistic changes may appear in the diff that are not directly related to the core logic.

Summary by CodeRabbit

  • New Features

    • Training loss values displayed during training are now averaged over the display interval, providing more stable and representative loss metrics for both single-task and multi-task modes.
    • Added an option to enable or disable averaging of training loss display via a new configuration setting.
  • Improvements

    • Enhanced training loss reporting for improved monitoring and analysis during model training.

@OutisLi OutisLi changed the base branch from master to devel July 28, 2025 10:04
Contributor

coderabbitai bot commented Jul 28, 2025

📝 Walkthrough


The code introduces new mechanisms in the training process to accumulate and average training losses over each display interval, for both single-task and multi-task scenarios. New instance variables are added to the Trainer class to track accumulated losses and step counts, ensuring that loss reporting reflects averaged values over the interval rather than instantaneous values. Additionally, a new boolean argument disp_avg is added to the training configuration to enable this feature.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Loss Accumulation and Averaging Enhancements: deepmd/pt/train/training.py | Added instance variables for loss accumulation and step counting; modified the training loop to accumulate and average losses over display intervals for both single-task and multi-task modes; updated logging to use averaged losses; initialized and reset accumulators appropriately. |
| Training Configuration Argument Update: deepmd/utils/argcheck.py | Added a new optional boolean argument disp_avg to the training configuration arguments, with default False and documentation describing its purpose: controlling the averaging of training losses over display intervals. |
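
As a rough sketch, the new flag can be declared with a dargs Argument roughly as follows; the doc string here is paraphrased, not the exact text added in deepmd/utils/argcheck.py.

```python
from dargs import Argument

# Illustrative only: approximately how the new optional flag is declared.
doc_disp_avg = (
    "Whether to average the training loss over the display interval "
    "instead of reporting the loss of a single batch."
)
disp_avg_arg = Argument(
    "disp_avg", bool, optional=True, default=False, doc=doc_disp_avg
)
```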

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Suggested reviewers

  • caic99
  • njzjz

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4797c00 and fa79d45.

📒 Files selected for processing (1)
  • deepmd/utils/argcheck.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/utils/argcheck.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1041-1050: Initialize accumulators with explicit dict comprehension for clarity.

The initialization logic is correct. Consider using dict comprehension for better readability:

-            self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+            self.step_count_per_task = {key: 0 for key in self.model_keys}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c41b0dd and 27d22ea.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-09-19T04:25:12.408Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
🔇 Additional comments (5)
deepmd/pt/train/training.py (5)

153-158: LGTM! Instance variables properly initialized for loss accumulation.

The new instance variables are appropriately initialized to track loss accumulation and step counts for both single-task and multi-task scenarios.


817-842: LGTM! Loss accumulation logic correctly implemented.

The implementation properly accumulates losses during training steps for both single-task and multi-task scenarios. The filtering of "l2_" items ensures that regularization terms are excluded from the averaged metrics, which is appropriate for training loss reporting.


850-870: LGTM! Averaging logic correctly computes mean losses.

The implementation properly computes average losses by dividing accumulated values by the appropriate step counts. For multi-task training, using task-specific step counts ensures accurate averages even when tasks are sampled with different frequencies.
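
For illustration, per-task averaging might look like the following sketch. The names and signature are hypothetical and not the code under review.

```python
# Hypothetical sketch of per-task averaging for multi-task training: each
# task keeps its own step counter, so averages stay correct even when tasks
# are sampled with different frequencies within one display interval.
def average_multitask(
    loss_acc: dict[str, dict[str, float]],    # {task: {metric: running sum}}
    step_count_per_task: dict[str, int],      # {task: steps in this interval}
) -> dict[str, dict[str, float]]:
    averaged: dict[str, dict[str, float]] = {}
    for task, sums in loss_acc.items():
        n = step_count_per_task.get(task, 0)
        if n > 0:  # skip tasks that were not sampled in this interval
            averaged[task] = {metric: val / n for metric, val in sums.items()}
    return averaged
```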


928-933: LGTM! Efficient multi-task loss reporting using accumulated averages.

The change to use log_loss_train for all tasks eliminates redundant loss recomputation and ensures consistent averaged loss reporting across all tasks. This is more efficient than the previous approach.


955-968: LGTM! Comprehensive reset logic for next accumulation interval.

The reset logic properly clears all accumulated values and step counts after displaying results, ensuring clean accumulation for the next interval. Keeping the dictionary structure intact (resetting to 0 rather than deleting keys) is a good practice for performance.

Member

@njzjz njzjz left a comment


Although it seems useful, this is a breaking change that may alter the default behavior. It is better to give users an option.

OutisLi and others added 2 commits July 31, 2025 17:18
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1060-1068: LGTM! Proper initialization of accumulators.

The initialization logic correctly sets up the required data structures for loss accumulation. The use of dict.fromkeys(self.model_keys, 0) for step counters is fine since integers are immutable, though it might be clearer to use a dict comprehension for consistency.

Consider using a dict comprehension for clarity:

-                self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+                self.step_count_per_task = {key: 0 for key in self.model_keys}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27d22ea and 8a9a161.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py (5 hunks)
  • deepmd/utils/argcheck.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-09-19T04:25:12.408Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
🔇 Additional comments (6)
deepmd/utils/argcheck.py (2)

3140-3140: LGTM! Well-documented configuration parameter.

The addition of doc_disp_avg provides clear documentation for the new feature, explaining its purpose for displaying averaged training loss over display intervals.


3217-3219: LGTM! Proper argument configuration.

The disp_avg argument is correctly configured with:

  • Appropriate boolean type
  • Optional parameter with sensible default (False)
  • Clear documentation reference
  • Consistent naming convention

This follows the established patterns in the codebase for configuration arguments.

deepmd/pt/train/training.py (4)

143-143: LGTM!

The initialization of disp_avg parameter follows the established pattern and maintains backward compatibility with the default value of False.


812-838: LGTM! Well-structured loss accumulation logic.

The implementation correctly handles both single-task and multi-task scenarios with proper lazy initialization and step counting. The exclusion of L2 regularization terms from accumulation is appropriate for display purposes.


932-950: LGTM! Note the performance trade-off.

The implementation correctly handles different logging strategies for multi-task training. When disp_avg=True, it efficiently uses accumulated averages for all tasks, avoiding expensive additional forward passes. When disp_avg=False, it maintains the original behavior of computing instantaneous losses for all tasks.


972-985: LGTM! Comprehensive accumulator reset logic.

The reset logic properly handles both single-task and multi-task scenarios, ensuring that accumulators and step counters are reset after each display interval. This is essential for correct averaging behavior.

Collaborator Author

OutisLi commented Jul 31, 2025

Although it seems useful, this is a breaking change that may alter the default behavior. It is better to give users an option.

I added a disp_avg parameter to address this concern; its default is False, which keeps the previous behavior.
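
For example, a user could opt in through the training section of the input script. The snippet below is an illustration written as a Python dict; only disp_avg is the new flag, and the surrounding keys are typical context that may differ per setup.

```python
# Illustrative "training" section of a DeePMD-kit input script.
training_section = {
    "numb_steps": 1_000_000,
    "disp_file": "lcurve.out",   # learning-curve output file
    "disp_freq": 100,            # display interval in training steps
    "disp_avg": True,            # average training loss over each interval (default: False)
    "save_freq": 1000,
}
```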

@njzjz njzjz requested a review from iProzd July 31, 2025 10:27

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 30.30303% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.72%. Comparing base (c41b0dd) to head (fa79d45).
⚠️ Report is 81 commits behind head on devel.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt/train/training.py | 29.23% | 46 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #4850   +/-   ##
=======================================
  Coverage   84.71%   84.72%           
=======================================
  Files         699      699           
  Lines       68147    68131   -16     
  Branches     3541     3541           
=======================================
- Hits        57733    57721   -12     
+ Misses       9279     9276    -3     
+ Partials     1135     1134    -1     

☔ View full report in Codecov by Sentry.

OutisLi and others added 2 commits August 1, 2025 15:09
Co-authored-by: Jinzhe Zeng <njzjz@qq.com>
Signed-off-by: LI TIANCHENG <137472077+OutisLi@users.noreply.github.com>
@njzjz njzjz closed this Aug 1, 2025
@njzjz njzjz reopened this Aug 1, 2025
@njzjz njzjz enabled auto-merge August 1, 2025 17:34
@njzjz njzjz added this pull request to the merge queue Aug 2, 2025
Merged via the queue into deepmodeling:devel with commit 813bbc8 Aug 2, 2025
104 of 106 checks passed
@OutisLi OutisLi deleted the avg_display branch September 17, 2025 08:09
ChiahsinChu pushed a commit to ChiahsinChu/deepmd-kit that referenced this pull request Dec 17, 2025
…sentative logging (deepmodeling#4850)
