
Conversation

Collaborator

@OutisLi OutisLi commented Jul 28, 2025

This pull request modifies the training loop to improve the quality and readability of the reported training loss.

Summary of Changes

Previously, the training loss and associated metrics (e.g., rmse_e_trn) reported in lcurve.out and the console log at each disp_freq step represented the instantaneous value from that single training batch. This could be quite noisy and subject to high variance depending on the specific batch sampled.

This PR introduces an accumulator for the training loss. The key changes are:

  • During each training step, the loss values are accumulated.
  • When a display step is reached, the accumulated values are averaged over the number of steps in that interval.
  • The averaged loss is then reported in the console log and lcurve.out.
  • The accumulators are reset for the next interval.
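
For illustration, a minimal sketch of this accumulate / average / reset pattern is shown below. The class and names are hypothetical; the actual implementation lives in deepmd/pt/train/training.py as Trainer instance variables and differs in detail.

```python
# Minimal sketch of the accumulate / average / reset pattern described above.
# Class and attribute names are hypothetical, for illustration only.
class LossAccumulator:
    """Running sums of per-batch loss terms over one display interval."""

    def __init__(self) -> None:
        self.sums: dict[str, float] = {}
        self.steps: int = 0

    def add(self, batch_loss: dict[str, float]) -> None:
        """Accumulate one training step's loss terms."""
        for key, val in batch_loss.items():
            self.sums[key] = self.sums.get(key, 0.0) + float(val)
        self.steps += 1

    def average_and_reset(self) -> dict[str, float]:
        """Return interval-averaged losses and clear the accumulators."""
        averaged = {key: val / self.steps for key, val in self.sums.items()}
        self.sums.clear()
        self.steps = 0
        return averaged
```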

The validation logic remains unchanged, continuing to provide a periodic snapshot of model performance, which is the standard and efficient approach.

Significance and Benefits

Reporting the averaged training loss provides a much smoother and more representative training curve. The benefits include:

  • Reduced Noise: Smooths out high-frequency, single-batch fluctuations, making it easier to see the true learning trend.
  • Improved Readability: Learning curves plotted from lcurve.out are cleaner and more interpretable.
  • Better Comparability: Simplifies comparison of model performance across different training runs, since the impact of single-batch anomalies is minimized.

A Note on Formatting

Please note that due to automatic code formatters (e.g., black, isort), some minor, purely stylistic changes may appear in the diff that are not directly related to the core logic.

Summary by CodeRabbit

  • New Features

    • Training loss values displayed during training are now averaged over the display interval, providing more stable and representative loss metrics for both single-task and multi-task modes.
    • Added an option to enable or disable averaging of training loss display via a new configuration setting.
  • Improvements

    • Enhanced training loss reporting for improved monitoring and analysis during model training.

@OutisLi OutisLi changed the base branch from master to devel July 28, 2025 10:04
Contributor

coderabbitai bot commented Jul 28, 2025

📝 Walkthrough


The code introduces new mechanisms in the training process to accumulate and average training losses over each display interval, for both single-task and multi-task scenarios. New instance variables are added to the Trainer class to track accumulated losses and step counts, ensuring that loss reporting reflects averaged values over the interval rather than instantaneous values. Additionally, a new boolean argument disp_avg is added to the training configuration to enable this feature.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Loss Accumulation and Averaging Enhancements: deepmd/pt/train/training.py | Added instance variables for loss accumulation and step counting; modified the training loop to accumulate and average losses over display intervals for both single-task and multi-task modes; updated logging to use averaged losses; initialized and reset accumulators appropriately. |
| Training Configuration Argument Update: deepmd/utils/argcheck.py | Added a new optional boolean argument disp_avg to the training configuration arguments, with default False and documentation describing its purpose: controlling the averaging of training losses over display intervals. |
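
As a rough sketch, the new flag can be declared with a dargs Argument roughly as follows; the doc string here is paraphrased, not the exact text added in deepmd/utils/argcheck.py.

```python
from dargs import Argument

# Illustrative only: approximately how the new optional flag is declared.
doc_disp_avg = (
    "Whether to average the training loss over the display interval "
    "instead of reporting the loss of a single batch."
)
disp_avg_arg = Argument(
    "disp_avg", bool, optional=True, default=False, doc=doc_disp_avg
)
```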

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Suggested reviewers

  • caic99
  • njzjz

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4797c00 and fa79d45.

📒 Files selected for processing (1)
  • deepmd/utils/argcheck.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/utils/argcheck.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1041-1050: Initialize accumulators with explicit dict comprehension for clarity.

The initialization logic is correct. Consider using dict comprehension for better readability:

-            self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+            self.step_count_per_task = {key: 0 for key in self.model_keys}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c41b0dd and 27d22ea.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-09-19T04:25:12.408Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
🔇 Additional comments (5)
deepmd/pt/train/training.py (5)

153-158: LGTM! Instance variables properly initialized for loss accumulation.

The new instance variables are appropriately initialized to track loss accumulation and step counts for both single-task and multi-task scenarios.


817-842: LGTM! Loss accumulation logic correctly implemented.

The implementation properly accumulates losses during training steps for both single-task and multi-task scenarios. The filtering of "l2_" items ensures that regularization terms are excluded from the averaged metrics, which is appropriate for training loss reporting.


850-870: LGTM! Averaging logic correctly computes mean losses.

The implementation properly computes average losses by dividing accumulated values by the appropriate step counts. For multi-task training, using task-specific step counts ensures accurate averages even when tasks are sampled with different frequencies.
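
For illustration, per-task averaging might look like the following sketch. The names and signature are hypothetical and not the code under review.

```python
# Hypothetical sketch of per-task averaging for multi-task training: each
# task keeps its own step counter, so averages stay correct even when tasks
# are sampled with different frequencies within one display interval.
def average_multitask(
    loss_acc: dict[str, dict[str, float]],    # {task: {metric: running sum}}
    step_count_per_task: dict[str, int],      # {task: steps in this interval}
) -> dict[str, dict[str, float]]:
    averaged: dict[str, dict[str, float]] = {}
    for task, sums in loss_acc.items():
        n = step_count_per_task.get(task, 0)
        if n > 0:  # skip tasks that were not sampled in this interval
            averaged[task] = {metric: val / n for metric, val in sums.items()}
    return averaged
```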


928-933: LGTM! Efficient multi-task loss reporting using accumulated averages.

The change to use log_loss_train for all tasks eliminates redundant loss recomputation and ensures consistent averaged loss reporting across all tasks. This is more efficient than the previous approach.


955-968: LGTM! Comprehensive reset logic for next accumulation interval.

The reset logic properly clears all accumulated values and step counts after displaying results, ensuring clean accumulation for the next interval. Keeping the dictionary structure intact (resetting to 0 rather than deleting keys) is a good practice for performance.

Member

@njzjz njzjz left a comment


Although it seems useful, this is a breaking change that may alter the default behavior. It is better to give users an option.

OutisLi and others added 2 commits July 31, 2025 17:18
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1060-1068: LGTM! Proper initialization of accumulators.

The initialization logic correctly sets up the required data structures for loss accumulation. The use of dict.fromkeys(self.model_keys, 0) for step counters is fine since integers are immutable, though it might be clearer to use a dict comprehension for consistency.

Consider using a dict comprehension for clarity:

-                self.step_count_per_task = dict.fromkeys(self.model_keys, 0)
+                self.step_count_per_task = {key: 0 for key in self.model_keys}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27d22ea and 8a9a161.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py (5 hunks)
  • deepmd/utils/argcheck.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4226
File: deepmd/dpmodel/atomic_model/base_atomic_model.py:202-202
Timestamp: 2024-10-16T21:49:57.401Z
Learning: When reviewing PRs, avoid making refactor suggestions that are not directly related to the PR's changes. For example, in `deepmd/dpmodel/atomic_model/base_atomic_model.py`, do not suggest simplifying `for kk in ret_dict.keys()` to `for kk in ret_dict` unless it's relevant to the PR.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-09-19T04:25:12.408Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
Learnt from: njzjz
PR: deepmodeling/deepmd-kit#4144
File: source/api_cc/tests/test_deeppot_dpa_pt.cc:166-246
Timestamp: 2024-10-08T15:32:11.479Z
Learning: Refactoring between test classes `TestInferDeepPotDpaPt` and `TestInferDeepPotDpaPtNopbc` is addressed in PR #3905.
🔇 Additional comments (6)
deepmd/utils/argcheck.py (2)

3140-3140: LGTM! Well-documented configuration parameter.

The addition of doc_disp_avg provides clear documentation for the new feature, explaining its purpose for displaying averaged training loss over display intervals.


3217-3219: LGTM! Proper argument configuration.

The disp_avg argument is correctly configured with:

  • Appropriate boolean type
  • Optional parameter with sensible default (False)
  • Clear documentation reference
  • Consistent naming convention

This follows the established patterns in the codebase for configuration arguments.

deepmd/pt/train/training.py (4)

143-143: LGTM!

The initialization of disp_avg parameter follows the established pattern and maintains backward compatibility with the default value of False.


812-838: LGTM! Well-structured loss accumulation logic.

The implementation correctly handles both single-task and multi-task scenarios with proper lazy initialization and step counting. The exclusion of L2 regularization terms from accumulation is appropriate for display purposes.


932-950: LGTM! Note the performance trade-off.

The implementation correctly handles different logging strategies for multi-task training. When disp_avg=True, it efficiently uses accumulated averages for all tasks, avoiding expensive additional forward passes. When disp_avg=False, it maintains the original behavior of computing instantaneous losses for all tasks.


972-985: LGTM! Comprehensive accumulator reset logic.

The reset logic properly handles both single-task and multi-task scenarios, ensuring that accumulators and step counters are reset after each display interval. This is essential for correct averaging behavior.

Collaborator Author

OutisLi commented Jul 31, 2025

Although it seems useful, this is a breaking change that may alter the default behavior. It is better to give users an option.

I added a disp_avg parameter to address this concern; its default is False, which keeps the previous behavior.
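
For example, a user could opt in through the training section of the input script. The snippet below is an illustration written as a Python dict; only disp_avg is the new flag, and the surrounding keys are typical context that may differ per setup.

```python
# Illustrative "training" section of a DeePMD-kit input script.
training_section = {
    "numb_steps": 1_000_000,
    "disp_file": "lcurve.out",   # learning-curve output file
    "disp_freq": 100,            # display interval in training steps
    "disp_avg": True,            # average training loss over each interval (default: False)
    "save_freq": 1000,
}
```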

@njzjz njzjz requested a review from iProzd July 31, 2025 10:27

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 30.30303% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.72%. Comparing base (c41b0dd) to head (fa79d45).
⚠️ Report is 81 commits behind head on devel.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt/train/training.py | 29.23% | 46 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #4850   +/-   ##
=======================================
  Coverage   84.71%   84.72%           
=======================================
  Files         699      699           
  Lines       68147    68131   -16     
  Branches     3541     3541           
=======================================
- Hits        57733    57721   -12     
+ Misses       9279     9276    -3     
+ Partials     1135     1134    -1     

☔ View full report in Codecov by Sentry.

OutisLi and others added 2 commits August 1, 2025 15:09
Co-authored-by: Jinzhe Zeng <njzjz@qq.com>
Signed-off-by: LI TIANCHENG <137472077+OutisLi@users.noreply.github.com>
@njzjz njzjz closed this Aug 1, 2025
@njzjz njzjz reopened this Aug 1, 2025
@njzjz njzjz enabled auto-merge August 1, 2025 17:34
@njzjz njzjz added this pull request to the merge queue Aug 2, 2025
Merged via the queue into deepmodeling:devel with commit 813bbc8 Aug 2, 2025
104 of 106 checks passed
@OutisLi OutisLi deleted the avg_display branch September 17, 2025 08:09
ChiahsinChu pushed a commit to ChiahsinChu/deepmd-kit that referenced this pull request Dec 17, 2025
…sentative logging (deepmodeling#4850)
