
Conversation

@caic99 (Member) commented Mar 17, 2025

This PR adds two new rules for defining batch sizes:
max: unlike auto, max does not round the batch size up. Consider the case where the batch size is set to auto:256 and you have a system with 255 atoms: auto will infer a batch size of 2, while max keeps it at 1 to avoid a potential OOM.
filter: removes any system with more atoms than the specified limit, and caps the total number of atoms in a batch in the same way max does. A sketch of the intended semantics is shown below.
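
For reference, here is a minimal Python sketch of the intended semantics (the function name, signature, and the default of 32 for a bare auto are illustrative, not the actual deepmd-kit implementation):

import math

def infer_batch_sizes(natoms_per_system: list[int], rule: str) -> list[int]:
    """Illustrative sketch: derive one batch size per system from an
    "auto[:N]" / "max:N" / "filter:N" rule string."""
    prefix, _, n = rule.partition(":")
    limit = int(n) if n else 32  # bare "auto" defaults to 32 atoms
    if prefix == "auto":
        # Ceiling: batch_size * natoms >= limit,
        # e.g. a 255-atom system with "auto:256" -> ceil(256 / 255) = 2.
        return [max(1, math.ceil(limit / na)) for na in natoms_per_system]
    if prefix == "max":
        # Floor, clamped to 1: never rounds up, so a 255-atom system
        # with "max:256" stays at batch size 1, avoiding a potential OOM.
        return [max(1, limit // na) for na in natoms_per_system]
    if prefix == "filter":
        # Like "max", but drops systems larger than the limit entirely.
        kept = [na for na in natoms_per_system if na <= limit]
        if not kept:
            raise ValueError("no system is left after filtering")
        return [max(1, limit // na) for na in kept]
    raise ValueError(f"unsupported batch size rule: {rule}")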

I've run out of ideas for what the keywords should be, so please leave a comment if you have a better one.

  • Todo: migrate changes to Paddle backend

Summary by CodeRabbit

  • New Features
    • Enhanced batch size configuration options now allow the use of descriptive prefixes for more flexible control, including "max:N" and "filter:N".
    • Clarified the existing "auto:N" option for batch size configuration.
  • Bug Fixes
    • Improved validation ensures that batch sizes are always at least one.
    • Updated error feedback provides clear messaging for unsupported batch size formats.
  • Tests
    • Introduced a new test suite to validate batch size behavior under various input scenarios, including assertions for different formats and conditions.

Copilot AI review requested due to automatic review settings March 17, 2025 07:18
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces two new rules for defining batch sizes in the data loader to better handle systems with large numbers of atoms.

  • Added support for "max:" and "cap:" keywords to modify batch size rules.
  • For "cap:", systems exceeding the specified atom count are filtered out, and a warning is logged.
Comments suppressed due to low confidence (1)

deepmd/pt/utils/dataloader.py:145

  • [nitpick] Consider renaming variable 'ii' to 'system' for improved readability.
for ii in self.systems:

@coderabbitai bot (Contributor) commented Mar 17, 2025

📝 Walkthrough

The changes enhance the handling of the batch_size parameter in the construct_dataset function within the dataloader.py file. A new variable ceiling is introduced to manage batch size calculations based on the prefixes max:, filter:, and the existing auto:. The logic now includes filtering systems based on atom counts for the filter: prefix, and error handling has been improved to raise a ValueError for unsupported types. Additionally, a new test file is introduced to validate these changes.

Changes

Files and change summary:

  • deepmd/pt/utils/dataloader.py: enhanced handling of the batch_size prefixes max: and filter:; introduced a ceiling variable; modified the batch size index (bsi) calculation; updated error handling to raise a ValueError.
  • source/tests/pt/test_dploaderset.py: added a new test class TestSampler with methods validating batch size handling, including tests for different input scenarios and expected exceptions.
  • doc/train/training-advanced.md: introduced new batch_size options in the training configuration: clarified "auto:N", and added "max:N" and "filter:N" for finer control over dataset selection. A usage sketch follows this list.
  • deepmd/utils/argcheck.py: added the new string options "max:N" and "filter:N" for batch_size in training_data_args.
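
For context, a user opts into these rules through the batch_size key of the training data section in the input script; a minimal illustration (system paths are placeholders):

"training": {
    "training_data": {
        "systems": ["path/to/system1", "path/to/system2"],
        "batch_size": "max:256"
    }
}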

Possibly related PRs

  • refactor: simplify dataset construction #4437: The changes in the main PR, which enhance the handling of the batch_size parameter in the construct_dataset function, are directly related to the modifications in the retrieved PR that also involve the dataloader.py file and the dataset construction process.

Suggested reviewers

  • njzjz
  • CaRoLZhangxy

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between aa07bf1 and bb41847.

📒 Files selected for processing (1)
  • source/tests/pt/test_dploaderset.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • source/tests/pt/test_dploaderset.py
⏰ Context from checks skipped due to timeout of 90000ms (29)


@codecov bot commented Mar 17, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 84.80%. Comparing base (4c8600a) to head (bb41847).
Report is 86 commits behind head on devel.

Files with missing lines:
  • deepmd/pd/utils/dataloader.py: patch coverage 0.00%, 21 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4659      +/-   ##
==========================================
- Coverage   84.80%   84.80%   -0.01%     
==========================================
  Files         692      692              
  Lines       66345    66395      +50     
  Branches     3539     3538       -1     
==========================================
+ Hits        56265    56303      +38     
- Misses       8940     8951      +11     
- Partials     1140     1141       +1     

☔ View full report in Codecov by Sentry.

@njzjz (Member) commented Mar 17, 2025

I've run out of ideas for what the keywords should be, so please leave a comment if you have a better one.

Have you asked DeepSeek?

@caic99 (Member, Author) commented Mar 18, 2025

Have you asked DeepSeek?

It suggested replacing "cap:N" with "filter:N", which makes sense to me. @iProzd Any comments?

@wanghan-iapcm (Collaborator) left a comment


This PR is not covered by unit tests.

@caic99 caic99 marked this pull request as draft March 18, 2025 03:22
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
source/tests/pt/test_dploaderset.py (2)

75-78: Remove debug print statement

The print(cm.output) statement on line 77 appears to be a debug statement that should be removed before merging.

        with self.assertLogs() as cm:
            self.assertRaises(AssertionError, self.get_batch_sizes, "filter:191")
-            print(cm.output)  # DEBUG
            self.assertIn("Remove 1 systems with more than 191 atoms", cm.output[-1])

46-84: Consider adding tests for multiple systems with the filter option

The current tests only verify the filter option with a single system. Since the PR mentions that the filter method excludes systems that contain more atoms than the specified limit, it would be valuable to add tests with multiple systems of varying atom counts to ensure the filtering works correctly across systems.

Consider adding a test case like this:

def test_filter_multiple_systems(self) -> None:
    # Mock multiple systems with different atom counts
    original_systems = self.systems
    try:
        # Set up test with multiple mock systems
        # Implementation depends on how systems are represented
        # This is just a conceptual example
        self.systems = [...] # Multiple systems with varying atom counts
        
        # Test filtering logic
        result = self.get_batch_sizes("filter:200")
        # Assert expected behavior
        
    finally:
        # Restore original systems
        self.systems = original_systems
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 10defcb and 199d9b7.

📒 Files selected for processing (1)
  • source/tests/pt/test_dploaderset.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
🔇 Additional comments (1)
source/tests/pt/test_dploaderset.py (1)

1-87: The test coverage for the new batch size rules looks good

The test cases for the new max and filter batch size rules appear comprehensive and align well with the PR objectives. The test verifies that:

  1. The max method maintains a batch size of 1 when atoms exceed the threshold to prevent OOM issues
  2. The filter method correctly excludes systems that contain more atoms than the specified limit

caic99 and others added 2 commits March 20, 2025 17:07
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Chun Cai <amoycaic@gmail.com>
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
source/tests/pt/test_dploaderset.py (2)

72-79: Clean up debugging print statement.

The test case for the "filter" batch size rule includes a debugging print statement that should be removed.

-            print(cm.output)  # DEBUG

1-88: Consider adding additional test cases for edge scenarios.

The current tests cover the basic functionality but could benefit from additional test cases:

  1. Testing with multiple systems with different atom counts
  2. Edge cases like very large atom thresholds
  3. Behavior when atom counts are exactly at the threshold boundary
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 199d9b7 and 5bf8fff.

📒 Files selected for processing (1)
  • source/tests/pt/test_dploaderset.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
🔇 Additional comments (4)
source/tests/pt/test_dploaderset.py (4)

36-44: The function signature and implementation match the requirements.

The method correctly instantiates a DpLoaderSet and returns the first batch size as expected.


46-64: Good comprehensive testing of different batch size formats.

The test cases cover integer, list, and string-based "auto" batch sizes with various thresholds, verifying that the auto rule correctly applies a ceiling function for determining batch sizes based on the number of atoms.


65-71: Good verification of the new "max" batch size rule.

These tests confirm that the "max" rule correctly implements the behavior described in the PR objectives - not applying a ceiling to prevent potential OOM issues.


80-83: Good error handling validation for unknown batch size rule.

The test correctly validates that an unknown batch size rule raises a ValueError with the appropriate error message.

@caic99 caic99 requested review from Copilot and iProzd March 21, 2025 04:15
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds new configuration options for defining batch sizes, introducing two new rules for handling large systems. The changes include updates to the batch size computation logic in the dataloader and new tests covering int, list, and string-based batch size inputs (including "auto", "max", and "filter" rules).

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • deepmd/pt/utils/dataloader.py: implements the new batch size rules and corresponding logic.
  • source/tests/pt/test_dploaderset.py: adds tests verifying the new batch size behavior for various inputs.
Comments suppressed due to low confidence (1)

deepmd/pt/utils/dataloader.py:130

  • The PR description refers to a 'cap' rule for batch sizes, but the implementation uses 'filter:'. Consider aligning the naming between the documentation and the code for clarity.
elif batch_size.startswith("filter:"):

@njzjz (Member) left a comment


Documentation needs to be updated

@caic99 caic99 requested a review from njzjz March 21, 2025 06:01
@github-actions github-actions bot added the Docs label Mar 21, 2025
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
doc/train/training-advanced.md (1)

109-111: Enhance Documentation Clarity and Consistency for New Batch Size Options

The descriptions for the new "auto:N", "max:N", and "filter:N" options accurately reflect the intended logic per the PR objectives:

  • "auto:N" correctly specifies that the batch size times the number of atoms is no less than N.
  • "max:N" clearly indicates that the product is no more than N with a safeguard of a minimum batch size of 1.
  • "filter:N" conveys that it functions like "max:N" but excludes systems exceeding the specified atom count and raises an error if no system remains.

Suggestions:

  • Consider revising the phrasing of the "max:N" and "filter:N" lines to ensure parallel structure and consistent punctuation—for example, starting each bullet point with "Automatically determines..." may improve readability.
  • Address the static analysis hints regarding loose punctuation around these lines by reviewing the sentence endings for clarity.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~110-~110: Loose punctuation mark.
Context: ...m is no less than N. - "max:N": automatically determines the batch size...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~111-~111: Loose punctuation mark.
Context: ...inimum batch size is 1. - "filter:N": the same as "max:N" but removes the s...

(UNLIKELY_OPENING_PUNCTUATION)

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 5fa3582 and b43a4ab.

📒 Files selected for processing (1)
  • doc/train/training-advanced.md (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (26)

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
doc/train/training-advanced.md (1)

109-111: Clarify Batch Size Option Descriptions

The descriptions for the "auto:N", "max:N", and "filter:N" options are clear and align well with the PR objectives. A couple of nitpicky suggestions to improve clarity and consistency:

  • For "max:N", consider adding a comma to separate the clause—for example:
    Before:
    "max:N": automatically determines the batch size so that ... is no more than N. The minimum batch size is 1.
    After:
    "max:N": automatically determines the batch size so that ... is no more than N, with a minimum batch size of 1.
  • For "filter:N", you might rephrase slightly for consistency. For example:
    Before:
    "filter:N": the same as "max:N" but removes the systems with the number of atoms larger than N from the data set. Throws an error if no system is left in a dataset.
    After:
    "filter:N": similar to "max:N", but it also removes systems whose atom count exceeds N from the dataset, throwing an error if no systems remain.

These adjustments address the minor loose punctuation issues flagged by static analysis.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~110-~110: Loose punctuation mark.
Context: ...m is no less than N. - "max:N": automatically determines the batch size...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~111-~111: Loose punctuation mark.
Context: ...inimum batch size is 1. - "filter:N": the same as "max:N" but removes the s...

(UNLIKELY_OPENING_PUNCTUATION)

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 5fa3582 and b43a4ab.

📒 Files selected for processing (1)
  • doc/train/training-advanced.md (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (25)

@njzjz (Member) left a comment


Please update here

doc_batch_size = f'This key can be \n\n\
- list: the length of which is the same as the {link_sys}. The batch size of each system is given by the elements of the list.\n\n\
- int: all {link_sys} use the same batch size.\n\n\
- string "auto": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than 32.\n\n\
- string "auto:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than N.\n\n\
- string "mixed:N": the batch data will be sampled from all systems and merged into a mixed system with the batch size N. Only support the se_atten descriptor for TensorFlow backend.\n\n\
If MPI is used, the value should be considered as the batch size per task.'
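
For illustration, the two new entries could be appended along these lines (wording adapted from the reviewed doc/train/training-advanced.md text; the merged docstring may differ):

- string "max:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no more than N. The minimum batch size is 1.\n\n\
- string "filter:N": the same as "max:N", but removes from the data set the systems with more than N atoms. Throws an error if no system is left.\n\n\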

@caic99 caic99 requested review from njzjz March 21, 2025 06:57
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pd/utils/dataloader.py (1)

127-164: Consider adding code comments explaining the ceiling behavior.

While the implementation of the ceiling variable and its usage is correct, adding a brief comment explaining the difference between ceiling=True (auto) and ceiling=False (max/filter) would improve code maintainability. This would help future developers understand the impact of this variable on batch size calculations.

 if batch_size == "auto":
     rule = 32
     ceiling = True
+    # When ceiling=True, we ensure rule/natoms rounds up to fit exactly rule atoms
 elif batch_size.startswith("auto:"):
     rule = int(batch_size.split(":")[1])
     ceiling = True
 elif batch_size.startswith("max:"):
     rule = int(batch_size.split(":")[1])
     ceiling = False
+    # When ceiling=False, we don't round up and use at most rule atoms per batch
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8a5fc80 and 281959e.

📒 Files selected for processing (1)
  • deepmd/pd/utils/dataloader.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
🔇 Additional comments (7)
deepmd/pd/utils/dataloader.py (7)

129-132: Implementation of ceiling variable looks good.

The code now explicitly sets ceiling = True for both auto and auto:N batch size rules, which helps clarify the behavior and sets the foundation for the new methods.


133-136: New max method implementation is well-designed.

The max: prefix implementation correctly sets ceiling = False to prevent applying a ceiling to the batch size, which aligns with the PR objective of not increasing batch sizes to prevent OOM issues.


136-153: Filter implementation provides good safeguards.

The implementation for the filter: method is thorough and includes:

  • Clear system filtering based on atom count
  • Appropriate warnings when systems are removed
  • Error handling when no systems remain after filtering
  • Setting ceiling = False consistent with the max: method

This will effectively prevent OOM issues by removing systems that exceed the atom count threshold.


154-154: Improved error message for unsupported batch size rules.

The error message now clearly indicates what type of batch size rule is unsupported, which will help users troubleshoot configuration issues.


158-164: Batch size calculation logic handles both ceiling modes correctly.

The conditional logic for calculating the batch size index (bsi) now properly handles both ceiling behaviors:

  • When ceiling = True: Increments bsi if there's a remainder (original behavior)
  • When ceiling = False: Ensures bsi is at least 1 to handle cases where rule < natoms

This implementation aligns with the intended behavior differences between auto and the new max/filter methods.
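
A hedged sketch of what that conditional might look like (variable names follow the review text; the exact deepmd-kit code may differ):

bsi = rule // natoms              # base batch size for this system
if ceiling and bsi * natoms < rule:
    bsi += 1                      # auto: round up so that bsi * natoms >= rule
bsi = max(bsi, 1)                 # max/filter: at least 1 even when natoms > rule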


136-153: ✅ Verification successful

Verify filter behavior with unit tests.

While the implementation for the filter method is robust, it would be valuable to ensure it's covered by unit tests that verify:

  1. Systems exceeding the atom count are properly filtered
  2. Warnings are generated when systems are removed
  3. Errors are raised when no systems remain

🏁 Script executed:

#!/bin/bash
# Check if unit tests exist for the new filter functionality

# Look for test files that might be testing the batch size functionality
fd -t f "test.*dataloader" -E "*.pyc"

# Search for tests specifically related to the new batch size methods
rg -i "test.*batch.*size|test.*filter|test.*max.*batch" --type py

Length of output: 1423


Filter functionality verified via existing tests—please confirm complete edge-case coverage.

The current unit tests in the repository (e.g., in
source/tests/tf/test_deepmd_data_sys.py (e.g., test_batch_size_5 and test_batch_size_raise),
source/tests/pt/test_dploaderset.py (where comments indicate testing batch_size values like "filter")
show that filtering behavior is exercised. These tests should verify that:

  • Systems with atom counts exceeding the specified threshold are properly removed.
  • A warning is logged when one or more systems are filtered out.
  • A ValueError is raised when all systems are removed through filtering.

Please confirm that each of these scenarios is covered by dedicated assertions—especially checking the warning logs when systems are removed. If any of these aspects aren’t thoroughly verified, additional test cases might be needed.
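
As a self-contained illustration of those assertions, here is a sketch against a local stub (filter_systems is hypothetical and stands in for the dataloader's filtering logic; the warning message mirrors the one quoted earlier in this review):

import logging
import unittest

def filter_systems(natoms: list[int], limit: int) -> list[int]:
    # Hypothetical stub mirroring the described "filter:N" behavior.
    kept = [na for na in natoms if na <= limit]
    removed = len(natoms) - len(kept)
    if removed:
        logging.warning("Remove %d systems with more than %d atoms", removed, limit)
    if not kept:
        raise ValueError("no system is left after filtering")
    return kept

class TestFilterRule(unittest.TestCase):
    def test_warning_logged_when_systems_removed(self) -> None:
        with self.assertLogs(level="WARNING") as cm:
            self.assertEqual(filter_systems([100, 300], 200), [100])
        self.assertIn("Remove 1 systems with more than 200 atoms", cm.output[-1])

    def test_error_when_no_system_left(self) -> None:
        with self.assertRaises(ValueError):
            filter_systems([300], 200)

if __name__ == "__main__":
    unittest.main()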


133-136: ❓ Verification inconclusive

Ensure compatibility with Paddle backend.

As mentioned in the PR objectives, there's a plan to migrate these changes to the Paddle backend. Since this file is already in the Paddle backend (deepmd/pd/utils/dataloader.py), verify that equivalent changes are made to any similar files in other backends for consistency.


🏁 Script executed:

#!/bin/bash
# Check if similar dataloader implementations exist that might need the same changes

# Look for other dataloader.py files
fd -t f "dataloader.py"

# Specifically check for similar batching logic in other backends
rg "batch_size.*startswith.*auto" --type py

Length of output: 287


Attention: Verify Cross-Backend Consistency for Batch Size Handling

The updated batching logic in deepmd/pd/utils/dataloader.py now processes "max:" and "filter:" prefixes as expected for the Paddle backend. However, our repository search shows that the PyTorch backend file (deepmd/pt/utils/dataloader.py) only contains logic for the "auto:" prefix. Please verify that:

  • If the modifications for "max:" and "filter:" are intended to be exclusive to the Paddle backend (in line with the migration objectives), then it’s acceptable that the PyTorch backend retains its current behavior.
  • If a unified behavior across backends is desired, similar changes should be applied to deepmd/pt/utils/dataloader.py to handle "max:" and "filter:" cases consistently.

Ensure that the final design decision and documentation are updated accordingly.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
doc/train/training-advanced.md (2)

110-110: Review "max:N" Documentation Clarity
The new "max:N" option is well-described: it specifies that the batch size should be determined so that the product with the number of atoms is no more than N, with a minimum batch size of 1 and clearly lists the supported backends. Consider revising the punctuation slightly (for example, inserting a comma or semicolon after the condition) to enhance readability.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~110-~110: Loose punctuation mark.
Context: ...m is no less than N. - "max:N": automatically determines the batch size...

(UNLIKELY_OPENING_PUNCTUATION)


111-111: Review "filter:N" Documentation Details
The "filter:N" option is clearly explained—it behaves like "max:N" but also filters out systems whose atom counts exceed N, raising an error if no system remains. For added clarity, you might consider specifying which exception is raised or elaborating on the error behavior.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 281959e and aa07bf1.

📒 Files selected for processing (1)
  • doc/train/training-advanced.md (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (24)
🔇 Additional comments (1)
doc/train/training-advanced.md (1)

109-109: Clarify "auto:N" Documentation
The updated description for "auto:N" clearly states that the batch size is determined such that the product of the batch size and the number of atoms in a system is no less than N. This wording is consistent with previous documentation.

@caic99 caic99 marked this pull request as ready for review March 24, 2025 03:11
@iProzd iProzd enabled auto-merge March 26, 2025 04:30
@iProzd iProzd added this pull request to the merge queue Mar 26, 2025
Merged via the queue into deepmodeling:devel with commit 2e4da81 Mar 26, 2025
60 checks passed
@caic99 caic99 deleted the feat-bs-rules branch March 26, 2025 07:28
iProzd pushed a commit to iProzd/deepmd-kit that referenced this pull request Jun 30, 2025
iProzd pushed a commit to iProzd/deepmd-kit that referenced this pull request Jun 30, 2025