Skip to content

Conversation

@OutisLi
Copy link
Collaborator

@OutisLi OutisLi commented Nov 10, 2025

Bug fix when type num in training system is less than type map. Fix issue #5042 @

Summary by CodeRabbit

  • Refactor
    • Optimized internal atom-type mapping to make data import and processing more efficient.
    • Maintains identical behavior and output formats; no changes to public interfaces or exported declarations.
    • Improves performance and observability during data handling without altering results or user-facing functionality.

Copilot AI review requested due to automatic review settings November 10, 2025 07:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR reverts a previous optimization that used vectorized NumPy operations for remapping atom types to a simpler list comprehension approach.

Key Changes:

  • Replaces vectorized NumPy array indexing with a list comprehension-based approach for remapping atom types when a global type_map is provided

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Replace loop-based index remapping in non-mixed_type path with a vectorized name-based lookup: build a NumPy int32 array from [type_map.index(name) for name in self.type_map] and remap self.atom_type via that array. Behavior and self.atom_type dtype remain the same; mixed_type path unchanged.

Changes

Cohort / File(s) Summary
Type mapping refactor
deepmd/utils/data.py
Replace per-element index-based mapping with a vectorized name-based lookup: construct a NumPy int32 array from [type_map.index(name) for name in self.type_map] and use it to remap self.atom_type. No changes to mixed_type branch or downstream control flow.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single-file, localized refactor with functional equivalence.
  • Review attention:
    • Verify the constructed NumPy array correctly reflects indices for all entries in self.type_map.
    • Confirm self.atom_type dtype remains int32 and mapping preserves values for edge cases (e.g., duplicate names).

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: optimize atom type mapping' accurately reflects the main change: a vectorized optimization of atom type mapping logic in deepmd/utils/data.py.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

107-108: Consider vectorizing with NumPy for efficiency.

The list comprehensions work correctly but could be replaced with NumPy vectorization for better performance, though this is a one-time initialization operation.

Example vectorized approach:

atom_names = np.array(self.type_map)[self.atom_type]
atom_type_ = np.array([global_type_dict[name] for name in atom_names])

Or fully vectorized:

# Create a mapping array from local to global indices
type_mapping = np.array([global_type_dict[name] for name in self.type_map])
self.atom_type = type_mapping[self.atom_type].astype(np.int32)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f25e16 and 96da487.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (1)
deepmd/utils/data.py (1)

106-109: Logic change correctly addresses the PR objective.

The dictionary-based name lookup approach properly handles the case where the training system has fewer types than the global type_map. The validation at lines 100-104 ensures all local type names exist in the global type_map, making this implementation safe.

@codecov
Copy link

codecov bot commented Nov 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.19%. Comparing base (7f25e16) to head (ff8aee9).
⚠️ Report is 1 commits behind head on devel.

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5043      +/-   ##
==========================================
- Coverage   84.19%   84.19%   -0.01%     
==========================================
  Files         709      709              
  Lines       70216    70213       -3     
  Branches     3621     3620       -1     
==========================================
- Hits        59116    59113       -3     
  Misses       9933     9933              
  Partials     1167     1167              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

106-109: Excellent performance improvement with vectorized remapping.

The vectorized approach correctly builds a mapping array and applies it using NumPy indexing, which is significantly faster than the previous loop-based implementation for systems with many atoms. This addresses the performance concern raised in the earlier review.

Optional refinement: The .astype(np.int32) at the end of line 109 is redundant—indexing old_to_new_type_idx (which is already dtype=np.int32) returns an int32 array, so the explicit cast is unnecessary.

Optional: Remove redundant type cast
                old_to_new_type_idx = np.array(
                    [type_map.index(name) for name in self.type_map], dtype=np.int32
                )
-                self.atom_type = old_to_new_type_idx[self.atom_type].astype(np.int32)
+                self.atom_type = old_to_new_type_idx[self.atom_type]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96da487 and ff8aee9.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)

@njzjz njzjz linked an issue Nov 11, 2025 that may be closed by this pull request
@njzjz njzjz added this pull request to the merge queue Nov 11, 2025
Merged via the queue into deepmodeling:devel with commit 4ec0437 Nov 11, 2025
60 checks passed
iProzd pushed a commit to iProzd/deepmd-kit that referenced this pull request Nov 11, 2025
Bug fix when type num in training system is less than type map. Fix
issue deepmodeling#5042 @

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Optimized internal atom-type mapping to make data import and
processing more efficient.
* Maintains identical behavior and output formats; no changes to public
interfaces or exported declarations.
* Improves performance and observability during data handling without
altering results or user-facing functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@OutisLi OutisLi deleted the pr/loadfix branch November 13, 2025 02:07
ChiahsinChu pushed a commit to ChiahsinChu/deepmd-kit that referenced this pull request Dec 17, 2025
Bug fix when type num in training system is less than type map. Fix
issue deepmodeling#5042 @

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Optimized internal atom-type mapping to make data import and
processing more efficient.
* Maintains identical behavior and output formats; no changes to public
interfaces or exported declarations.
* Improves performance and observability during data handling without
altering results or user-facing functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Dataset loading IndexError raised in new devel branch

3 participants