-
Notifications
You must be signed in to change notification settings - Fork 584
Feat:support customized rglob #4763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat:support customized rglob #4763
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces support for customized recursive globbing to search system directories based on user-specified patterns.
- Added a new function, rglob_sys_str, to recursively search directories using provided glob patterns.
- Updated process_systems to accept an optional patterns parameter and modified the main entrypoint to pass custom rglob patterns.
- Adjusted relevant imports and function calls to incorporate the new behavior.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deepmd/utils/data_system.py | Updated process_systems to accept custom patterns and invoke rglob_sys_str when provided. |
| deepmd/pt/entrypoints/main.py | Modified training/validation system processing to pass rglob_patterns to process_systems. |
| deepmd/common.py | Added rglob_sys_str to support customized recursive globbing based on given patterns. |
Comments suppressed due to low confidence (1)
deepmd/utils/data_system.py:753
- Consider clarifying in the docstring that the 'patterns' parameter only affects cases where 'systems' is a single directory string, as list inputs will bypass custom pattern filtering.
if isinstance(systems, str):
for more information, see https://pre-commit.ci
📝 WalkthroughWalkthroughA new function for recursive directory searching with glob patterns was added. The system processing logic was updated to optionally use this new function when patterns are provided. The main training entrypoint now retrieves and passes these patterns from dataset parameters, enabling pattern-based filtering for training and validation system selection. Argument checking functions were extended to accept these new pattern parameters. A new test class was added to verify the customized rglob pattern functionality. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant MainEntrypoint
participant DataSystem
participant Common
User->>MainEntrypoint: Initiate training/validation
MainEntrypoint->>MainEntrypoint: Retrieve rglob_patterns from dataset params
MainEntrypoint->>DataSystem: process_systems(systems, patterns)
alt patterns provided
DataSystem->>Common: rglob_sys_str(systems, patterns)
Common-->>DataSystem: List of matching system paths
else patterns not provided
DataSystem->>Common: expand_sys_str(systems)
Common-->>DataSystem: List of system paths
end
DataSystem-->>MainEntrypoint: Processed system list
MainEntrypoint-->>User: Continue with training/validation
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Pylint (3.3.7)source/tests/pt/test_training.pyNo files to lint: exiting. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (22)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
deepmd/common.py (1)
210-231:⚠️ Potential issueAddress type annotation inconsistency and implement deduplication.
Several issues identified in the new function:
- Type annotation mismatch: The docstring mentions
root_dir : str, Pathbut the parameter type annotation only showsstr.- Missing deduplication: As noted in the past review comment, multiple patterns could match the same directory, leading to duplicate entries.
- No input validation: The function doesn't validate that
root_direxists or thatpatternsis not empty.Apply this diff to fix the issues:
-def rglob_sys_str(root_dir: str, patterns: list[str]) -> list[str]: +def rglob_sys_str(root_dir: Union[str, Path], patterns: list[str]) -> list[str]: """Recursively iterate over directories taking those that contain `type.raw` file. Parameters ---------- - root_dir : str, Path + root_dir : Union[str, Path] starting directory patterns : list[str] list of glob patterns to match directories Returns ------- list[str] list of string pointing to system directories """ + if not patterns: + return [] root_dir = Path(root_dir) + if not root_dir.exists(): + raise FileNotFoundError(f"Root directory {root_dir} does not exist") matches = [] for pattern in patterns: matches.extend( [str(d) for d in root_dir.rglob(pattern) if (d / "type.raw").is_file()] ) - return matches + return list(set(matches))
🧹 Nitpick comments (2)
deepmd/pt/entrypoints/main.py (1)
117-121: Fix inconsistent parameter passing style.There's an inconsistency in how patterns are passed to
process_systems:
- Line 118 uses keyword argument:
patterns=trn_patterns- Line 121 uses positional argument:
val_patternsApply this diff for consistency:
if validation_systems is not None: val_patterns = validation_dataset_params.get("rglob_patterns", None) - validation_systems = process_systems(validation_systems, val_patterns) + validation_systems = process_systems(validation_systems, patterns=val_patterns)deepmd/utils/data_system.py (1)
734-758: Well-implemented integration with new rglob functionality.The implementation is excellent:
- Backward compatibility: The optional
patternsparameter maintains backward compatibility- Clear logic: The conditional branching between
rglob_sys_strandexpand_sys_stris intuitive- Updated documentation: The docstring correctly reflects the new parameter
Consider adding input validation for the patterns parameter:
if isinstance(systems, str): if patterns is None: systems = expand_sys_str(systems) else: + if not isinstance(patterns, list) or not patterns: + raise ValueError("patterns must be a non-empty list of strings") systems = rglob_sys_str(systems, patterns)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
deepmd/common.py(1 hunks)deepmd/pt/entrypoints/main.py(1 hunks)deepmd/utils/data_system.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
rglob_sys_str(210-231)expand_sys_str(190-207)
deepmd/common.py (1)
deepmd/utils/path.py (6)
rglob(94-107)rglob(228-242)rglob(395-409)is_file(110-111)is_file(244-246)is_file(432-436)
⏰ Context from checks skipped due to timeout of 90000ms (27)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test C++ (false)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (c-cpp)
🔇 Additional comments (1)
deepmd/utils/data_system.py (1)
20-20: LGTM!The import addition is correctly placed and necessary for the new functionality.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## devel #4763 +/- ##
=======================================
Coverage 84.79% 84.80%
=======================================
Files 698 698
Lines 67775 67786 +11
Branches 3544 3542 -2
=======================================
+ Hits 57472 57484 +12
+ Misses 9169 9167 -2
- Partials 1134 1135 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds customizable recursive globbing for dataset directories, enabling users to specify patterns when collecting training and validation systems.
- Introduce
rglob_sys_strincommon.pyfor pattern-based discovery - Update
process_systemsto accept optional pattern list - Expose
rglob_patternsin argument checks and pass through entrypoints
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| deepmd/common.py | New rglob_sys_str function to match directories by glob list |
| deepmd/utils/data_system.py | Updated process_systems to take patterns and call rglob |
| deepmd/utils/argcheck.py | Added rglob_patterns Argument to training/validation schemas |
| deepmd/pt/entrypoints/main.py | Pass rglob_patterns into process_systems for both datasets |
Comments suppressed due to low confidence (5)
deepmd/common.py:210
- The new function
rglob_sys_strcurrently has no associated unit tests; consider adding tests to cover pattern matching and duplicate removal.
def rglob_sys_str(root_dir: str, patterns: list[str]) -> list[str]:
deepmd/utils/data_system.py:734
- The signature uses
Optional[list[str]]forpatternsbutOptionalis not imported; please addfrom typing import Optional.
def process_systems(
deepmd/common.py:210
- The parameter
root_diris annotated asstrbut docstring mentionsPath; consider usingUnion[str, Path]for consistency.
def rglob_sys_str(root_dir: str, patterns: list[str]) -> list[str]:
deepmd/common.py:231
- [nitpick] Converting to a set and back loses original ordering; consider deduplicating into a set and returning a sorted list for deterministic behavior.
return list(set(matches)) # remove duplicates
deepmd/pt/entrypoints/main.py:121
- [nitpick] Use a keyword argument (
patterns=val_patterns) when callingprocess_systemsto match the earlier call and improve readability.
validation_systems = process_systems(validation_systems, val_patterns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data_system.py (1)
788-789: Apply minor style improvement suggested by static analysis.The functionality is correct, but there's a minor style improvement that can be applied.
- rglob_patterns = jdata.get("rglob_patterns", None) + rglob_patterns = jdata.get("rglob_patterns")Since
dict.get()returnsNoneby default when the key doesn't exist, explicitly passingNoneas the default value is redundant.🧰 Tools
🪛 Ruff (0.11.9)
788-788: Use
jdata.get("rglob_patterns")instead ofjdata.get("rglob_patterns", None)Replace
jdata.get("rglob_patterns", None)withjdata.get("rglob_patterns")(SIM910)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deepmd/utils/data_system.py(4 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
deepmd/utils/data_system.py
788-788: Use jdata.get("rglob_patterns") instead of jdata.get("rglob_patterns", None)
Replace jdata.get("rglob_patterns", None) with jdata.get("rglob_patterns")
(SIM910)
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Analyze (python)
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C++ (cpu, cpu)
🔇 Additional comments (2)
deepmd/utils/data_system.py (2)
734-761: LGTM: Well-designed function enhancement with backward compatibility.The
process_systemsfunction enhancement correctly:
- Maintains backward compatibility when
patternsisNone- Adds the new functionality when patterns are provided
- Uses appropriate type hints and documentation
- Follows a clear conditional logic flow
20-20:Details
✅ Verification successful
LGTM: Import addition for new functionality.
The import of
rglob_sys_stris correctly added to support the new customizable rglob functionality.However, let me verify that this function exists in the expected location:
🏁 Script executed:
#!/bin/bash # Description: Verify that rglob_sys_str function exists in deepmd.common # Expected: Function definition should be found ast-grep --pattern $'def rglob_sys_str($$$) { $$$ }'Length of output: 55
To ensure we’re importing a valid symbol, let’s search the repository for its definition:
🏁 Script executed:
#!/bin/bash # Search for the definition of rglob_sys_str in the codebase rg -n "def rglob_sys_str" .Length of output: 120
Import Verified:
rglob_sys_strIs Defined
Confirmed thatrglob_sys_stris implemented at line 210 indeepmd/common.py. The import indeepmd/utils/data_system.pyis correct—no further changes needed. LGTM.
njzjz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR lacks UT
What do you think would be the best way to handle the tests? A separate UT for the individual function or an end-to-end integration test in PT test training? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deepmd/utils/argcheck.py(4 hunks)source/tests/pt/test_training.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/utils/argcheck.py
⏰ Context from checks skipped due to timeout of 90000ms (22)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Analyze (c-cpp)
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for customized recursive globbing (rglob) to enable flexible dataset construction by allowing users to specify custom glob patterns for training and validation data directories.
- Introduced new parameters ("rglob_patterns") in data processing functions and argument configurations.
- Added a new helper function (rglob_sys_str) in the common module.
- Updated the main entry and tests to utilize the customizable rglob functionality.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| source/tests/pt/test_training.py | Added a test class for verifying customized rglob functionality. |
| deepmd/utils/data_system.py | Updated process_systems to accept custom glob patterns for system search. |
| deepmd/utils/argcheck.py | Added new argument definitions for rglob_patterns in training and validation. |
| deepmd/pt/entrypoints/main.py | Modified system processing calls to include custom glob patterns. |
| deepmd/common.py | Added the helper function rglob_sys_str for globbing with custom patterns. |
Currently, when passing a single str as the root data directory, the
expand_sys_strfunction will automatically performrglobto grab all systems. However, this depends on the structure of the data folder. There are scenarios where train/val folders are nested, i.e. "root/dataset_*/trn" & "root/dataset_*/val".A customizable rglob function is needed to provide more flexibility when constructing datasets, and to remove unnecessarily long data lists in the input file.
Summary by CodeRabbit