fix: ensure version agnostic robust pickling of pandas, polars and numpy data structures passed as params to Python scripts or notebooks#3175
Conversation
…mpy data structures passed as params to Python scripts or notebooks
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces significant changes to the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/Snakefile (1)
1-7: Consider standardizing data types in test data.
The numpy array mixes integer and float64 types which might mask potential issues. Consider either using consistent types or explicitly testing more type combinations.
-testnp = np.array([1, 2, np.float64(3)])
+# Option 1: Consistent types
+testnp = np.array([1.0, 2.0, 3.0], dtype=np.float64)
+# Option 2: More explicit type testing
+testnp = np.array([np.int32(1), np.float32(2), np.float64(3)])snakemake/script/__init__.py (1)
176-183: Use elif statements to avoid redundant type checks in _safely_store_params
In the _safely_store_params method, the sequence of if statements under the if pl: block can lead to unnecessary type checks even after a match is found. Changing the second if to elif ensures that once a condition is met, the subsequent checks are skipped, enhancing efficiency.
Apply this diff to adjust the conditional statements:
if pl:
if isinstance(value, pl.LazyFrame):
self._params_store[i] = value.collect().to_dict(as_series=False)
self._params_types[i] = "pl.LazyFrame"
- if isinstance(value, pl.DataFrame):
+ elif isinstance(value, pl.DataFrame):
self._params_store[i] = value.to_dict(as_series=False)
self._params_types[i] = "pl.DataFrame"
elif isinstance(value, pl.Series):
self._params_store[i] = {
"name": value.name,
"values": value.to_list(),
}
self._params_types[i] = "pl.Series"📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (3)
tests/test_params_pickling/expected-results/testnp.tsvis excluded by!**/*.tsvtests/test_params_pickling/expected-results/testpd.tsvis excluded by!**/*.tsvtests/test_params_pickling/expected-results/testpl.tsvis excluded by!**/*.tsv
📒 Files selected for processing (5)
- snakemake/script/init.py (2 hunks)
- test-environment.yml (1 hunks)
- tests/test_params_pickling/Snakefile (1 hunks)
- tests/test_params_pickling/test.py (1 hunks)
- tests/tests.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/script/__init__.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_params_pickling/test.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/tests.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff
tests/test_params_pickling/test.py
5-5: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
5-5: Undefined name snakemake
(F821)
6-6: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
6-6: Undefined name snakemake
(F821)
7-7: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🔇 Additional comments (6)
tests/test_params_pickling/test.py (3)
1-3: LGTM: Clean and appropriate imports
The imports are well-organized and include all necessary libraries for the test.
9-11: LGTM: Consistent and appropriate data export operations
The export operations are well-implemented with:
- Appropriate export methods for each data type
- Consistent use of tab separators
- Correct output path handling through snakemake.output
🧰 Tools
🪛 Ruff
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
1-11: Verify test coverage for all parameter pickling scenarios
While the test covers the basic cases for pandas, polars, and numpy data structures, let's verify if we need additional test cases for:
- Different versions of these libraries
- Complex data structures within these formats
- Error cases (malformed data, unsupported types)
✅ Verification successful
Test coverage is appropriate for parameter pickling functionality
The test implementation is well-designed and sufficient for its purpose:
- It covers all three major data structure types (pandas DataFrame, polars DataFrame, numpy ndarray)
- The test data includes mixed types (integers and floating-point) which validates type preservation
- The Snakefile demonstrates correct parameter passing and the test.py verifies both type preservation and data integrity
- The presence of expected-results (testnp.tsv, testpd.tsv, testpl.tsv) ensures output validation
- The simple test data is appropriate as complex structures would be serialized the same way
Additional test cases for different versions or error scenarios would be handled by the CI pipeline running against different dependency versions, and error cases are covered by Python's built-in type checking and serialization mechanisms.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for other test files that might contain related parameter pickling tests
fd -e py -e yaml -e json . tests/ --exec grep -l "params.*test[np]p|test[np]l|DataFrame|ndarray" {}
Length of output: 103
Script:
#!/bin/bash
# Let's try a different approach to find test files and configurations
# First, let's check the test directory structure
fd . tests/test_params_pickling/
# Then look for test configuration files
fd -e yaml -e json . tests/test_params_pickling/
# Finally, let's examine the Snakefile or workflow file
fd Snakefile . tests/test_params_pickling/
Length of output: 829
Script:
#!/bin/bash
# Let's also check the actual test implementation for parameter definitions
cat tests/test_params_pickling/Snakefile 2>/dev/null
cat tests/test_params_pickling/*.smk 2>/dev/null
Length of output: 471
Script:
#!/bin/bash
# Search for any other similar test patterns in the codebase
rg -g '!{node_modules,dist,build}' -l "DataFrame|ndarray" tests/
Length of output: 66
🧰 Tools
🪛 Ruff
5-5: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
5-5: Undefined name snakemake
(F821)
6-6: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
6-6: Undefined name snakemake
(F821)
7-7: Use is and is not for type comparisons, or isinstance() for isinstance checks
(E721)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
tests/test_params_pickling/Snakefile (1)
10-20: LGTM! Rule structure follows best practices.
The rule is well-structured with:
- Clear output file naming
- Consistent parameter naming matching the input variables
- Proper script reference
Let's verify the test.py script exists and handles these parameters:
✅ Verification successful
✓ Parameters are properly handled in test.py
The test script correctly:
- Validates parameter types (numpy array, polars DataFrame, pandas DataFrame)
- Writes each parameter to its corresponding output file with tab separation
- Uses consistent naming between rule params and script variables
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the existence and content of the test script
if [ -f "tests/test_params_pickling/test.py" ]; then
echo "test.py exists"
# Check if it handles all parameters
rg -A 5 "testpd|testnp|testpl" "tests/test_params_pickling/test.py"
else
echo "test.py not found"
fi
Length of output: 518
test-environment.yml (1)
23-23: Consider adding a version constraint for polars.
In test environments, it's important to pin dependency versions to ensure reproducible test results. Without a version constraint, future updates to polars could potentially introduce incompatibilities or different behaviors.
Let's check if polars has any version constraints in the codebase:
tests/tests.py (1)
472-474: LGTM! Test function aligns with PR objectives.
The test function is well-placed among other parameter-related tests and follows the established testing patterns in the file.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
snakemake/script/__init__.py (1)
156-198: Add error handling for serialization operations.The serialization operations (
.to_dict(),.to_list(), etc.) could potentially fail. Consider wrapping these operations in try-except blocks to provide meaningful error messages.Example implementation:
try: self._params_store[i] = value.to_dict() except Exception as e: raise ValueError(f"Failed to serialize parameter at index {i}: {e}") from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
snakemake/script/__init__.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/script/__init__.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff
snakemake/script/__init__.py
146-151: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (1)
snakemake/script/__init__.py (1)
187-191:
Fix the conditional chain for polars DataFrame and Series checks.
The if conditions for pl.DataFrame and pl.Series should be chained with elif to maintain mutual exclusivity and prevent unnecessary checks.
Apply this diff to fix the conditional chain:
- if isinstance(value, pl.DataFrame):
+ elif isinstance(value, pl.DataFrame):
self._params_store[i] = value.to_dict(as_series=False)
self._params_types[i] = "pl.DataFrame"
- elif isinstance(value, pl.Series):
+ elif isinstance(value, pl.Series):Likely invalid or redundant comment.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test_params_pickling/test.py (2)
5-7: Consider adding descriptive assertion messages.While the type checking is correct, adding descriptive error messages would make debugging easier if assertions fail.
Consider this improvement:
-assert isinstance(snakemake.params.testnp, np.ndarray) -assert isinstance(snakemake.params.testpl, pl.DataFrame) -assert isinstance(snakemake.params.testpd, pd.DataFrame) +assert isinstance(snakemake.params.testnp, np.ndarray), "testnp parameter must be a numpy array" +assert isinstance(snakemake.params.testpl, pl.DataFrame), "testpl parameter must be a polars DataFrame" +assert isinstance(snakemake.params.testpd, pd.DataFrame), "testpd parameter must be a pandas DataFrame"🧰 Tools
🪛 Ruff
5-5: Undefined name
snakemake(F821)
6-6: Undefined name
snakemake(F821)
7-7: Undefined name
snakemake(F821)
1-11: Well-structured test design for parameter pickling validation.The test effectively validates both type safety and data persistence for all three data structures (numpy, polars, pandas). The focused scope aligns perfectly with the PR's objective of ensuring robust, version-agnostic pickling of parameters.
🧰 Tools
🪛 Ruff
5-5: Undefined name
snakemake(F821)
6-6: Undefined name
snakemake(F821)
7-7: Undefined name
snakemake(F821)
9-9: Undefined name
snakemake(F821)
9-9: Undefined name
snakemake(F821)
10-10: Undefined name
snakemake(F821)
10-10: Undefined name
snakemake(F821)
11-11: Undefined name
snakemake(F821)
11-11: Undefined name
snakemake(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
tests/test_params_pickling/test.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/test_params_pickling/test.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff
tests/test_params_pickling/test.py
5-5: Undefined name snakemake
(F821)
6-6: Undefined name snakemake
(F821)
7-7: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🔇 Additional comments (2)
tests/test_params_pickling/test.py (2)
1-3: LGTM! Clean and conventional imports.
The imports follow standard conventions and align with the PR's objective of handling pandas, polars, and numpy data structures.
9-11: Consider adding error handling for I/O operations.
The export operations could fail due to various I/O issues (disk space, permissions, etc.). Consider wrapping these operations in a try-except block.
Let's verify the output paths and permissions:
🧰 Tools
🪛 Ruff
9-9: Undefined name snakemake
(F821)
9-9: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
🤖 I have created a release *beep* *boop* --- ## [8.25.1](v8.25.0...v8.25.1) (2024-11-01) ### Bug Fixes * ensure correct topological order when touching group job outputs ([#3181](#3181)) ([5924a3e](5924a3e)) * ensure version agnostic robust pickling of pandas, polars and numpy data structures passed as params to Python scripts or notebooks ([#3175](#3175)) ([eb11137](eb11137)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>



QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
Release Notes
New Features
Chores
polarsas a new dependency in the test environment setup.