feat: add support for validation of polars dataframe and lazyframe#3262
feat: add support for validation of polars dataframe and lazyframe#3262johanneskoester merged 16 commits intosnakemake:mainfrom
Conversation
📝 WalkthroughWalkthroughThe changes update test code and utility functions to support sample data handling using both Pandas and Polars libraries. In the tests (in the Snakefile), new methods for reading sample data into dictionaries and validating them against a schema are introduced. The utility module's Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (19)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (5)
🧰 Additional context used📓 Path-based instructions (1)`**/*.py`: Do not try to improve formatting. Do not suggest ...
🪛 Ruff (0.8.2)snakemake/ioutils/input.py16-16: Do not assign a Rewrite (E731) 30-30: Within an (B904) 30-30: Undefined name (F821) 🪛 actionlint (1.7.4).github/workflows/main.yml52-52: "exclude" section should not be empty (syntax-check) 52-52: "exclude" section should not be empty (syntax-check) 🪛 YAMLlint (1.35.1).github/workflows/main.yml[error] 53-53: trailing spaces (trailing-spaces) [error] 218-218: no new line character at the end of file (new-line-at-end-of-file) ⏰ Context from checks skipped due to timeout of 90000ms (30)
🔇 Additional comments (25)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/test_validate/Snakefile (1)
Line range hint
20-22: Critical: Fix incompatible DataFrame access in ruleall.The
allrule usessamples.indexwhich is pandas-specific syntax. This will fail with polars DataFrames which have a different API for accessing indices.Consider either:
- Storing the pandas DataFrame in a separate variable, or
- Adapting the index access for polars:
rule all: input: - expand("test.{sample}.txt", sample=samples.index) + # Option 1: Keep pandas DataFrame for rule expansion + expand("test.{sample}.txt", sample=pd_samples.index) + # Option 2: Adapt for polars + expand("test.{sample}.txt", sample=samples["sample"].to_list())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_validate/Snakefile(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: testing (10, 3.12)
- GitHub Check: testing (10, 3.11)
- GitHub Check: testing (9, 3.12)
- GitHub Check: testing (9, 3.11)
- GitHub Check: testing (8, 3.12)
- GitHub Check: testing (8, 3.11)
- GitHub Check: testing (7, 3.12)
- GitHub Check: testing (7, 3.11)
- GitHub Check: testing (6, 3.12)
- GitHub Check: testing (6, 3.11)
- GitHub Check: testing (5, 3.12)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (5, 3.11)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (4, 3.12)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (4, 3.11)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (3, 3.12)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (3, 3.11)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (2, 3.12)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (2, 3.11)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.12)
- GitHub Check: testing-windows (2)
- GitHub Check: testing (1, 3.11)
- GitHub Check: testing-windows (1)
🔇 Additional comments (2)
tests/test_validate/Snakefile (2)
4-4: LGTM! Import statement follows conventions.The polars import is correctly placed and uses the conventional
plalias.
17-18: Verify LazyFrame validation support and consider performance implications.The addition of LazyFrame validation raises several concerns:
- Verify that the validation utility properly supports LazyFrame objects, as they behave differently from eager DataFrames.
- Consider the performance impact of multiple validation calls on the same data.
Consider refactoring to:
- Add explicit tests for both DataFrame and LazyFrame validation
- Document the behavior differences between eager and lazy validation
-samples = pl.scan_csv(config["samples"]) -validate(samples, "samples.schema.yaml") +# Test LazyFrame validation +lazy_samples = pl.scan_csv(config["samples"]) +validate(lazy_samples, "samples.schema.yaml")Let's verify the validation utility's LazyFrame support:
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff (0.8.2)
snakemake/utils.py
138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: testing (10, 3.12)
- GitHub Check: testing (10, 3.11)
- GitHub Check: testing (9, 3.12)
- GitHub Check: testing (9, 3.11)
- GitHub Check: testing (8, 3.12)
- GitHub Check: testing (8, 3.11)
- GitHub Check: testing (7, 3.12)
- GitHub Check: testing (7, 3.11)
- GitHub Check: testing (6, 3.12)
- GitHub Check: testing (6, 3.11)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (5, 3.12)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (5, 3.11)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (4, 3.12)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (4, 3.11)
- GitHub Check: testing-windows (6)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (3, 3.12)
- GitHub Check: testing (3, 3.11)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (2, 3.12)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (2, 3.11)
- GitHub Check: testing-windows (2)
- GitHub Check: testing (1, 3.12)
- GitHub Check: testing (1, 3.11)
- GitHub Check: testing-windows (1)
- GitHub Check: apidocs
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/test_validate/samples.tsvis excluded by!**/*.tsv
📒 Files selected for processing (4)
snakemake/utils.py(1 hunks)tests/test_validate/Snakefile(1 hunks)tests/test_validate/config.schema.yaml(1 hunks)tests/test_validate/samples.schema.yaml(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/test_validate/config.schema.yaml
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff (0.8.2)
snakemake/utils.py
141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
165-165: Comparison to None should be cond is None
Replace with cond is None
(E711)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: testing (10, 3.12)
- GitHub Check: testing (10, 3.11)
- GitHub Check: testing (9, 3.12)
- GitHub Check: testing (9, 3.11)
- GitHub Check: testing (8, 3.12)
- GitHub Check: testing (8, 3.11)
- GitHub Check: testing (7, 3.12)
- GitHub Check: testing (7, 3.11)
- GitHub Check: testing (5, 3.12)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (5, 3.11)
- GitHub Check: testing-windows (9)
- GitHub Check: testing-windows (8)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (3, 3.12)
- GitHub Check: testing-windows (6)
- GitHub Check: testing-windows (5)
- GitHub Check: testing-windows (4)
- GitHub Check: testing-windows (3)
- GitHub Check: testing-windows (2)
🔇 Additional comments (9)
tests/test_validate/samples.schema.yaml (1)
9-17: LGTM! Schema changes look good.The schema changes are well-structured:
- Simplified description for
conditionproperty improves clarity- New properties
nandtissuehave appropriate types and default valuestests/test_validate/Snakefile (5)
4-4: LGTM! Import statement is correct.The polars import is properly added.
14-24: LGTM! Polars DataFrame validation is well implemented.The implementation:
- Uses correct separator for TSV files
- Defines appropriate schema with correct data types
- Includes comprehensive assertions
26-34: LGTM! Polars LazyFrame validation is properly implemented.The implementation:
- Maintains consistency with DataFrame validation
- Correctly handles lazy evaluation
- Uses
set_default=Falseappropriately as LazyFrames don't support default values
36-41: LGTM! Pandas DataFrame validations are consistent.Both indexed and non-indexed DataFrame validations:
- Maintain consistent assertions with polars implementations
- Properly test the new schema properties
Also applies to: 54-59
43-52: LGTM! Dict validation provides good coverage.The implementation:
- Tests both complete and partial dictionaries
- Properly handles null values
- Validates schema defaults
snakemake/utils.py (3)
114-116: LGTM! Import and initialization are correct.The polars import and records initialization are properly implemented.
117-128: LGTM! DataFrame type handling is well implemented.Good implementation choices:
- Clear logging messages for each DataFrame type
- Efficient handling of LazyFrame by limiting to first 1000 rows
- Proper type checking order
130-141: 🛠️ Refactor suggestionUse exception chaining for better error tracking.
The error handling should use exception chaining to preserve the original traceback.
Apply this diff:
- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from eLikely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
141-141: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
snakemake/utils.py (2)
114-128: LGTM! Good optimization for LazyFrame processing.The implementation correctly handles different DataFrame types and optimizes LazyFrame processing by limiting it to the first 1000 rows.
However, consider making the error message more descriptive:
- raise WorkflowError("Unsupported data type for validation.") + raise WorkflowError(f"Unsupported data type for validation. Expected pandas DataFrame, polars DataFrame, or polars LazyFrame, got {type(data)}.")
171-172: Improve error message for DataFrame validation.The error message could be more descriptive to help with debugging.
Apply this diff:
- raise WorkflowError("Error validating data frame.") + raise WorkflowError("Error validating data frame. Ensure the data is a pandas DataFrame, polars DataFrame, or polars LazyFrame.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🪛 Ruff (0.8.2)
snakemake/utils.py
141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (31)
- GitHub Check: testing (10, 3.12)
- GitHub Check: testing (10, 3.11)
- GitHub Check: testing (9, 3.12)
- GitHub Check: testing (9, 3.11)
- GitHub Check: testing (8, 3.12)
- GitHub Check: testing (8, 3.11)
- GitHub Check: testing (7, 3.12)
- GitHub Check: testing (7, 3.11)
- GitHub Check: testing (6, 3.12)
- GitHub Check: testing (6, 3.11)
- GitHub Check: testing (5, 3.12)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (5, 3.11)
- GitHub Check: testing-windows (9)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (4, 3.12)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (4, 3.11)
- GitHub Check: testing (3, 3.12)
- GitHub Check: testing-windows (6)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (3, 3.11)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (2, 3.12)
- GitHub Check: testing (2, 3.11)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.12)
- GitHub Check: testing-windows (2)
- GitHub Check: testing (1, 3.11)
- GitHub Check: testing-windows (1)
- GitHub Check: apidocs
🔇 Additional comments (2)
snakemake/utils.py (2)
143-167: 🛠️ Refactor suggestionOptimize polars DataFrame updates using vectorized operations.
The current implementation uses nested loops for updating polars DataFrame values, which is inefficient. Consider using vectorized operations instead.
Apply this diff to improve performance:
- # Fill in None values with values from newdata - for i in range(data.shape[0]): - for j in range(data.shape[1]): - if data[i, j] is None: - data[i, j] = newdata[i, j] + # Fill in None values with values from newdata using vectorized operations + data = data.with_columns([ + pl.when(pl.col(col).is_null()) + .then(newdata[col]) + .otherwise(pl.col(col)) + .alias(col) + for col in data.columns + ])Likely invalid or redundant comment.
130-141: 🛠️ Refactor suggestionAdd exception chaining to improve error traceability.
The exception handling should use proper chaining to preserve the original traceback.
Apply this diff:
- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from eAlso, consider using
polars.is_null()for polars DataFrames instead ofpandas.isnull():- record = {k: v for k, v in record.items() if not pd.isnull(v)} + record = {k: v for k, v in record.items() if not (pd.isnull(v) if isinstance(data, pd.DataFrame) else pl.is_null(v))}Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff (0.8.2)
141-141: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_validate/Snakefile (1)
50-57: Final overshadowing of ‘samples’ before using it in the rule input.
While this final assignment is needed to accesssamples.index, it’s best practice to store the earlier Polars-based objects under different names. This ensures clarity when referencing the finalsamplesDataFrame for therule allinput.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_validate/Snakefile(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (40)
- GitHub Check: testing (10, 3.12, dash)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (9, 3.12, dash)
- GitHub Check: testing (9, 3.12, bash)
- GitHub Check: testing (9, 3.11, bash)
- GitHub Check: testing (8, 3.12, dash)
- GitHub Check: testing (8, 3.12, bash)
- GitHub Check: testing (8, 3.11, bash)
- GitHub Check: testing (7, 3.12, dash)
- GitHub Check: testing (7, 3.12, bash)
- GitHub Check: testing (7, 3.11, bash)
- GitHub Check: testing (6, 3.12, dash)
- GitHub Check: testing (6, 3.12, bash)
- GitHub Check: testing (6, 3.11, bash)
- GitHub Check: testing (5, 3.12, dash)
- GitHub Check: testing (5, 3.12, bash)
- GitHub Check: testing (5, 3.11, bash)
- GitHub Check: testing (4, 3.12, dash)
- GitHub Check: testing (4, 3.12, bash)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (3, 3.12, dash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (2, 3.12, dash)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (1, 3.12, dash)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing-windows (1)
🔇 Additional comments (5)
tests/test_validate/Snakefile (5)
4-4: All good on the import statement.
This import is necessary for the Polars functionality below. No concerns here.
10-20: Repeated overshadowing of the 'samples' variable and schema verification needed.
We’ve previously noted that reassigningsamplesto different data types (Polars DF, Polars LazyFrame, Pandas DF, dict) can be confusing and prone to errors, particularly if you accidentally rely on old references. Additionally, the code checks for'tissue'but only'sample','condition', and'n'appear in the Polars schema. Ensure your YAML schema includes'tissue'or confirm that Polars gracefully infers it.
[duplicate_comment, request_verification]
22-31: Variable overshadowing repeated with Polars LazyFrame.
You are again overwritingsamples, making it difficult to track which DataFrame or LazyFrame is the active one. Consider using separate variables (e.g.polars_df,lazy_df, etc.) to enhance readability and avoid pitfalls.
32-37: Overwriting the ‘samples’ variable again for Pandas usage.
Reusing the ‘samples’ name makes the code more difficult to maintain and debug. A more descriptive variable name (e.g.pandas_df_noindex) can help distinguish it from the Polars data structures above.
39-48: Assigning a dict to ‘samples’ continues the overshadowing pattern.
Switching ‘samples’ from a DataFrame to a dict within the same file can be confusing. Keep each data structure in its own variable to minimize accidental mix-ups.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (3)
snakemake/utils.py (3)
138-138: 🛠️ Refactor suggestionUse exception chaining for better traceback information.
When re-raising an exception, use
fromto preserve the original traceback information.- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
138-138: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
156-156: 🛠️ Refactor suggestionUse exception chaining for better traceback information.
When re-raising an exception, use
fromto preserve the original traceback information.- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
156-156: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
184-184: 🛠️ Refactor suggestionUse exception chaining for better traceback information.
When re-raising an exception, use
fromto preserve the original traceback information.- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
184-184: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
🧹 Nitpick comments (1)
snakemake/utils.py (1)
158-174: Consider optimizing Polars DataFrame updates with vectorized operations.The current implementation iterates through all cells in the DataFrame, which could be inefficient for large datasets. Consider using Polars' vectorized operations for better performance.
if set_default: newdata = pl.DataFrame(recordlist) # Add missing columns newcol = [col for col in newdata.columns if col not in data.columns] [ data.insert_column( len(data.columns), pl.lit(None, newdata[col].dtype).alias(col), ) for col in newcol ] # Fill in None values with values from newdata - for i in range(data.shape[0]): - for j in range(data.shape[1]): - if data[i, j] is None: - data[i, j] = newdata[i, j] + # Use vectorized operations to update null values + for col in data.columns: + if col in newdata.columns: + data = data.with_columns([ + pl.when(pl.col(col).is_null()) + .then(newdata[col]) + .otherwise(pl.col(col)) + .alias(col) + ])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py
128-128: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
156-156: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
184-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (40)
- GitHub Check: testing (10, 3.12, dash)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (9, 3.12, dash)
- GitHub Check: testing (9, 3.12, bash)
- GitHub Check: testing (9, 3.11, bash)
- GitHub Check: testing (8, 3.12, dash)
- GitHub Check: testing (8, 3.12, bash)
- GitHub Check: testing (8, 3.11, bash)
- GitHub Check: testing (7, 3.12, dash)
- GitHub Check: testing (7, 3.12, bash)
- GitHub Check: testing (7, 3.11, bash)
- GitHub Check: testing (6, 3.12, dash)
- GitHub Check: testing (6, 3.12, bash)
- GitHub Check: testing (6, 3.11, bash)
- GitHub Check: testing (5, 3.12, dash)
- GitHub Check: testing (5, 3.12, bash)
- GitHub Check: testing (5, 3.11, bash)
- GitHub Check: testing (4, 3.12, dash)
- GitHub Check: testing (4, 3.12, bash)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (3, 3.12, dash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (2, 3.12, dash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing (1, 3.12, dash)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing-windows (1)
🔇 Additional comments (3)
snakemake/utils.py (3)
114-122: Good refactoring of validation logic!The extraction of common validation logic into a helper function is a good design choice, making the code more maintainable and reducing duplication.
148-157: Good implementation of Polars DataFrame validation!The validation logic for Polars DataFrames follows a similar pattern to the pandas implementation, using appropriate Polars-specific methods like
iter_rows(named=True).🧰 Tools
🪛 Ruff (0.8.2)
156-156: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
175-188: Good implementation of LazyFrame validation with reasonable limitations.Limiting validation to the first 1000 rows is a sensible approach for potentially large LazyFrames. The warning about not supporting default values is appropriate since LazyFrames are immutable.
🧰 Tools
🪛 Ruff (0.8.2)
184-184: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
snakemake/utils.py (3)
148-161: Use exception chaining when validating polars DataFramesContinue the pattern of using exception chaining for better error diagnostics.
Apply this diff to improve error handling:
- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
160-160: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
162-178: Consider using vectorized operations for polars DataFrame updatesThe current implementation uses nested loops to update the polars DataFrame, which could be inefficient for large datasets. Polars is designed for vectorized operations.
Consider this alternative approach using vectorized operations:
if set_default: newdata = pl.DataFrame(recordlist) # Add missing columns for col in [c for c in newdata.columns if c not in data.columns]: data = data.with_columns([ pl.lit(None, newdata[col].dtype).alias(col) ]) # Update existing values data = data.with_columns([ pl.when(pl.col(col).is_null()) .then(pl.col(col).fill_null(newdata[col])) .otherwise(pl.col(col)) .alias(col) for col in data.columns ])
179-193: Use exception chaining for LazyFrame validationConsistent with previous comments, use exception chaining here as well.
Apply this diff:
- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
192-192: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py
126-126: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
160-160: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
192-192: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (23)
- GitHub Check: testing (10, 3.12, dash)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing (3, 3.12, dash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (2, 3.12, dash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (1, 3.12, dash)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing-windows (1)
🔇 Additional comments (6)
snakemake/utils.py (6)
111-113: Ensure polars is correctly importedThe import statement now includes both pandas and polars, which is necessary for handling both types of DataFrames. The code follows past review comments by correctly importing polars as
pl.
114-120: Good refactoring with new helper functionExtracting the validation logic into
_validate_recordhelper function improves code maintainability by reducing duplication across different data types.
121-126: Use exception chaining for better error diagnosticsWhen re-raising exceptions, it's best practice to use exception chaining to preserve the original traceback.
Apply this diff to improve error handling:
- raise WorkflowError("Error validating config file.", e) + raise WorkflowError("Error validating config file.") from e🧰 Tools
🪛 Ruff (0.8.2)
126-126: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
128-138: Use exception chaining when validating pandas DataFramesSimilar to the previous comment, use exception chaining for better error diagnostics.
Apply this diff to improve error handling:
- raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e🧰 Tools
🪛 Ruff (0.8.2)
138-138: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
194-196: Good handling of LazyFrame limitationsThe code correctly warns that LazyFrames don't support setting default values, which is a clear limitation for users.
197-198: Appropriate error handling for unsupported data typesThe error message clearly communicates that the provided data type is not supported for validation.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/test_validate/Snakefile (1)
14-25: Consider renaming the variable to avoid confusion.
Repeatedly overwritingsampleswith different DataFrame types (Polars DataFrame here, then LazyFrame, then Pandas) can be confusing and error-prone. If you need to switch between data types, consider storing them in separate variables (e.g.,pl_samples,pl_lazy_samples,pd_samples) before validation.-# Polars DataFrame -samples = pl.read_csv( +pl_samples = pl.read_csv( config["samples"], separator="\t", ... ) -validate(samples, "samples.schema.yaml") +validate(pl_samples, "samples.schema.yaml")snakemake/utils.py (3)
118-149: Use exception chaining in_validate_pandas().
In lines 132-134, re-raising theValidationErrorwithout chaining hides the original traceback. Preferraise WorkflowError(...) from eto preserve context.except jsonschema.exceptions.ValidationError as e: - raise WorkflowError( - f"Error validating row {i} of data frame.", e - ) + raise WorkflowError( + f"Error validating row {i} of data frame." + ) from e🧰 Tools
🪛 Ruff (0.8.2)
132-134: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
150-218: Several improvements recommended for_validate_polars().
- Use exception chaining when re-raising at lines 168-170 and 206-208.
- Consider vectorized updates for setting default values instead of row-by-row loops at lines 172-188, which can be slow for large data.
# Example for exception chaining: except jsonschema.exceptions.ValidationError as e: - raise WorkflowError(f"Error validating row {i} of data frame.", e) + raise WorkflowError(f"Error validating row {i} of data frame.") from e# Example for vectorized filling of missing values: - for i in range(data.shape[0]): - for j in range(data.shape[1]): - if data[i, j] is None: - data[i, j] = newdata[i, j] + # Instead, construct a mask of null rows and columns, then fill them with newdata columns + # or consider using pl.when/pl.col for a more efficient approach.🧰 Tools
🪛 Ruff (0.8.2)
168-170: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
206-208: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
219-226: Use exception chaining in the dictionary validation block as well.
At line 224, re-raising the exception withraise WorkflowError(...) from eis recommended to preserve the original traceback.except jsonschema.exceptions.ValidationError as e: - raise WorkflowError("Error validating config file.", e) + raise WorkflowError("Error validating config file.") from e🧰 Tools
🪛 Ruff (0.8.2)
224-224: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
tests/test_validate/Snakefile(2 hunks)tests/test_validate/Snakefile(1 hunks)snakemake/utils.py(1 hunks)snakemake/utils.py(2 hunks)tests/test_validate/Snakefile(1 hunks)snakemake/utils.py(1 hunks)tests/test_validate/Snakefile(2 hunks)snakemake/utils.py(1 hunks)snakemake/utils.py(2 hunks)snakemake/utils.py(5 hunks)snakemake/utils.py(2 hunks)snakemake/utils.py(1 hunks)tests/test_validate/Snakefile(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- snakemake/utils.py
🚧 Files skipped from review as they are similar to previous changes (7)
- snakemake/utils.py
- tests/test_validate/Snakefile
- snakemake/utils.py
- tests/test_validate/Snakefile
- tests/test_validate/Snakefile
- snakemake/utils.py
- snakemake/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py
132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
168-170: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
168-170: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
206-208: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
224-224: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (40)
- GitHub Check: testing (10, 3.12, dash)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (9, 3.12, dash)
- GitHub Check: testing (9, 3.12, bash)
- GitHub Check: testing (9, 3.11, bash)
- GitHub Check: testing (8, 3.12, dash)
- GitHub Check: testing (8, 3.12, bash)
- GitHub Check: testing (8, 3.11, bash)
- GitHub Check: testing (7, 3.12, dash)
- GitHub Check: testing (7, 3.12, bash)
- GitHub Check: testing (7, 3.11, bash)
- GitHub Check: testing (6, 3.12, dash)
- GitHub Check: testing (6, 3.12, bash)
- GitHub Check: testing (6, 3.11, bash)
- GitHub Check: testing (5, 3.12, dash)
- GitHub Check: testing (5, 3.12, bash)
- GitHub Check: testing (5, 3.11, bash)
- GitHub Check: testing (4, 3.12, dash)
- GitHub Check: testing (4, 3.12, bash)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing (3, 3.12, dash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (2, 3.12, dash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (1, 3.12, dash)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing-windows (1)
🔇 Additional comments (5)
tests/test_validate/Snakefile (5)
26-35: Validate large LazyFrames with caution.
Reading 1000 rows from a largeLazyFrameis reasonable for quick checks. If you need more thorough validation, consider an additional mechanism to sample or partition the data to avoid excessive memory usage.
36-41: Polars to Pandas transition looks good.
No issues spotted. The validation flow for the Pandas DataFrame without an index is clear and straightforward.
43-52: Dictionary validation approach is effective.
Validating individual rows as dictionaries is a good complement to DataFrame-level checks.
54-59: Retaining the index in Pandas DataFrame is consistent.
The logic to retain and validate the “sample” column by setting it as the DataFrame index works properly. No concerns here.
64-64:Details
❓ Verification inconclusive
Confirm final assignment of
samples.
Even though you reassignsamplesmultiple times, you ultimately rely onsamples.indexhere. Please verify that this references the Pandas DataFrame with index from line 55 and not an unintended earlier version.
🏁 Script executed:
#!/bin/bash # Ensure references to `samples.index` come after the final assignment to Pandas. rg -A 3 'samples.index'Length of output: 231
Action Required: Verify that the final assignment of
samplesis correctly usedI didn’t find any occurrence of an outdated
samples.indexreference in the test file—the only instance appears in the documentation atdocs/snakefiles/configuration.rst. Please double-check that the final assignment intests/test_validate/Snakefile(particularly the DataFrame setup on line 55) is the one that’s ultimately used (i.e. that later references to.indexcome from this final, correct assignment rather than an earlier version).
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tests/tests_using_conda.py (1)
332-345:⚠️ Potential issueRemove duplicated test functions
The test functions
test_get_log_both,test_get_log_stderr,test_get_log_stdout, andtest_get_log_complexare duplicated in this file. The same functions appear earlier at lines 318-331. These duplicates should be removed to prevent confusion and potential testing issues.-def test_get_log_both(): - run(dpath("test_get_log_both")) - - -def test_get_log_stderr(): - run(dpath("test_get_log_stderr")) - - -def test_get_log_stdout(): - run(dpath("test_get_log_stdout")) - - -def test_get_log_complex(): - run(dpath("test_get_log_complex"))
🧹 Nitpick comments (3)
snakemake/script/__init__.py (1)
56-102: Well-implementedReportHrefclass for report hyperlink management.The class provides a clean interface for creating and manipulating report hyperlinks, with methods for child paths, URL arguments, and anchors. The implementation properly handles the hierarchical relationships and URL encoding.
The code in lines 97-100 can be simplified using a ternary operator:
- if self._anchor: - anchor = f"#{urllib.parse.quote(self._anchor)}" - else: - anchor = "" + anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""🧰 Tools
🪛 Ruff (0.8.2)
97-100: Use ternary operator
anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""instead ofif-else-blockReplace
if-else-block withanchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""(SIM108)
snakemake/workflow.py (1)
127-127: Fix unused import in module.The
apimodule is imported but not used anywhere in the file.-from snakemake import api, caching, sourcecache +from snakemake import caching, sourcecache🧰 Tools
🪛 Ruff (0.8.2)
127-127:
snakemake.apiimported but unusedRemove unused import:
snakemake.api(F401)
docs/snakefiles/reporting.rst (1)
137-140: Dynamic Metadata for Report Items:
Allowing dynamic determination of category, subcategory, and labels through functions adds useful flexibility to report configuration. To further benefit users, consider adding a short code example demonstrating a function that returns these values, along with any type expectations.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/test_script_xsh/expected-results/test.outis excluded by!**/*.out
📒 Files selected for processing (23)
docs/snakefiles/deployment.rst(4 hunks)docs/snakefiles/reporting.rst(4 hunks)docs/snakefiles/rules.rst(2 hunks)snakemake/cli.py(4 hunks)snakemake/script/__init__.py(4 hunks)snakemake/workflow.py(4 hunks)tests/tests_using_conda.py(1 hunks)docs/snakefiles/deployment.rst(2 hunks)docs/snakefiles/reporting.rst(5 hunks)docs/snakefiles/rules.rst(1 hunks)snakemake/cli.py(1 hunks)snakemake/script/__init__.py(3 hunks)snakemake/workflow.py(1 hunks)tests/test_conda_python_3_7_script/Snakefile(1 hunks)tests/test_conda_python_3_7_script/test_script_python_3_7.py(1 hunks)tests/test_conda_run/Snakefile(1 hunks)tests/test_conda_run/expected-results/test.txt(1 hunks)tests/test_conda_run/test_python_env.yaml(1 hunks)tests/test_conda_run/test_script_run.py(1 hunks)tests/test_script_xsh/Snakefile(1 hunks)tests/test_script_xsh/envs/xonsh.yaml(1 hunks)tests/test_script_xsh/scripts/test.xsh(1 hunks)tests/tests_using_conda.py(1 hunks)
✅ Files skipped from review due to trivial changes (6)
- tests/test_conda_run/expected-results/test.txt
- tests/test_conda_python_3_7_script/test_script_python_3_7.py
- tests/test_script_xsh/scripts/test.xsh
- tests/test_conda_run/test_script_run.py
- tests/test_conda_run/test_python_env.yaml
- tests/test_script_xsh/envs/xonsh.yaml
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
tests/tests_using_conda.pysnakemake/cli.pysnakemake/workflow.pysnakemake/script/__init__.py
🪛 Ruff (0.8.2)
snakemake/workflow.py
1812-1812: Avoid inequality comparisons to False; use if ruleinfo.container_img: for truth checks
Replace with ruleinfo.container_img
(E712)
127-127: snakemake.api imported but unused
Remove unused import: snakemake.api
(F401)
snakemake/script/__init__.py
97-100: Use ternary operator anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else "" instead of if-else-block
Replace if-else-block with anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""
(SIM108)
⏰ Context from checks skipped due to timeout of 90000ms (40)
- GitHub Check: testing (10, 3.12, dash)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (9, 3.12, dash)
- GitHub Check: testing (9, 3.12, bash)
- GitHub Check: testing (9, 3.11, bash)
- GitHub Check: testing (8, 3.12, dash)
- GitHub Check: testing (8, 3.12, bash)
- GitHub Check: testing (8, 3.11, bash)
- GitHub Check: testing (7, 3.12, dash)
- GitHub Check: testing (7, 3.12, bash)
- GitHub Check: testing (7, 3.11, bash)
- GitHub Check: testing (6, 3.12, dash)
- GitHub Check: testing (6, 3.12, bash)
- GitHub Check: testing (6, 3.11, bash)
- GitHub Check: testing (5, 3.12, dash)
- GitHub Check: testing (5, 3.12, bash)
- GitHub Check: testing (5, 3.11, bash)
- GitHub Check: testing (4, 3.12, dash)
- GitHub Check: testing (4, 3.12, bash)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (3, 3.12, dash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (2, 3.12, dash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (1, 3.12, dash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (1)
🔇 Additional comments (25)
tests/test_script_xsh/Snakefile (1)
1-13: Looks good!This Snakefile is well-structured and follows standard Snakemake conventions. It defines a simple workflow with two rules:
allwhich specifies the final output, andtest_xonshwhich uses a conda environment and executes a script.tests/test_conda_python_3_7_script/Snakefile (1)
7-7: Appropriate script file name update.The change from "test_script.py" to "test_script_python_3_7.py" clearly indicates the Python version to be used, which improves clarity.
tests/test_conda_run/Snakefile (1)
1-14: Well-structured test rule with helpful comments.The rule is properly defined with appropriate output, conda, params and run directives. The comments clearly explain that this pattern is for testing purposes only and provides guidance about what should be used in real workflows.
docs/snakefiles/deployment.rst (3)
425-425: Correct capitalization of Docker.The change properly capitalizes "Docker" as it's a proper noun/brand name.
451-454: Valuable clarification about container isolation behavior.This addition provides important information about how environment variable handling differs between container engines (Docker vs Apptainer/Singularity), which will help users understand and troubleshoot container behavior.
640-645: Enhanced documentation about global workflow dependencies.The additions to this section improve the documentation about global dependencies by including more examples and clarifying the relationship with storage plugins.
Also applies to: 662-667
snakemake/script/__init__.py (4)
23-26: Appropriate imports added for newReportHreffunctionality.The new imports are correctly added to support the
ReportHrefclass implementation:Dictfor type hints,urllib.parsefor URL encoding, andTypeVarfor creating a self-reference type for Python <3.11 compatibility.
42-42: Import ofget_report_idis appropriately added.This import is necessary for the report href functionality to generate unique identifiers for reports.
52-54: Good backwards compatibility approach withTypeVar.Using a
TypeVarbound to theReportHrefclass provides compatibility with Python versions before 3.11 wheretyping.Selfisn't available, ensuring the class works across different Python versions.
133-141: Well-documented method for generating report links.The
report_hrefmethod provides a convenient way to create hyperlinks to paths within the report context, with good documentation explaining its purpose and the possibility of extending to child paths.snakemake/workflow.py (2)
57-57: Appropriate addition ofMissingOutputFileCachePathExceptionto imports.This exception is now properly imported to handle cases where cache path is missing in the
_prepare_dagmethod.
689-708: Improved error handling for output file cache initialization.This change enhances the robustness of the code by gracefully handling missing cache path scenarios. Instead of failing with an exception, it now catches the
MissingOutputFileCachePathExceptionand provides a helpful warning message to the user with clear instructions on how to resolve the issue.The message includes the specific environment variable that needs to be set (
caching.LOCATION_ENVVAR), showing good user experience considerations.docs/snakefiles/rules.rst (2)
923-927: Enhanced GPU Resource Definitions:
The new definitions forgpu,gpu_manufacturer, andgpu_modelsignificantly clarify how GPU resources should be specified. Please double-check that the renaming from “gpus” to “gpu” is propagated consistently across all examples and explanatory text in the document.
972-973: Updated Resource Scopes Example:
The revised example using
$ snakemake --set-resource-scopes gpu=global disk_mb=local
improves clarity regarding resource categorization. Please verify that this example aligns correctly with the new GPU resource naming conventions throughout the documentation.docs/snakefiles/reporting.rst (10)
7-10: Updated Report Generation Overview:
The introduction now clearly distinguishes between self-contained HTML reports and ZIP archives, outlining their purposes and benefits since Snakemake 5.1. This update provides a succinct overview of report capabilities. Consider adding links to more detailed examples or further documentation if available.
13-16: New "Including Results in a Report" Section:
The addition of this section is valuable—it outlines how output files must be marked with thereportflag, specifying captions and categories to guide report assembly. This extra guidance will help users better annotate their workflows for reporting purposes.
112-127: Enhanced Example with File Labels:
In the modified example for ruleb, the inclusion of alabelsdictionary adds a higher level of abstraction by enabling descriptive names instead of raw filenames. Please ensure that the placeholder (e.g.{model}) is rendered correctly at runtime and consider including a brief note on expected value types if needed.
142-147: Linking from Captions Clarification:
The new subsection on linking from captions provides users with a clear procedure to reference various panels and report items directly from their reStructuredText documents. Including a brief warning about common pitfalls (such as ensuring proper hyperlink targets) might further advise users.
153-160: Linking Within Report Results:
Explaining the usage ofsnakemake.report_href()to generate internal report links is an excellent enhancement. The added details, including how to extend links with child paths, URL arguments, or anchors, make this feature more accessible.
247-253: Report Rendering Details:
The "Rendering reports" section now explains that all metadata (e.g. runtime statistics) are automatically collected from the.snakemakedirectory. This clarification improves transparency on how the report content is generated.
255-268: Self-Contained HTML Report Instructions:
The instructions for creating a self-contained HTML report (usingsnakemake --reportand specifying a custom name with--report my_report.html) are clear and straightforward. Please confirm that the reference (snakefiles-self_contained_html_file) is correctly implemented elsewhere and that any file size or embedding limitations are documented if necessary.
275-290: Self-Contained ZIP Archive Option:
The new section on generating a self-contained ZIP archive report is comprehensive—it explains the use case for more complex reports and how to generate them by specifying a.zipfile. This will be especially useful for large workflows with many samples. Just ensure that instructions regarding unpacking and identifying thereport.htmlentry point are fully accurate.
292-302: Partial Reports Feature:
Allowing users to generate reports for only specific targets (e.g. by runningsnakemake fig1.svg --report report-short.html) is a practical addition for iterative development and debugging. It might be helpful to briefly mention any restrictions or interactions with other reporting options.
306-318: Custom Report Layout:
The section on custom layouts explains how to apply a custom stylesheet to a report for branding or aesthetic purposes. The provided example and references to downloadable source files are especially valuable. This addition offers significant flexibility for advanced users who wish to tailor the report’s appearance.snakemake/cli.py (1)
957-966: Well-crafted documentation improvementThe updated help text for the
--reportargument provides much clearer guidance on how to use this feature. It effectively explains the distinction between using.htmlfiles for smaller datasets versus.zipfiles for larger ones, and includes helpful instructions for accessing the report after unpacking the archive.
johanneskoester
left a comment
There was a problem hiding this comment.
The only thing that is missing here is to mention that both polars and pandas work for validate here: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation
|
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
.github/workflows/main.yml (2)
94-98:⚠️ Potential issueDuplicate and Malformed AppArmor Restriction Step
There are two steps named "Disable apparmor namespace restrictions for apptainer," and the second instance (lines 96–98) contains a redirection (
> /etc/sysctl.d/90-disable-userns-restrictions.conf') without a preceding command. This looks like a syntax error and duplication. Please consolidate these steps and ensure the intended command is complete. For example, you might want to create or overwrite the configuration file with valid shell syntax.
147-154:⚠️ Potential issueDuplicate 'with:' Block in Artifact Upload Step
In the "Upload report" step, there are two separate
with:blocks. This duplication can lead to YAML parsing issues. Remove the redundant block and merge the parameters so that only onewith:block is present. For example:- with: - name: report-${{ matrix.py_ver }}.zip - with: - name: report-${{ matrix.py_ver }}.zip - path: tests/test_report/report.zip + with: + name: report-${{ matrix.py_ver }}.zip + path: tests/test_report/report.zip
🧹 Nitpick comments (2)
.github/workflows/main.yml (2)
127-133: Duplicate Test File Entries in Pytest CommandThe pytest command lists
tests/test_io.pyandtests/test_schema.pytwice. Consolidating these duplicate entries would streamline test execution and prevent redundancy.
184-190: Repeated In-Place File Editing in Windows TestingWithin the Windows testing steps (lines 184–190), the script processes the file
test-environment.ymltwice using near-identical loops. If this repetition is unintentional, consolidating the loops will reduce redundant file operations. If it is intentional, consider adding a comment to clarify the purpose.
🛑 Comments failed to post (1)
snakemake/ioutils/input.py (1)
12-30:
⚠️ Potential issueFix issues with lambda, error handling, and undefined WorkflowError.
There are a few issues that need addressing:
- Using a lambda for
fix_file_nameis discouraged (E731)- WorkflowError is undefined (F821)
- The exception isn't properly chained (B904)
-def extract_checksum(infile, **kwargs): +from snakemake.exceptions import WorkflowError + +def extract_checksum(infile, **kwargs): try: import pandas as pd - fix_file_name = lambda x: x.removeprefix("./") + def fix_file_name(x): + return x.removeprefix("./") + return ( pd.read_csv( infile, sep=" ", header=None, engine="python", converters={1: fix_file_name}, ) .set_index(1) .loc[fix_file_name(kwargs.get("file"))] .item() ) except ImportError: - raise WorkflowError("Pandas is required to extract checksum from file.") + raise WorkflowError("Pandas is required to extract checksum from file.") from None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.from snakemake.exceptions import WorkflowError def extract_checksum(infile, **kwargs): try: import pandas as pd def fix_file_name(x): return x.removeprefix("./") return ( pd.read_csv( infile, sep=" ", header=None, engine="python", converters={1: fix_file_name}, ) .set_index(1) .loc[fix_file_name(kwargs.get("file"))] .item() ) except ImportError: raise WorkflowError("Pandas is required to extract checksum from file.") from None🧰 Tools
🪛 Ruff (0.8.2)
16-16: Do not assign a
lambdaexpression, use adefRewrite
fix_file_nameas adef(E731)
30-30: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
30-30: Undefined name
WorkflowError(F821)
🤖 I have created a release *beep* *boop* --- ## [9.0.0](v8.30.0...v9.0.0) (2025-03-14) ### ⚠ BREAKING CHANGES * Logging refactor & add LoggerPluginInterface ([#3107](#3107)) ### Features * [#3412](#3412) - keep shadow folder of failed job if --keep-incomplete flag is set. ([#3430](#3430)) ([22978c3](22978c3)) * add flag --report-after-run to automatically generate the report after a successfull workflow run ([#3428](#3428)) ([b0a7f03](b0a7f03)) * add flatten function to IO utils ([#3424](#3424)) ([67fa392](67fa392)) * add helper functions to parse input files ([#2918](#2918)) ([63e45a7](63e45a7)) * Add option to print redacted file names ([#3089](#3089)) ([ba4d264](ba4d264)) * add support for validation of polars dataframe and lazyframe ([#3262](#3262)) ([c7473a6](c7473a6)) * added support for rendering dag with mermaid js ([#3409](#3409)) ([7bf8381](7bf8381)) * adding --replace-workflow-config to fully replace workflow configs (from config: directive) with --configfile, instead of merging them ([#3381](#3381)) ([47504a0](47504a0)) * Dynamic module name ([#3401](#3401)) ([024dc32](024dc32)) * Enable saving and reloading IOCache object ([#3386](#3386)) ([c935953](c935953)) * files added in rule params with workflow.source_path will be available in used containers ([#3385](#3385)) ([a6e45bf](a6e45bf)) * Fix keep_local in storage directive and more freedom over remote retrieval behaviour ([#3410](#3410)) ([67b4739](67b4739)) * inherit parameters of use rule and extend/replace individual items them when using 'with' directive ([#3365](#3365)) ([93e4b92](93e4b92)) * Logging refactor & add LoggerPluginInterface ([#3107](#3107)) ([86f1d6e](86f1d6e)) * Maximal file size for checksums ([#3368](#3368)) ([b039f8a](b039f8a)) * Modernize package configuration using Pixi ([#3369](#3369)) ([77992d8](77992d8)) * multiext support for named input/output ([#3372](#3372)) ([05e1378](05e1378)) * optionally auto-group jobs via temp files in case of remote execution ([#3378](#3378)) ([cc9bba2](cc9bba2)) ### Bug Fixes * `--delete-all-output` ignores `--dry-run` ([#3265](#3265)) ([23fef82](23fef82)) * 3342 faster touch runs and warning messages for non-existing files ([#3398](#3398)) ([cd9c3c3](cd9c3c3)) * add default value to max-jobs-per-timespan ([#3043](#3043)) ([2959abe](2959abe)) * checkpoints inside modules are overwritten ([#3359](#3359)) ([fba3ac7](fba3ac7)) * Convert Path to IOFile ([#3405](#3405)) ([c58684c](c58684c)) * Do not perform storage object cleanup with --keep-storage-local-copies set ([#3358](#3358)) ([9a6d14b](9a6d14b)) * edgecases of source deployment in case of remote execution ([#3396](#3396)) ([5da13be](5da13be)) * enhance error message formatting for strict DAG-building mode ([#3376](#3376)) ([a1c39ee](a1c39ee)) * fix bug in checkpoint handling that led to exceptions in case checkpoint output was missing upon rerun ([#3423](#3423)) ([8cf4a2f](8cf4a2f)) * force check all required outputs ([#3341](#3341)) ([495a4e7](495a4e7)) * group job formatting ([#3442](#3442)) ([f0b10a3](f0b10a3)) * in remote jobs, upload storage in topological order such that modification dates are preserved (e.g. in case of group jobs) ([#3377](#3377)) ([eace08f](eace08f)) * only skip eval when resource depends on input ([#3374](#3374)) ([4574c92](4574c92)) * Prevent execution of conda in apptainer when not explicitly requested in software deployment method ([#3388](#3388)) ([c43c5c0](c43c5c0)) * print filenames with quotes around them in RuleException ([#3269](#3269)) ([6baeda5](6baeda5)) * Re-evaluation of free resources ([#3399](#3399)) ([6371293](6371293)) * ReadTheDocs layout issue due to src directory change ([#3419](#3419)) ([695b127](695b127)) * robustly escaping quotes in generated bash scripts (v2) ([#3297](#3297)) ([#3389](#3389)) ([58720bd](58720bd)) * Show apptainer image URL in snakemake report ([#3407](#3407)) ([45f0450](45f0450)) * Update ReadTheDocs configuration for documentation build to use Pixi ([#3433](#3433)) ([3f227a6](3f227a6)) ### Documentation * Add pixi setup instructions to general use tutorial ([#3382](#3382)) ([115e81b](115e81b)) * fix contribution section heading levels, fix docs testing setup order ([#3360](#3360)) ([051dc53](051dc53)) * fix link to github.com/snakemake/poetry-snakemake-plugin ([#3436](#3436)) ([ec6d97c](ec6d97c)) * fix quoting ([#3394](#3394)) ([b40f599](b40f599)) * fix rerun-triggers default ([#3403](#3403)) ([4430e23](4430e23)) * fix typo 'safe' -> 'save' ([#3384](#3384)) ([7755861](7755861)) * mention code formatting in the contribution section ([#3431](#3431)) ([e8682b7](e8682b7)) * remove duplicated 'functions'. ([#3356](#3356)) ([7c595db](7c595db)) * update broken links documentation ([#3437](#3437)) ([e3d0d88](e3d0d88)) * Updating contributing guidelines with new pixi dev setup ([#3415](#3415)) ([8e95a12](8e95a12)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>
…nakemake#3262) <!--Add a description of your PR here--> Implement validation for polars DataFrame and LazyFrame. Re-factoring the setting of default values. ### QC <!-- Make sure that you can tick the boxes below. --> * [x] The PR contains a test case for the changes or the changes are already covered by an existing test case. * [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced sample data validation now supports both Pandas and Polars data frames for improved reliability and performance. - Introduced new methods for reading sample data using Polars, expanding data handling options. - Added support for executing Xonsh scripts within Snakemake workflows. - New rule added for running Python scripts with Conda environments. - New functionality for generating self-contained HTML reports, including default statistics and user-specified results. - New functions added for parsing input files and extracting checksums. - **Bug Fixes** - Improved error handling for validation failures, providing more specific error messages. - **Documentation** - Updated the sample metadata schema with new fields for replicate count and tissue origin, alongside a refined description for sample condition. - Clarified usage of conda environments and apptainer integration within Snakemake workflows. - Expanded guidance on generating, customizing, and sharing reports in Snakemake. - Added documentation for integrating Xonsh scripts into Snakemake rules. - Updated help text for the `--keep-storage-local-copies` argument to enhance clarity and usability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
🤖 I have created a release *beep* *boop* --- ## [9.0.0](snakemake/snakemake@v8.30.0...v9.0.0) (2025-03-14) ### ⚠ BREAKING CHANGES * Logging refactor & add LoggerPluginInterface ([snakemake#3107](snakemake#3107)) ### Features * [snakemake#3412](snakemake#3412) - keep shadow folder of failed job if --keep-incomplete flag is set. ([snakemake#3430](snakemake#3430)) ([22978c3](snakemake@22978c3)) * add flag --report-after-run to automatically generate the report after a successfull workflow run ([snakemake#3428](snakemake#3428)) ([b0a7f03](snakemake@b0a7f03)) * add flatten function to IO utils ([snakemake#3424](snakemake#3424)) ([67fa392](snakemake@67fa392)) * add helper functions to parse input files ([snakemake#2918](snakemake#2918)) ([63e45a7](snakemake@63e45a7)) * Add option to print redacted file names ([snakemake#3089](snakemake#3089)) ([ba4d264](snakemake@ba4d264)) * add support for validation of polars dataframe and lazyframe ([snakemake#3262](snakemake#3262)) ([c7473a6](snakemake@c7473a6)) * added support for rendering dag with mermaid js ([snakemake#3409](snakemake#3409)) ([7bf8381](snakemake@7bf8381)) * adding --replace-workflow-config to fully replace workflow configs (from config: directive) with --configfile, instead of merging them ([snakemake#3381](snakemake#3381)) ([47504a0](snakemake@47504a0)) * Dynamic module name ([snakemake#3401](snakemake#3401)) ([024dc32](snakemake@024dc32)) * Enable saving and reloading IOCache object ([snakemake#3386](snakemake#3386)) ([c935953](snakemake@c935953)) * files added in rule params with workflow.source_path will be available in used containers ([snakemake#3385](snakemake#3385)) ([a6e45bf](snakemake@a6e45bf)) * Fix keep_local in storage directive and more freedom over remote retrieval behaviour ([snakemake#3410](snakemake#3410)) ([67b4739](snakemake@67b4739)) * inherit parameters of use rule and extend/replace individual items them when using 'with' directive ([snakemake#3365](snakemake#3365)) ([93e4b92](snakemake@93e4b92)) * Logging refactor & add LoggerPluginInterface ([snakemake#3107](snakemake#3107)) ([86f1d6e](snakemake@86f1d6e)) * Maximal file size for checksums ([snakemake#3368](snakemake#3368)) ([b039f8a](snakemake@b039f8a)) * Modernize package configuration using Pixi ([snakemake#3369](snakemake#3369)) ([77992d8](snakemake@77992d8)) * multiext support for named input/output ([snakemake#3372](snakemake#3372)) ([05e1378](snakemake@05e1378)) * optionally auto-group jobs via temp files in case of remote execution ([snakemake#3378](snakemake#3378)) ([cc9bba2](snakemake@cc9bba2)) ### Bug Fixes * `--delete-all-output` ignores `--dry-run` ([snakemake#3265](snakemake#3265)) ([23fef82](snakemake@23fef82)) * 3342 faster touch runs and warning messages for non-existing files ([snakemake#3398](snakemake#3398)) ([cd9c3c3](snakemake@cd9c3c3)) * add default value to max-jobs-per-timespan ([snakemake#3043](snakemake#3043)) ([2959abe](snakemake@2959abe)) * checkpoints inside modules are overwritten ([snakemake#3359](snakemake#3359)) ([fba3ac7](snakemake@fba3ac7)) * Convert Path to IOFile ([snakemake#3405](snakemake#3405)) ([c58684c](snakemake@c58684c)) * Do not perform storage object cleanup with --keep-storage-local-copies set ([snakemake#3358](snakemake#3358)) ([9a6d14b](snakemake@9a6d14b)) * edgecases of source deployment in case of remote execution ([snakemake#3396](snakemake#3396)) ([5da13be](snakemake@5da13be)) * enhance error message formatting for strict DAG-building mode ([snakemake#3376](snakemake#3376)) ([a1c39ee](snakemake@a1c39ee)) * fix bug in checkpoint handling that led to exceptions in case checkpoint output was missing upon rerun ([snakemake#3423](snakemake#3423)) ([8cf4a2f](snakemake@8cf4a2f)) * force check all required outputs ([snakemake#3341](snakemake#3341)) ([495a4e7](snakemake@495a4e7)) * group job formatting ([snakemake#3442](snakemake#3442)) ([f0b10a3](snakemake@f0b10a3)) * in remote jobs, upload storage in topological order such that modification dates are preserved (e.g. in case of group jobs) ([snakemake#3377](snakemake#3377)) ([eace08f](snakemake@eace08f)) * only skip eval when resource depends on input ([snakemake#3374](snakemake#3374)) ([4574c92](snakemake@4574c92)) * Prevent execution of conda in apptainer when not explicitly requested in software deployment method ([snakemake#3388](snakemake#3388)) ([c43c5c0](snakemake@c43c5c0)) * print filenames with quotes around them in RuleException ([snakemake#3269](snakemake#3269)) ([6baeda5](snakemake@6baeda5)) * Re-evaluation of free resources ([snakemake#3399](snakemake#3399)) ([6371293](snakemake@6371293)) * ReadTheDocs layout issue due to src directory change ([snakemake#3419](snakemake#3419)) ([695b127](snakemake@695b127)) * robustly escaping quotes in generated bash scripts (v2) ([snakemake#3297](snakemake#3297)) ([snakemake#3389](snakemake#3389)) ([58720bd](snakemake@58720bd)) * Show apptainer image URL in snakemake report ([snakemake#3407](snakemake#3407)) ([45f0450](snakemake@45f0450)) * Update ReadTheDocs configuration for documentation build to use Pixi ([snakemake#3433](snakemake#3433)) ([3f227a6](snakemake@3f227a6)) ### Documentation * Add pixi setup instructions to general use tutorial ([snakemake#3382](snakemake#3382)) ([115e81b](snakemake@115e81b)) * fix contribution section heading levels, fix docs testing setup order ([snakemake#3360](snakemake#3360)) ([051dc53](snakemake@051dc53)) * fix link to github.com/snakemake/poetry-snakemake-plugin ([snakemake#3436](snakemake#3436)) ([ec6d97c](snakemake@ec6d97c)) * fix quoting ([snakemake#3394](snakemake#3394)) ([b40f599](snakemake@b40f599)) * fix rerun-triggers default ([snakemake#3403](snakemake#3403)) ([4430e23](snakemake@4430e23)) * fix typo 'safe' -> 'save' ([snakemake#3384](snakemake#3384)) ([7755861](snakemake@7755861)) * mention code formatting in the contribution section ([snakemake#3431](snakemake#3431)) ([e8682b7](snakemake@e8682b7)) * remove duplicated 'functions'. ([snakemake#3356](snakemake#3356)) ([7c595db](snakemake@7c595db)) * update broken links documentation ([snakemake#3437](snakemake#3437)) ([e3d0d88](snakemake@e3d0d88)) * Updating contributing guidelines with new pixi dev setup ([snakemake#3415](snakemake#3415)) ([8e95a12](snakemake@8e95a12)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>



Implement validation for polars DataFrame and LazyFrame.
Re-factoring the setting of default values.
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
New Features
Bug Fixes
Documentation
--keep-storage-local-copiesargument to enhance clarity and usability.