feat: Maximal file size for checksums#3368
Conversation
📝 WalkthroughWalkthroughThe changes integrate a new configuration parameter ( Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant API
participant Workflow
participant DAG
participant Persistence
participant IO
User->>CLI: Provide --max-checksum-file-size value
CLI->>CLI: Call parse_size_in_bytes to convert value
CLI->>API: Pass max_checksum_file_size in API call
API->>Workflow: Forward configuration parameter
Workflow->>DAG: Initialize DAG with max_checksum_file_size from DAGSettings
DAG->>IO: Invoke checksum methods with explicit max_checksum_file_size threshold
DAG->>Persistence: Use max_checksum_file_size in file evaluation
Persistence->>IO: Call IO checksum functions with provided threshold
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (30)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
snakemake/persistence.py (1)
1-787:⚠️ Potential issueNote: Black formatting issue detected.
The CI pipeline reports a Black formatting issue. While we don't need to fix formatting according to the coding guidelines, you should be aware of this issue.
🧰 Tools
🪛 Ruff (0.8.2)
272-272: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
279-281: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
538-538: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
546-546: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
551-551: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
571-571: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
609-609: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
618-618: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
663-663: Use of
functools.lru_cacheorfunctools.cacheon methods can lead to memory leaks(B019)
673-673: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
🪛 GitHub Actions: CI
[error] 1-1: Black formatting check failed. 2 files would be reformatted, 140 files would be left unchanged.
🧹 Nitpick comments (5)
snakemake/workflow.py (1)
1812-1814: Follow Python's style guide for boolean comparisonsReplace direct comparison to
Falsewith a truthiness check to follow Python's style guide.-if not ruleinfo.template_engine and ruleinfo.container_img != False: +if not ruleinfo.template_engine and ruleinfo.container_img: # skip rules with template_engine directive or empty image rule.container_img = self.global_container_img🧰 Tools
🪛 Ruff (0.8.2)
1812-1812: Avoid inequality comparisons to
False; useif ruleinfo.container_img:for truth checksReplace with
ruleinfo.container_img(E712)
tests/test_conda_run/test_script_run.py (1)
1-4: Consider adding error handling to this script.The script currently has no error handling for potential issues like NumPy import failure or file operation errors. While this is a simple test script, adding basic error handling would make it more robust.
import numpy -with open("test.txt", "w") as f: - f.write(str(numpy.log2(8))) +try: + with open("test.txt", "w") as f: + f.write(str(numpy.log2(8))) +except Exception as e: + print(f"Error: {e}") + exit(1)docs/snakefiles/rules.rst (1)
1592-1611: Added Xonsh Script Integration SectionThis newly added section introduces documentation on using Xonsh scripts (with the “.xsh” extension) within rules. The code block and accompanying link (https://xon.sh) clearly inform users that Xonsh, being a superset of Python, can be used similarly to a Python script while leveraging additional shell primitives.
Consider a minor wording adjustment: instead of “script:” maybe clarify that rules point to an Xonsh script file. Additionally, check that the formatting (indentation and spacing) is consistent with the surrounding snippets.
docs/snakefiles/deployment.rst (2)
288-294: Clarified Conda Environment Behavior with therunDirectiveThe updated note now clearly explains that conda environments are available for use with the
shell,script,notebook,wrapper, and notably therundirective. It emphasizes that because arunblock executes in the same process as Snakemake, the conda directive only affects shell function calls within that block. This clarification is very helpful for users who might otherwise expect full isolation even inrunblocks.A minor suggestion is to standardize phrasing by replacing “from the within
runscript” with “within the run block.”
464-468: Clarified Apptainer Integration with therunDirectiveSimilarly, the note regarding apptainer integration now specifies that when using the
rundirective, containerization only impacts shell function calls executed from inside the run block. This helps set the correct expectations for users leveraging both containerization and therundirective.A slight wording improvement—using “within the run block” rather than “from within the run script”—could enhance overall clarity and consistency with the conda note.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/test_script_xsh/expected-results/test.outis excluded by!**/*.out
📒 Files selected for processing (23)
snakemake/cli.py(2 hunks)snakemake/dag.py(4 hunks)snakemake/io.py(2 hunks)snakemake/settings/types.py(1 hunks)snakemake/io.py(2 hunks)snakemake/persistence.py(2 hunks)snakemake/cli.py(1 hunks)docs/snakefiles/deployment.rst(2 hunks)docs/snakefiles/reporting.rst(5 hunks)docs/snakefiles/rules.rst(1 hunks)snakemake/cli.py(1 hunks)snakemake/script/__init__.py(3 hunks)snakemake/workflow.py(1 hunks)tests/test_conda_python_3_7_script/Snakefile(1 hunks)tests/test_conda_python_3_7_script/test_script_python_3_7.py(1 hunks)tests/test_conda_run/Snakefile(1 hunks)tests/test_conda_run/expected-results/test.txt(1 hunks)tests/test_conda_run/test_python_env.yaml(1 hunks)tests/test_conda_run/test_script_run.py(1 hunks)tests/test_script_xsh/Snakefile(1 hunks)tests/test_script_xsh/envs/xonsh.yaml(1 hunks)tests/test_script_xsh/scripts/test.xsh(1 hunks)tests/tests_using_conda.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
tests/test_conda_run/test_script_run.pytests/test_conda_python_3_7_script/test_script_python_3_7.pytests/tests_using_conda.pysnakemake/settings/types.pysnakemake/persistence.pysnakemake/cli.pysnakemake/workflow.pysnakemake/io.pysnakemake/dag.pysnakemake/script/__init__.py
🪛 GitHub Actions: CI
snakemake/persistence.py
[error] 1-1: Black formatting check failed. 2 files would be reformatted, 140 files would be left unchanged.
snakemake/dag.py
[error] 1-1: Black formatting check failed. 2 files would be reformatted, 140 files would be left unchanged.
🪛 Ruff (0.8.2)
snakemake/workflow.py
1812-1812: Avoid inequality comparisons to False; use if ruleinfo.container_img: for truth checks
Replace with ruleinfo.container_img
(E712)
🔇 Additional comments (36)
snakemake/workflow.py (4)
1779-1787: Good addition of centralized validation functionThe introduction of the
check_may_use_software_deploymentfunction centralizes the validation logic for software deployment directives, improving code maintainability and reducing duplication.
1794-1794: Appropriate use of centralized validationGood refactoring to replace inline checks with the new centralized validation function for environment modules.
1800-1800: Appropriate use of centralized validationGood refactoring to replace inline checks with the new centralized validation function for conda environments.
1808-1808: Appropriate use of centralized validationGood refactoring to replace inline checks with the new centralized validation function for container/singularity directives.
tests/test_conda_run/expected-results/test.txt (1)
1-1: New test expected result file looks goodThis file contains the expected output value (3.0) for the test case that calculates the logarithm base 2 of 8 using numpy.
tests/test_script_xsh/scripts/test.xsh (1)
1-2: Good addition of Xonsh script for testingThe new Xonsh script correctly writes "Hello, world!" to the output file specified in the snakemake.output list. This expands test coverage to include Xonsh script execution capabilities.
tests/test_conda_python_3_7_script/test_script_python_3_7.py (1)
4-4: Minor style change from single to double quotesThe change from single quotes to double quotes is a style change that doesn't affect functionality. Both formats are valid for string literals in Python.
tests/test_script_xsh/envs/xonsh.yaml (1)
1-5: LGTM: Proper conda environment specification for xonsh.The environment setup for xonsh includes the necessary channels and dependency required for testing xonsh script execution.
tests/test_conda_run/test_python_env.yaml (1)
1-6: LGTM: Appropriate conda environment for Python with NumPy.The environment definition includes the necessary channels and dependencies for the Python test script that uses NumPy.
tests/test_conda_python_3_7_script/Snakefile (1)
7-7: LGTM: Appropriate script name update.The updated script name better indicates the Python version specificity of the test script.
tests/test_script_xsh/Snakefile (1)
1-12:Details
✅ Verification successful
LGTM: Well-formed Snakefile for testing xonsh script integration.
The Snakefile correctly defines a workflow to test the execution of xonsh scripts using conda environments.
One observation: Make sure the
scripts/test.xshfile exists at this location, as this Snakefile depends on it but it wasn't included in the files to review.
🏁 Script executed:
#!/bin/bash # Check if the xonsh script file exists if [ -f "tests/test_script_xsh/scripts/test.xsh" ]; then echo "File exists: tests/test_script_xsh/scripts/test.xsh" cat "tests/test_script_xsh/scripts/test.xsh" else echo "File not found: tests/test_script_xsh/scripts/test.xsh" fiLength of output: 255
File existence confirmed for xonsh script dependency.
The Snakefile is set up correctly for testing the xonsh integration using a conda environment. We verified that the referenced script exists at the expected relative path (i.e., within
tests/test_script_xsh/scripts/test.xsh), and its content is as anticipated. No further changes are required.snakemake/settings/types.py (1)
205-205: Good addition of the max_checksum_file_size parameter.The new parameter introduces a useful limit on file sizes for checksum computation, which aligns with the PR objective to allow specifying a maximum file size for checksums during DAG construction. The default value of 1MB seems reasonable.
tests/test_conda_run/Snakefile (1)
1-13: Well-structured test rule with helpful comments.The rule effectively tests Conda integration for running Python scripts. The comments clearly explain that this approach is for testing purposes and provide guidance about using the script directive in actual workflows.
tests/tests_using_conda.py (2)
309-316: Good test coverage for Xonsh script execution.This test appropriately handles platform specificity with the
@skip_on_windowsdecorator and follows the established testing pattern in the file.
318-321: Test aligns with PR objectives for Conda run functionality.This test function appropriately tests the new Conda run feature and follows the established pattern of other test functions in the file.
snakemake/persistence.py (2)
133-133: Appropriate integration of max_checksum_file_size parameter.The new parameter is correctly initialized from the DAG settings, ensuring consistency across the codebase.
330-330: Proper usage of max_checksum_file_size in checksum calculation.The parameter is correctly passed to the checksum method for input files, implementing the file size limit functionality.
snakemake/cli.py (2)
957-965: Enhanced report documentation looks good!The improved help text for the
--reportoption provides clearer guidance about report generation options, explaining the difference between embedded HTML reports for smaller datasets and ZIP archives for larger ones.
1303-1309: New parameter implementation for maximum checksum file size works as expected.This new parameter implements the core feature mentioned in the PR objectives. Setting a 1MB default for maximum file size for checksum computation is a reasonable choice that balances performance and reliability.
snakemake/dag.py (4)
147-147: Good addition of themax_checksum_file_sizeparameter.Initializing this attribute from the workflow dag settings allows for user configuration of maximum file size for checksum verification, which will improve performance when working with large files.
650-650: Updated checksum verification to respect file size limits.This change properly passes the
max_checksum_file_sizeparameter to the checksum verification while maintaining theforce=Trueparameter, ensuring backward compatibility while adding the new file size limit feature.
1231-1231: Enhanced checksum eligibility check with file size limit.The code now correctly uses the new
max_checksum_file_sizeparameter to determine if a file is eligible for checksum calculation, which will prevent unnecessary checksum calculations for large files.
1245-1245: Consistent implementation of the file size limit for checksums.The
is_same_checksumcall now includes themax_checksum_file_sizeparameter, maintaining consistency with the other checksum verification calls in the codebase.snakemake/script/__init__.py (4)
1600-1602: Well-implemented Xonsh script support.The new
XonshScriptclass properly inherits fromPythonScriptand appropriately overrides just theexecute_scriptmethod to use the Xonsh shell instead of Python for execution. This implementation is clean and follows the established pattern used by other script type handlers in the codebase.
1667-1668: Good extension of language detection.The extension of the
get_languagefunction to recognize.xshfiles maintains consistency with how other file extensions are detected. This change properly integrates with the existing language detection logic.
1732-1732: Correctly registered XonshScript executor.The Xonsh script type is properly registered in the
exec_classdictionary, allowing the correct executor to be used for Xonsh scripts.
1736-1737: Well-updated error message.The error message is appropriately updated to include Xonsh as a supported script type, maintaining user-friendly error reporting.
snakemake/io.py (3)
599-605: Proper removal of default threshold parameter.The default value for the
thresholdparameter was removed fromis_checksum_eligible, which aligns with the PR's objective to make this value configurable through themax_checksum_file_sizeparameter. This change ensures that threshold values must be explicitly provided, improving configurability.
607-624: Consistent threshold parameter handling.The default value for the
thresholdparameter was properly removed from thechecksummethod as well, maintaining consistency with the changes tois_checksum_eligible. The method documentation is still accurate and clearly explains the purpose of the parameters.
626-632: Completed the threshold parameter standardization.The default value for the
thresholdparameter was also removed fromis_same_checksum, completing the standardization of threshold parameter handling across all checksum-related methods. This ensures consistent and explicit threshold specification.docs/snakefiles/reporting.rst (6)
7-10: Improved report introduction with clarity on output formats.The enhanced introduction provides a clearer explanation of what reports include (runtime statistics, provenance information, workflow topology) and introduces the distinction between single HTML files for smaller reports and ZIP archives for more complex ones. This helps users understand the reporting capabilities better.
13-17: Well-structured section heading for report inclusion.Adding a dedicated section with proper ReStructuredText anchors for "Including results in a report" improves the document organization and makes it easier to link to this specific section from elsewhere in the documentation.
257-286: Clear instructions for HTML report generation.This new section provides detailed and clear instructions for generating self-contained HTML reports, including how to specify custom report names. The note about HTML files being suitable only for smaller reports is important guidance for users to set appropriate expectations.
287-303: Comprehensive explanation of ZIP archive reports.The addition of instructions for generating self-contained ZIP archive reports is valuable, especially the explanation of when to use this format (for complex reports with many outputs) and how to navigate the resulting archive. This will help users choose the appropriate report format for their needs.
304-317: Useful information on partial reports.The new section on partial reports explains how to generate reports for specific targets, which is particularly useful for large workflows or those with multiple alternative targets. The explanation of use cases helps users understand when this feature might be beneficial.
318-330: Well-documented custom layout capabilities.The section on custom layouts provides clear instructions for using custom stylesheets, with helpful examples and downloadable resources. This enables users to brand their reports appropriately, which is especially valuable for institutional use.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
test-environment.yml (2)
7-7: Clarify thesetuptoolsDependency VersionThe new inclusion of
setuptoolsin the top-level dependencies is appropriate for ensuring that build and installation processes have the necessary tooling. Consider specifying a version constraint if your project relies on features or compatibility fixes from a specific version.
61-62: Elevatecwltoolandcwl-utilsto Top-Level DependenciesMoving
cwltoolandcwl-utilsfrom the nestedpipsection to the top-level dependencies enhances consistency and can simplify dependency resolution. Please verify that these packages are available in the Conda channels (e.g., conda-forge or bioconda) as expected. This change aligns well with the overall objective of standardizing dependency management.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pyproject.tomlis excluded by!pyproject.toml
📒 Files selected for processing (1)
test-environment.yml(2 hunks)
johanneskoester
left a comment
There was a problem hiding this comment.
Thanks a lot! One minor thing below.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/workflow.py (1)
1812-1813: Avoid inequality comparison toFalseAccording to Python style conventions, it's better to use truthiness checks rather than explicit comparisons to
False.- if not ruleinfo.template_engine and ruleinfo.container_img != False: + if not ruleinfo.template_engine and ruleinfo.container_img is not False:Using
is not Falseis more appropriate here since you're specifically checking for the False constant rather than falsy values in general.🧰 Tools
🪛 Ruff (0.8.2)
1813-1813: Avoid inequality comparisons to
False; useif ruleinfo.container_img:for truth checksReplace with
ruleinfo.container_img(E712)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
snakemake/cli.py(2 hunks)snakemake/cli.py(1 hunks)snakemake/cli.py(1 hunks)snakemake/workflow.py(1 hunks)snakemake/cli.py(2 hunks)snakemake/cli.py(1 hunks)snakemake/cli.py(1 hunks)snakemake/workflow.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- snakemake/cli.py
- snakemake/cli.py
- snakemake/cli.py
- snakemake/cli.py
- snakemake/cli.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake/workflow.pysnakemake/cli.py
🪛 Ruff (0.8.2)
snakemake/workflow.py
1813-1813: Avoid inequality comparisons to False; use if ruleinfo.container_img: for truth checks
Replace with ruleinfo.container_img
(E712)
🔇 Additional comments (6)
snakemake/cli.py (2)
73-76: Excellent implementation of the human-friendly size parserThe function correctly imports and uses the
parse_sizefunction from thehumanfriendlylibrary, allowing users to specify sizes in a more readable format (e.g., "1GB", "500MB").
1308-1314: Good enhancement to the argument parser for file size specificationReplacing
type=intwithparse_func=parse_size_in_bytesimproves user experience by allowing human-readable size formats (like "100MB" or "1GB") instead of requiring raw byte counts.Note that this implements johanneskoester's suggestion from the previous review to use the humanfriendly package for parsing size values.
snakemake/workflow.py (4)
1779-1787: Well-designed refactoring of validation logicThe new
check_may_use_software_deploymentfunction nicely centralizes the logic for validating whether a rule can use specific software deployment methods. This improves maintainability by reducing code duplication and ensuring consistent error messages.
1794-1794: Clean code refactoring using the new helper functionGood usage of the newly created validation function for the envmodules directive.
1800-1800: Clean code refactoring using the new helper functionGood usage of the newly created validation function for the conda directive.
1808-1808: Clean code refactoring using the new helper functionGood usage of the newly created validation function for the container/singularity directive.
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🛑 Comments failed to post (1)
snakemake/ioutils/input.py (1)
12-30:
⚠️ Potential issueFix import error handling and lambda usage in the
extract_checksumfunction.There are several issues that need to be addressed:
WorkflowErroris used but not imported- Lambda function usage where a regular function is preferred
- Exception handling doesn't preserve the original exception context
Apply these changes:
+from snakemake.exceptions import WorkflowError def extract_checksum(infile, **kwargs): try: import pandas as pd - fix_file_name = lambda x: x.removeprefix("./") + def fix_file_name(x): + return x.removeprefix("./") + return ( pd.read_csv( infile, sep=" ", header=None, engine="python", converters={1: fix_file_name}, ) .set_index(1) .loc[fix_file_name(kwargs.get("file"))] .item() ) - except ImportError: - raise WorkflowError("Pandas is required to extract checksum from file.") + except ImportError as e: + raise WorkflowError("Pandas is required to extract checksum from file.") from e📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.from snakemake.exceptions import WorkflowError def extract_checksum(infile, **kwargs): try: import pandas as pd def fix_file_name(x): return x.removeprefix("./") return ( pd.read_csv( infile, sep=" ", header=None, engine="python", converters={1: fix_file_name}, ) .set_index(1) .loc[fix_file_name(kwargs.get("file"))] .item() ) except ImportError as e: raise WorkflowError("Pandas is required to extract checksum from file.") from e🧰 Tools
🪛 Ruff (0.8.2)
16-16: Do not assign a
lambdaexpression, use adefRewrite
fix_file_nameas adef(E731)
30-30: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
30-30: Undefined name
WorkflowError(F821)
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🧹 Nitpick comments (11)
snakemake/utils.py (4)
130-134: Consider chaining the raised exception.Use exception chaining to improve tracebacks, for example:
- raise WorkflowError( - f"Error validating row {i} of data frame.", e - ) + raise WorkflowError( + f"Error validating row {i} of data frame." + ) from e🧰 Tools
🪛 Ruff (0.8.2)
132-134: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
166-170: Consider chaining the raised exception.Same rationale as above; use
from eto trace the original cause.🧰 Tools
🪛 Ruff (0.8.2)
168-170: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
205-208: Consider exception chaining for better tracebacks.Apply the same solution with
from ein this block.🧰 Tools
🪛 Ruff (0.8.2)
206-208: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
222-224: Chaining the exception clarifies error origins.Use
raise WorkflowError("...", e) from eor similar to chain the original exception.🧰 Tools
🪛 Ruff (0.8.2)
224-224: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
.github/workflows/main.yml (2)
51-54: Address empty exclude and trailing spaces.Consider removing the empty
exclude: []block if not needed, and eliminate trailing whitespace on line 53.🧰 Tools
🪛 actionlint (1.7.4)
52-52: "exclude" section should not be empty
(syntax-check)
🪛 YAMLlint (1.35.1)
[error] 53-53: trailing spaces
(trailing-spaces)
210-218: Add a newline at the end of file.The YAML linter flags a missing newline at line 218. Also, the new
testing-donejob logic looks solid.🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 218-218: no new line character at the end of file
(new-line-at-end-of-file)
tests/test_validate/samples.schema.yaml (1)
9-17: Consider stricter validation rules.
Currently,ndefaults to 0, andtissuedefaults to "blood." If these values have meaningful constraints (e.g.,n >= 1or a limited set of valid tissues), adding constraints (e.g.,minimum,enum) could improve data integrity. Otherwise, 0 or "blood" might silently hide errors.snakemake/assets/__init__.py (2)
96-351: Consider centralizing asset definitions.
A large block of asset references has been added for Vega and its dependencies. Storing them in a separate data structure or file could reduce duplication (e.g., repeated SHA256 values) and simplify updates.
510-530: Check for potential future updates in prop-types dependencies.
Though many tree-shaken dependencies are not included inprop-types.min.js, maintaining these references in code may complicate updates. Consider a system to dynamically fetch license references or store them in a standalone configuration file for quicker updates.tests/test_validate/Snakefile (2)
11-27: Reduce duplication by validating more rows programmatically.
Lines 11–21 and 22–27 both grab the first two rows in different DF contexts. If your schema applies to all rows, consider validating every row or the entire DataFrame once, rather than duplicating similar logic for each row index.
41-50: Confirm LazyFrame behavior with schema validation.
You are validating the LazyFrame withset_default=Falseprior to calling.collect(). Ensure that any default values or schema-based transformations occur as intended. If not, callingvalidate()after materializing the data might be more accurate.
🛑 Comments failed to post (1)
tests/test_validate/Snakefile (1)
55-57: 🛠️ Refactor suggestion
Consolidate index-based validations.
Lines 55–57 repeat checks done earlier on the first two rows. If the entire DataFrame must conform, consider applying a single validation pass for all rows or unify index-based checks into a helper function to reduce repetition.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/cli.py (1)
1317-1324: Good implementation of the new configuration parameterThe addition of the
--max-checksum-file-sizeparameter is well-implemented with a clear help message, appropriate default value, and proper integration with the argument parser.One suggestion would be to make the help text more specific by including the default value and giving an example of how to specify sizes in human-readable format:
help="Compute the checksum during DAG computation and job postprocessing " - "only for files that are smaller than the provided threshold.", + "only for files that are smaller than the provided threshold. Default is 1MB. " + "Values can be specified in human-readable format, e.g., '10MB', '1GB'.",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
snakemake/cli.py(3 hunks)snakemake/dag.py(4 hunks)snakemake/persistence.py(2 hunks)snakemake/settings/types.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- snakemake/settings/types.py
- snakemake/persistence.py
- snakemake/dag.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake/cli.py
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: testing (10, 3.12, bash)
- GitHub Check: testing (10, 3.11, bash)
- GitHub Check: testing (9, 3.12, bash)
- GitHub Check: testing (9, 3.11, bash)
- GitHub Check: testing (8, 3.12, bash)
- GitHub Check: testing (8, 3.11, bash)
- GitHub Check: testing (7, 3.12, bash)
- GitHub Check: testing (7, 3.11, bash)
- GitHub Check: testing (6, 3.12, bash)
- GitHub Check: testing (6, 3.11, bash)
- GitHub Check: testing (5, 3.12, bash)
- GitHub Check: testing-windows (10)
- GitHub Check: testing (5, 3.11, bash)
- GitHub Check: testing-windows (9)
- GitHub Check: testing (4, 3.12, bash)
- GitHub Check: testing-windows (8)
- GitHub Check: testing (4, 3.11, bash)
- GitHub Check: testing-windows (7)
- GitHub Check: testing (3, 3.12, bash)
- GitHub Check: testing-windows (6)
- GitHub Check: testing (3, 3.11, bash)
- GitHub Check: testing-windows (5)
- GitHub Check: testing (2, 3.12, bash)
- GitHub Check: testing-windows (4)
- GitHub Check: testing (2, 3.11, bash)
- GitHub Check: testing-windows (3)
- GitHub Check: testing (1, 3.12, bash)
- GitHub Check: testing-windows (2)
- GitHub Check: testing (1, 3.11, bash)
- GitHub Check: testing-windows (1)
🔇 Additional comments (2)
snakemake/cli.py (2)
73-76: Consider simplifying by using humanfriendly.parse_size directlyThis function is a simple wrapper around
humanfriendly.parse_size. As suggested in a previous comment by johanneskoester, you could passhumanfriendly.parse_sizedirectly to the argument parser to eliminate this wrapper function.- def parse_size_in_bytes(value): - from humanfriendly import parse_size - - return parse_size(value)Then in the argument parser definition (lines 1317-1324), you would modify it like this:
group_behavior.add_argument( "--max-checksum-file-size", default=1000000, metavar="BYTES", - parse_func=parse_size_in_bytes, + parse_func=humanfriendly.parse_size, help="Compute the checksum during DAG computation and job postprocessing " "only for files that are smaller than the provided threshold.", )Don't forget to add
from humanfriendly import parse_sizenear the top of the file with the other imports.
2121-2121: Correctly integrated the new parameter into DAG settingsThe
max_checksum_file_sizeparameter is properly passed to theDAGSettingsclass, ensuring it will be available during DAG computation.
|
🤖 I have created a release *beep* *boop* --- ## [9.0.0](v8.30.0...v9.0.0) (2025-03-14) ### ⚠ BREAKING CHANGES * Logging refactor & add LoggerPluginInterface ([#3107](#3107)) ### Features * [#3412](#3412) - keep shadow folder of failed job if --keep-incomplete flag is set. ([#3430](#3430)) ([22978c3](22978c3)) * add flag --report-after-run to automatically generate the report after a successfull workflow run ([#3428](#3428)) ([b0a7f03](b0a7f03)) * add flatten function to IO utils ([#3424](#3424)) ([67fa392](67fa392)) * add helper functions to parse input files ([#2918](#2918)) ([63e45a7](63e45a7)) * Add option to print redacted file names ([#3089](#3089)) ([ba4d264](ba4d264)) * add support for validation of polars dataframe and lazyframe ([#3262](#3262)) ([c7473a6](c7473a6)) * added support for rendering dag with mermaid js ([#3409](#3409)) ([7bf8381](7bf8381)) * adding --replace-workflow-config to fully replace workflow configs (from config: directive) with --configfile, instead of merging them ([#3381](#3381)) ([47504a0](47504a0)) * Dynamic module name ([#3401](#3401)) ([024dc32](024dc32)) * Enable saving and reloading IOCache object ([#3386](#3386)) ([c935953](c935953)) * files added in rule params with workflow.source_path will be available in used containers ([#3385](#3385)) ([a6e45bf](a6e45bf)) * Fix keep_local in storage directive and more freedom over remote retrieval behaviour ([#3410](#3410)) ([67b4739](67b4739)) * inherit parameters of use rule and extend/replace individual items them when using 'with' directive ([#3365](#3365)) ([93e4b92](93e4b92)) * Logging refactor & add LoggerPluginInterface ([#3107](#3107)) ([86f1d6e](86f1d6e)) * Maximal file size for checksums ([#3368](#3368)) ([b039f8a](b039f8a)) * Modernize package configuration using Pixi ([#3369](#3369)) ([77992d8](77992d8)) * multiext support for named input/output ([#3372](#3372)) ([05e1378](05e1378)) * optionally auto-group jobs via temp files in case of remote execution ([#3378](#3378)) ([cc9bba2](cc9bba2)) ### Bug Fixes * `--delete-all-output` ignores `--dry-run` ([#3265](#3265)) ([23fef82](23fef82)) * 3342 faster touch runs and warning messages for non-existing files ([#3398](#3398)) ([cd9c3c3](cd9c3c3)) * add default value to max-jobs-per-timespan ([#3043](#3043)) ([2959abe](2959abe)) * checkpoints inside modules are overwritten ([#3359](#3359)) ([fba3ac7](fba3ac7)) * Convert Path to IOFile ([#3405](#3405)) ([c58684c](c58684c)) * Do not perform storage object cleanup with --keep-storage-local-copies set ([#3358](#3358)) ([9a6d14b](9a6d14b)) * edgecases of source deployment in case of remote execution ([#3396](#3396)) ([5da13be](5da13be)) * enhance error message formatting for strict DAG-building mode ([#3376](#3376)) ([a1c39ee](a1c39ee)) * fix bug in checkpoint handling that led to exceptions in case checkpoint output was missing upon rerun ([#3423](#3423)) ([8cf4a2f](8cf4a2f)) * force check all required outputs ([#3341](#3341)) ([495a4e7](495a4e7)) * group job formatting ([#3442](#3442)) ([f0b10a3](f0b10a3)) * in remote jobs, upload storage in topological order such that modification dates are preserved (e.g. in case of group jobs) ([#3377](#3377)) ([eace08f](eace08f)) * only skip eval when resource depends on input ([#3374](#3374)) ([4574c92](4574c92)) * Prevent execution of conda in apptainer when not explicitly requested in software deployment method ([#3388](#3388)) ([c43c5c0](c43c5c0)) * print filenames with quotes around them in RuleException ([#3269](#3269)) ([6baeda5](6baeda5)) * Re-evaluation of free resources ([#3399](#3399)) ([6371293](6371293)) * ReadTheDocs layout issue due to src directory change ([#3419](#3419)) ([695b127](695b127)) * robustly escaping quotes in generated bash scripts (v2) ([#3297](#3297)) ([#3389](#3389)) ([58720bd](58720bd)) * Show apptainer image URL in snakemake report ([#3407](#3407)) ([45f0450](45f0450)) * Update ReadTheDocs configuration for documentation build to use Pixi ([#3433](#3433)) ([3f227a6](3f227a6)) ### Documentation * Add pixi setup instructions to general use tutorial ([#3382](#3382)) ([115e81b](115e81b)) * fix contribution section heading levels, fix docs testing setup order ([#3360](#3360)) ([051dc53](051dc53)) * fix link to github.com/snakemake/poetry-snakemake-plugin ([#3436](#3436)) ([ec6d97c](ec6d97c)) * fix quoting ([#3394](#3394)) ([b40f599](b40f599)) * fix rerun-triggers default ([#3403](#3403)) ([4430e23](4430e23)) * fix typo 'safe' -> 'save' ([#3384](#3384)) ([7755861](7755861)) * mention code formatting in the contribution section ([#3431](#3431)) ([e8682b7](e8682b7)) * remove duplicated 'functions'. ([#3356](#3356)) ([7c595db](7c595db)) * update broken links documentation ([#3437](#3437)) ([e3d0d88](e3d0d88)) * Updating contributing guidelines with new pixi dev setup ([#3415](#3415)) ([8e95a12](8e95a12)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>
Add a new parameter `--max-checksum-file-size` to adjust the maximal file size for which a checksum is computed when the DAG is build. This is related to snakemake#2949 ### QC <!-- Make sure that you can tick the boxes below. --> * [ ] The PR contains a test case for the changes or the changes are already covered by an existing test case. * [ ] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Added a new command-line option to let users specify a maximum file size for checksum processing, with a default threshold of 1,000,000 bytes. - **Refactor** - Updated file validation and checksum routines to enforce the new file size limit, ensuring more efficient processing across workflow operations. - Made method signatures more explicit by removing default values for threshold parameters in checksum-related methods. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Johanna <johanna_schmitz@posteo.de> Co-authored-by: johannaschmitz <56937653+johannaschmitz@users.noreply.github.com> Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
🤖 I have created a release *beep* *boop* --- ## [9.0.0](snakemake/snakemake@v8.30.0...v9.0.0) (2025-03-14) ### ⚠ BREAKING CHANGES * Logging refactor & add LoggerPluginInterface ([snakemake#3107](snakemake#3107)) ### Features * [snakemake#3412](snakemake#3412) - keep shadow folder of failed job if --keep-incomplete flag is set. ([snakemake#3430](snakemake#3430)) ([22978c3](snakemake@22978c3)) * add flag --report-after-run to automatically generate the report after a successfull workflow run ([snakemake#3428](snakemake#3428)) ([b0a7f03](snakemake@b0a7f03)) * add flatten function to IO utils ([snakemake#3424](snakemake#3424)) ([67fa392](snakemake@67fa392)) * add helper functions to parse input files ([snakemake#2918](snakemake#2918)) ([63e45a7](snakemake@63e45a7)) * Add option to print redacted file names ([snakemake#3089](snakemake#3089)) ([ba4d264](snakemake@ba4d264)) * add support for validation of polars dataframe and lazyframe ([snakemake#3262](snakemake#3262)) ([c7473a6](snakemake@c7473a6)) * added support for rendering dag with mermaid js ([snakemake#3409](snakemake#3409)) ([7bf8381](snakemake@7bf8381)) * adding --replace-workflow-config to fully replace workflow configs (from config: directive) with --configfile, instead of merging them ([snakemake#3381](snakemake#3381)) ([47504a0](snakemake@47504a0)) * Dynamic module name ([snakemake#3401](snakemake#3401)) ([024dc32](snakemake@024dc32)) * Enable saving and reloading IOCache object ([snakemake#3386](snakemake#3386)) ([c935953](snakemake@c935953)) * files added in rule params with workflow.source_path will be available in used containers ([snakemake#3385](snakemake#3385)) ([a6e45bf](snakemake@a6e45bf)) * Fix keep_local in storage directive and more freedom over remote retrieval behaviour ([snakemake#3410](snakemake#3410)) ([67b4739](snakemake@67b4739)) * inherit parameters of use rule and extend/replace individual items them when using 'with' directive ([snakemake#3365](snakemake#3365)) ([93e4b92](snakemake@93e4b92)) * Logging refactor & add LoggerPluginInterface ([snakemake#3107](snakemake#3107)) ([86f1d6e](snakemake@86f1d6e)) * Maximal file size for checksums ([snakemake#3368](snakemake#3368)) ([b039f8a](snakemake@b039f8a)) * Modernize package configuration using Pixi ([snakemake#3369](snakemake#3369)) ([77992d8](snakemake@77992d8)) * multiext support for named input/output ([snakemake#3372](snakemake#3372)) ([05e1378](snakemake@05e1378)) * optionally auto-group jobs via temp files in case of remote execution ([snakemake#3378](snakemake#3378)) ([cc9bba2](snakemake@cc9bba2)) ### Bug Fixes * `--delete-all-output` ignores `--dry-run` ([snakemake#3265](snakemake#3265)) ([23fef82](snakemake@23fef82)) * 3342 faster touch runs and warning messages for non-existing files ([snakemake#3398](snakemake#3398)) ([cd9c3c3](snakemake@cd9c3c3)) * add default value to max-jobs-per-timespan ([snakemake#3043](snakemake#3043)) ([2959abe](snakemake@2959abe)) * checkpoints inside modules are overwritten ([snakemake#3359](snakemake#3359)) ([fba3ac7](snakemake@fba3ac7)) * Convert Path to IOFile ([snakemake#3405](snakemake#3405)) ([c58684c](snakemake@c58684c)) * Do not perform storage object cleanup with --keep-storage-local-copies set ([snakemake#3358](snakemake#3358)) ([9a6d14b](snakemake@9a6d14b)) * edgecases of source deployment in case of remote execution ([snakemake#3396](snakemake#3396)) ([5da13be](snakemake@5da13be)) * enhance error message formatting for strict DAG-building mode ([snakemake#3376](snakemake#3376)) ([a1c39ee](snakemake@a1c39ee)) * fix bug in checkpoint handling that led to exceptions in case checkpoint output was missing upon rerun ([snakemake#3423](snakemake#3423)) ([8cf4a2f](snakemake@8cf4a2f)) * force check all required outputs ([snakemake#3341](snakemake#3341)) ([495a4e7](snakemake@495a4e7)) * group job formatting ([snakemake#3442](snakemake#3442)) ([f0b10a3](snakemake@f0b10a3)) * in remote jobs, upload storage in topological order such that modification dates are preserved (e.g. in case of group jobs) ([snakemake#3377](snakemake#3377)) ([eace08f](snakemake@eace08f)) * only skip eval when resource depends on input ([snakemake#3374](snakemake#3374)) ([4574c92](snakemake@4574c92)) * Prevent execution of conda in apptainer when not explicitly requested in software deployment method ([snakemake#3388](snakemake#3388)) ([c43c5c0](snakemake@c43c5c0)) * print filenames with quotes around them in RuleException ([snakemake#3269](snakemake#3269)) ([6baeda5](snakemake@6baeda5)) * Re-evaluation of free resources ([snakemake#3399](snakemake#3399)) ([6371293](snakemake@6371293)) * ReadTheDocs layout issue due to src directory change ([snakemake#3419](snakemake#3419)) ([695b127](snakemake@695b127)) * robustly escaping quotes in generated bash scripts (v2) ([snakemake#3297](snakemake#3297)) ([snakemake#3389](snakemake#3389)) ([58720bd](snakemake@58720bd)) * Show apptainer image URL in snakemake report ([snakemake#3407](snakemake#3407)) ([45f0450](snakemake@45f0450)) * Update ReadTheDocs configuration for documentation build to use Pixi ([snakemake#3433](snakemake#3433)) ([3f227a6](snakemake@3f227a6)) ### Documentation * Add pixi setup instructions to general use tutorial ([snakemake#3382](snakemake#3382)) ([115e81b](snakemake@115e81b)) * fix contribution section heading levels, fix docs testing setup order ([snakemake#3360](snakemake#3360)) ([051dc53](snakemake@051dc53)) * fix link to github.com/snakemake/poetry-snakemake-plugin ([snakemake#3436](snakemake#3436)) ([ec6d97c](snakemake@ec6d97c)) * fix quoting ([snakemake#3394](snakemake#3394)) ([b40f599](snakemake@b40f599)) * fix rerun-triggers default ([snakemake#3403](snakemake#3403)) ([4430e23](snakemake@4430e23)) * fix typo 'safe' -> 'save' ([snakemake#3384](snakemake#3384)) ([7755861](snakemake@7755861)) * mention code formatting in the contribution section ([snakemake#3431](snakemake#3431)) ([e8682b7](snakemake@e8682b7)) * remove duplicated 'functions'. ([snakemake#3356](snakemake#3356)) ([7c595db](snakemake@7c595db)) * update broken links documentation ([snakemake#3437](snakemake#3437)) ([e3d0d88](snakemake@e3d0d88)) * Updating contributing guidelines with new pixi dev setup ([snakemake#3415](snakemake#3415)) ([8e95a12](snakemake@8e95a12)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>



Add a new parameter
--max-checksum-file-sizeto adjust the maximal file size for which a checksum is computed when the DAG is build.This is related to #2949
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
New Features
Refactor