fix: Improved handling of missing output files in group job postprocessing, accounting for temporary files.#1765
Conversation
|
This looks nice because it also solves a feature I would have liked to have: #1474 It solves the whole "files that only live for the duration of the group running on the node" part of the problem. |
|
One clarification: sub-snakemake instances on the cluster actually do not automatically delete the temp files (they get called with |
|
Kudos, SonarCloud Quality Gate passed!
|
|
@pvandyken : Yes, I meant that we could start also storing our on node-local scratch, completely aleviating all problems of file that don't even live beyond group job. We're currently testing this on our production server. (:notes: Living on the edge! :guitar: :man_singer:) |
|
@DrYak Ah, no worries, I was clarifying myself because I made a mistake in the initial post. I wanted to correct it for the record. |
c0353dd to
05d92e0
Compare
|
Kudos, SonarCloud Quality Gate passed!
|
Files produced by jobs in the middle of a group job that are neither endpoints nor needed by any other jobs in the dag are no longer subject to an existance check
|
Kudos, SonarCloud Quality Gate passed!
|
|
Warning Rate limit exceeded@johanneskoester has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 34 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
📝 Walkthrough📝 Walkthrough📝 Walkthrough📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces modifications to two Snakemake core files: Changes
Sequence DiagramsequenceDiagram
participant GroupJob
participant DAG
participant Job
GroupJob->>Job: Check temporary output files
Job-->>GroupJob: File requirements
GroupJob->>GroupJob: Determine unnecessary files
GroupJob->>DAG: Pass ignore_missing_output list
DAG->>DAG: Filter output files
DAG-->>Job: Processed output files
The sequence diagram illustrates the new workflow for handling output files, showing how the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/jobs.py (1)
1464-1474: LGTM! Consider extracting the condition for better readability.The
neededfunction correctly determines if a file is required by pending jobs or is a target file. The implementation is accurate and solves the issue.Consider extracting the complex condition into a more readable form:
- if ( - not self.dag.finished(j) - and self.dag.needrun(j) - and j not in self.jobs - ) + def is_pending_job(j): + return (not self.dag.finished(j) + and self.dag.needrun(j) + and j not in self.jobs) + + if is_pending_job(j)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
snakemake/dag.py(1 hunks)snakemake/jobs.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
snakemake/jobs.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/dag.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (2)
snakemake/jobs.py (1)
1475-1478: LGTM! The implementation correctly filters unneeded temp files.
The list comprehension effectively filters temporary output files that are not needed by any pending jobs, solving the issue of unnecessary temp file checks.
snakemake/dag.py (1)
582-587: LGTM! The implementation correctly handles the list of files to ignore.
The changes properly handle the case when ignore_missing_output is a list by:
- Filtering out the ignored files from
expanded_output - Resetting
ignore_missing_outputto maintain compatibility with the rest of the method
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/jobs.py (1)
Line range hint
1537-1575: Please add test coverage for the new functionality.The implementation looks solid and addresses the issue well. However, as mentioned in the PR description, test coverage is needed. Consider adding tests that verify:
- Temporary files not needed outside the group are properly ignored
- Temporary files still needed by other jobs are not ignored
- Edge cases with mixed temporary and non-temporary files
Would you like help creating these test cases?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
snakemake/dag.py(1 hunks)snakemake/jobs.py(3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
snakemake/jobs.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/dag.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (5)
snakemake/dag.py (1)
693-698: LGTM: Proper handling of ignored output files
The implementation correctly:
- Filters out files that should be ignored from expanded_output
- Resets ignore_missing_output to empty list to maintain backward compatibility
snakemake/jobs.py (4)
1537-1546: LGTM! Well-implemented file necessity check.
The needed function correctly determines if a file is required by:
- Checking if it's a target file
- Checking if it's not temporary
- Checking if it's needed by pending jobs outside the group
1548-1552: LGTM! Correctly identifies unnecessary temporary files.
The implementation properly filters temporary output files that are not needed outside of the group job, which directly addresses the issue with cluster execution.
1570-1575: LGTM! Proper handling of postprocess parameters.
The implementation correctly:
- Passes the filtered list of files to ignore
- Disables output mtime checks to prevent interference with intermediate file updates
- Maintains proper error handling
1146-1146: LGTM! Note the breaking change in parameter type.
The parameter type change from bool to List[_IOFile] is well-aligned with the PR objectives. This allows for more granular control over which missing output files to ignore.
Let's verify if there are any external callers that might be affected:
✅ Verification successful
Parameter type change is safe to proceed
The search results show that all existing calls to postprocess() either:
- Don't pass the
ignore_missing_outputparameter at all (using default value) - Pass it through from another function call (in jobs.py)
This indicates the type change from bool to List[_IOFile] is safe as there are no direct callers that would be affected by this breaking change.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for calls to postprocess with ignore_missing_output parameter
rg "postprocess.*ignore_missing_output\s*=" --type py
Length of output: 55
Script:
#!/bin/bash
# Let's try a broader search for postprocess method calls
rg "\.postprocess\(" --type py -A 3
# Also search for the method definition to understand its usage
ast-grep --pattern 'def postprocess($$$) { $$$ }'
Length of output: 2592
|
|
@johanneskoester thanks for merging! |
🤖 I have created a release *beep* *boop* --- ## [8.26.0](v8.25.5...v8.26.0) (2024-12-23) ### Features * add helpers for deferred input/output etc. item access ([#2927](#2927)) ([2cca9bc](2cca9bc)) ### Bug Fixes * convert parameters so they can be serialized ([#2925](#2925)) ([9e653fb](9e653fb)) * correct formatting of R preamble ([#2425](#2425)) ([5380cae](5380cae)) * fix modification checks for scripts and and notebooks containing wildcards or params in their paths ([#2751](#2751)) ([773568d](773568d)) * Improved handling of missing output files in group job postprocessing, accounting for temporary files. ([#1765](#1765)) ([bac06ba](bac06ba)) * mtime of script or notebook not triggering workflow without metadata ([#3148](#3148)) ([e8a0b83](e8a0b83)) * Pass `host` attribute to `GitlabFile` instantiation within class methods ([#3155](#3155)) ([9ef52de](9ef52de)) * problem with spaces in path ([#3236](#3236)) ([2d08c63](2d08c63)) * require current yte release which contains an important bug fix for cases where numpy/pandas data is passed to templates ([#3227](#3227)) ([c3339da](c3339da)) * rerun jobs if previously failed but rule was changed afterwards (thanks to [@laf070810](https://github.com/laf070810) for bringing this up) ([#3237](#3237)) ([1dc0084](1dc0084)) * use relpath for configfiles added to the source archive (thanks to [@sposadac](https://github.com/sposadac) for the initial solution) ([#3240](#3240)) ([bff3844](bff3844)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…ssing, accounting for temporary files. (snakemake#1765) When I run group jobs on a cluster, I'll often write temporary files to the node-specific local scratch folder. I'll label them `temp(...)` and the sub-snakemake instance will dutifully delete the files when no longer needed. Even without that deletion, the files would disappear with the node local scratch as soon as the group job finishes. The problem occurs when the parent snakemake instance re-assumes control. It checks for all the outputs from the group job, finds the temp files missing, and exits with an error. But these temp files are no longer needed anywhere in the workflow, and indeed, Snakemake's immediate next step if it were to find the temp files would be to delete them. This PR filters the group job outputs before the existence checks, removing any that are: - Not needed anywhere else in the workflow - Not target files In particular, only files that would otherwise be immediately removed are excluded from the existence check (at least, that's the idea). The idea is to remove a small frustration without causing any breaking or generally noticeable changes. I've no test yet, but I'll write one if the concept here is approved. ### QC <!-- Make sure that you can tick the boxes below. --> * [ ] The PR contains a test case for the changes or the changes are already covered by an existing test case. * [ ] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced output checking mechanism to selectively ignore specific output files. - Improved handling of missing output files in job postprocessing, accounting for temporary files. - **Bug Fixes** - Refined logic for determining if files are needed by pending jobs, improving accuracy in output management. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de> Co-authored-by: Johannes Koester <johannes.koester@uni-due.de>
🤖 I have created a release *beep* *boop* --- ## [8.26.0](snakemake/snakemake@v8.25.5...v8.26.0) (2024-12-23) ### Features * add helpers for deferred input/output etc. item access ([snakemake#2927](snakemake#2927)) ([2cca9bc](snakemake@2cca9bc)) ### Bug Fixes * convert parameters so they can be serialized ([snakemake#2925](snakemake#2925)) ([9e653fb](snakemake@9e653fb)) * correct formatting of R preamble ([snakemake#2425](snakemake#2425)) ([5380cae](snakemake@5380cae)) * fix modification checks for scripts and and notebooks containing wildcards or params in their paths ([snakemake#2751](snakemake#2751)) ([773568d](snakemake@773568d)) * Improved handling of missing output files in group job postprocessing, accounting for temporary files. ([snakemake#1765](snakemake#1765)) ([bac06ba](snakemake@bac06ba)) * mtime of script or notebook not triggering workflow without metadata ([snakemake#3148](snakemake#3148)) ([e8a0b83](snakemake@e8a0b83)) * Pass `host` attribute to `GitlabFile` instantiation within class methods ([snakemake#3155](snakemake#3155)) ([9ef52de](snakemake@9ef52de)) * problem with spaces in path ([snakemake#3236](snakemake#3236)) ([2d08c63](snakemake@2d08c63)) * require current yte release which contains an important bug fix for cases where numpy/pandas data is passed to templates ([snakemake#3227](snakemake#3227)) ([c3339da](snakemake@c3339da)) * rerun jobs if previously failed but rule was changed afterwards (thanks to [@laf070810](https://github.com/laf070810) for bringing this up) ([snakemake#3237](snakemake#3237)) ([1dc0084](snakemake@1dc0084)) * use relpath for configfiles added to the source archive (thanks to [@sposadac](https://github.com/sposadac) for the initial solution) ([snakemake#3240](snakemake#3240)) ([bff3844](snakemake@bff3844)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>











When I run group jobs on a cluster, I'll often write temporary files to the node-specific local scratch folder. I'll label them
temp(...)and the sub-snakemake instance will dutifully delete the files when no longer needed. Even without that deletion, the files would disappear with the node local scratch as soon as the group job finishes.The problem occurs when the parent snakemake instance re-assumes control. It checks for all the outputs from the group job, finds the temp files missing, and exits with an error. But these temp files are no longer needed anywhere in the workflow, and indeed, Snakemake's immediate next step if it were to find the temp files would be to delete them.
This PR filters the group job outputs before the existence checks, removing any that are:
In particular, only files that would otherwise be immediately removed are excluded from the existence check (at least, that's the idea).
The idea is to remove a small frustration without causing any breaking or generally noticeable changes. I've no test yet, but I'll write one if the concept here is approved.
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
New Features
Bug Fixes