
fix: omit storage downloads during dryrun in workflows with checkpoints #3100

Merged
johanneskoester merged 3 commits into main from fix/checkpoint-dryrun-no-storage-download
Sep 25, 2024

Conversation

@johanneskoester
Contributor

@johanneskoester johanneskoester commented Sep 24, 2024

fixes snakemake/snakemake-storage-plugin-s3#29

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes, or this is not necessary (e.g. if the change modifies neither the language nor the behavior or functionality of Snakemake).

Summary by CodeRabbit

  • New Features

    • Introduced a new Snakemake workflow for managing file storage operations.
    • Added a checkpoint and rules to ensure proper execution flow for file operations.
  • Bug Fixes

    • Enhanced control flow to prevent storage input retrieval during dry runs.
  • Tests

    • Added a new test to validate storage retrieval failure scenarios during dry runs.
  • Documentation

    • New test file and configuration details added for clarity on workflow operations.

@coderabbitai
Contributor

coderabbitai bot commented Sep 24, 2024

Walkthrough

The changes introduced in this pull request modify the control flow within the postprocess_after_update asynchronous method in snakemake/dag.py. The update adds a condition to check if the workflow is not in a dry run state before retrieving storage inputs. This adjustment refines when the retrieve_storage_inputs method is invoked, specifically ensuring it only executes during actual runs rather than dry runs. Additionally, new test cases and a Snakemake workflow are introduced to validate this behavior.
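
The guard described above can be sketched as follows. This is a minimal illustration, not the code in snakemake/dag.py: the `Workflow` and `DAG` classes are hypothetical stand-ins that only reuse the attribute and method names mentioned in this PR, and passing `shared_input_output` as a parameter is an assumption made to keep the example self-contained.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Workflow:
    # Hypothetical stand-in: only the flags named in this PR are modelled.
    dryrun: bool = False
    is_main_process: bool = True
    remote_exec: bool = False


@dataclass
class DAG:
    # Hypothetical stand-in for the real DAG class.
    workflow: Workflow
    retrieved: list = field(default_factory=list)

    async def retrieve_storage_inputs(self):
        # Placeholder for the real download step.
        self.retrieved.append("storage inputs")

    async def postprocess_after_update(self, shared_input_output=True):
        # The dry-run check comes first, so nothing is downloaded
        # when the workflow is only being planned.
        if not self.workflow.dryrun and (
            (self.workflow.is_main_process and shared_input_output)
            or self.workflow.remote_exec
        ):
            await self.retrieve_storage_inputs()


def run_once(dryrun):
    # Run the post-processing step once and report what was retrieved.
    dag = DAG(Workflow(dryrun=dryrun))
    asyncio.run(dag.postprocess_after_update())
    return dag.retrieved
```

In this sketch, `run_once(True)` leaves the `retrieved` list empty, while `run_once(False)` records one retrieval.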

Changes

| File | Change Summary |
| --- | --- |
| snakemake/dag.py | Added a condition in postprocess_after_update to check for not self.workflow.dryrun before calling retrieve_storage_inputs. |
| tests/test_storage_noretrieve_dryrun/Snakefile | Introduced a Snakemake workflow with rules and checkpoints for file storage operations. |
| tests/test_storage_noretrieve_dryrun/fs/test.txt | Added a new file test.txt containing the text "test". |
| tests/tests.py | Added a test function test_storage_noretrieve_dryrun to simulate dry run behavior and validate storage retrieval failure. |

Assessment against linked issues

Objective Addressed Explanation
Not download file during dry-run (#29)

Possibly related PRs


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c67ba83 and 4d4db25.

📒 Files selected for processing (1)
  • tests/test_storage_noretrieve_dryrun/Snakefile (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_storage_noretrieve_dryrun/Snakefile

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Outside diff range and nitpick comments (1)
snakemake/dag.py (1)

Line range hint 1-1862: Summary of changes and recommendations

The changes made to the postprocess_after_update method in snakemake/dag.py successfully implement the PR objective of omitting storage downloads during dry runs in workflows with checkpoints. The modification is minimal and focused, which reduces the risk of introducing bugs.

However, to ensure the robustness of this change, I recommend the following:

  1. Add unit tests specifically for this new behavior, covering various scenarios of dry runs with and without shared input/output settings.
  2. Update the documentation to reflect this new behavior in dry run scenarios.
  3. Consider adding a comment in the code explaining the rationale behind this change, which will help future maintainers understand the intent.

To improve the overall architecture and maintainability of the code:

  1. Consider extracting the condition for storage download into a separate method, e.g., should_retrieve_storage_inputs(). This would improve readability and make it easier to test this logic independently.
  2. Evaluate if this change should be reflected in other parts of the DAG implementation or related classes to ensure consistent behavior across the entire workflow.
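
One possible shape for the extraction suggested in the first item is sketched below. The function name comes from the suggestion itself; the keyword-only signature and the choice of a free function rather than a method are assumptions made for illustration and are not part of the merged change.

```python
def should_retrieve_storage_inputs(
    *, dryrun, is_main_process, shared_input_output, remote_exec
):
    """Return True if storage inputs should be downloaded.

    Hypothetical helper: mirrors the boolean condition quoted in this PR,
    expressed as a pure function so it can be tested in isolation.
    """
    if dryrun:
        # A dry run only plans the workflow, so nothing is fetched.
        return False
    return (is_main_process and shared_input_output) or remote_exec


# A dry run wins over every other flag.
dry = should_retrieve_storage_inputs(
    dryrun=True, is_main_process=True, shared_input_output=True, remote_exec=True
)

# A real run on the main process with shared input/output retrieves inputs.
real = should_retrieve_storage_inputs(
    dryrun=False, is_main_process=True, shared_input_output=True, remote_exec=False
)
```

Keeping the predicate free of side effects means each flag combination can be checked without building a DAG.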

The implementation is approved, but please address the recommendations to ensure the change is fully integrated and tested within the Snakemake ecosystem.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4baebfb and f986e63.

Files selected for processing (1)
  • snakemake/dag.py (1 hunks)
Additional context used
Path-based instructions (1)
snakemake/dag.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Comment on lines +1859 to +1862
if not self.workflow.dryrun and (
    (self.workflow.is_main_process and shared_input_output)
    or self.workflow.remote_exec
):
Contributor


💡 Codebase verification

Add Tests to Verify Storage Behavior During Dry Runs

The implementation correctly introduces a condition to prevent storage downloads during dry runs, aligning with the PR objective to omit storage downloads in workflows with checkpoints.

However, the current test suite lacks specific tests that verify this behavior. To ensure the change works as intended and to prevent future regressions, please add tests that cover storage operations during dry runs.

  • Location to Add Tests:
    • tests/tests.py
Analysis chain

Ensure dry run behavior is consistent with PR objectives

The changes introduce a new condition to prevent storage downloads during dry runs. This aligns with the PR objective to "omit storage downloads during dryrun in workflows with checkpoints".

However, there are a few points to consider:

  1. The condition now includes both dry run and shared input/output checks.
  2. The indentation suggests that this condition applies to both the main process and remote execution scenarios.

To ensure this change doesn't introduce unintended side effects, please run the following verification:

The implementation looks correct and aligns with the PR objectives. However, please ensure that this change doesn't affect other parts of the workflow that might rely on storage downloads during dry runs.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any other occurrences of self.workflow.dryrun in the file
# to ensure consistent behavior across the DAG implementation
rg --type python 'self\.workflow\.dryrun' snakemake/dag.py

# Verify if there are any tests covering dry run scenarios with storage
rg --type python 'def test.*dryrun.*storage' tests/

Length of output: 176


Script:

#!/bin/bash
# Search for occurrences of self.workflow.dryrun in Python files within snakemake/dag.py
fd --extension py --type file | grep "^snakemake/dag\.py$" | xargs rg 'self\.workflow\.dryrun'

# Search for tests covering dry run scenarios with storage in the tests/ directory
fd --extension py --type file tests/ | xargs rg 'def test.*dryrun.*storage'

Length of output: 772


Script:

#!/bin/bash
# Search for occurrences of self.workflow.dryrun in snakemake/dag.py
fd --extension py --type file | grep "^snakemake/dag\.py$" | xargs rg 'self\.workflow\.dryrun'

# Search for tests covering dry run scenarios with storage in the tests/ directory using full paths
fd --extension py --type file --full-path 'tests/' | xargs rg 'def test.*dryrun.*storage'

Length of output: 454


Script:

#!/bin/bash
# Search for any tests related to dry run scenarios in the tests/ directory
fd --extension py --type file --full-path 'tests/' | xargs rg 'def test.*dryrun'

Length of output: 173

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f986e63 and c67ba83.

Files selected for processing (3)
  • tests/test_storage_noretrieve_dryrun/Snakefile (1 hunks)
  • tests/test_storage_noretrieve_dryrun/fs/test.txt (1 hunks)
  • tests/tests.py (2 hunks)
Files skipped from review due to trivial changes (1)
  • tests/test_storage_noretrieve_dryrun/fs/test.txt
Additional context used
Path-based instructions (1)
tests/tests.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Additional comments not posted (6)
tests/test_storage_noretrieve_dryrun/Snakefile (4)

2-3: LGTM: Storage configuration is appropriate for testing.

The storage configuration using the "fs" provider is suitable for testing filesystem operations in the context of dry runs.


6-8: LGTM: Rule all correctly defines the workflow's final output.

The rule all appropriately uses the storage() function to specify the final output, which is essential for testing storage behavior during dry runs.


18-24: LGTM: Rule bar correctly implements storage operations.

The rule bar appropriately uses the storage() function for both input and output, and the shell command correctly utilizes the {input} and {output} placeholders. This setup is well-suited for testing storage behavior during dry runs.


1-24: Overall assessment: Well-structured test case with a minor issue.

This Snakefile provides a well-structured test case for examining storage behavior during dry runs, which aligns perfectly with the PR objectives. The implementation includes appropriate use of storage configurations, checkpoints, and rules. However, there's a minor issue in the foo checkpoint where the shell command doesn't write to the correct storage location.

Once the checkpoint issue is addressed, this test case should effectively validate the changes related to omitting storage downloads during dry runs in workflows with checkpoints.

To ensure that this test case is properly integrated into the test suite, please run the following command:

This will help confirm that the new test case is referenced in the test suite configuration.

Verification successful

Test case is correctly included in the test suite.

The execution of the verification script confirms that test_storage_noretrieve_dryrun is referenced in tests/tests.py, ensuring the test case is properly integrated into the test suite.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify that the test case is included in the test suite
grep -r "test_storage_noretrieve_dryrun" tests

Length of output: 182

tests/tests.py (2)

12-12: New import added for mocking.

The AsyncMock class has been imported from the unittest.mock module. This is appropriate for mocking asynchronous functions or methods.


1471-1474: New test function added to simulate storage retrieval failure during dry run.

This test function, test_storage_noretrieve_dryrun, uses the @patch decorator to mock the retrieve_from_storage method of snakemake.io._IOFile. The mock is set to raise an exception, simulating a failure in storage retrieval. This is a good approach to test the behavior of Snakemake during a dry run when storage retrieval fails.

A few observations and suggestions:

  1. The test function name clearly describes its purpose.
  2. The use of AsyncMock is appropriate for mocking an asynchronous method.
  3. The test runs Snakemake in dry run mode, which is correct for this scenario.
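
The mocking pattern described here can be illustrated in isolation. The sketch below is not the test added in tests/tests.py: `FakeIOFile` and `plan` are hypothetical stand-ins, and only the idea (an `AsyncMock` that raises as soon as retrieval is awaited) is taken from the description above.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeIOFile:
    # Hypothetical stand-in for a storage-backed file object.
    async def retrieve_from_storage(self):
        return "downloaded"


async def plan(files, dryrun):
    # Toy planner: storage is only touched outside of dry runs.
    if not dryrun:
        for f in files:
            await f.retrieve_from_storage()
    return len(files)


def run_with_failing_storage(dryrun):
    # Replace retrieval with a mock that raises when awaited, so any
    # accidental download attempt turns into a visible failure.
    with patch.object(
        FakeIOFile,
        "retrieve_from_storage",
        new_callable=AsyncMock,
        side_effect=RuntimeError("storage must not be touched"),
    ) as mocked:
        try:
            asyncio.run(plan([FakeIOFile()], dryrun=dryrun))
        except RuntimeError:
            return False, mocked.await_count
        return True, mocked.await_count
```

With `dryrun=True` the planner finishes without awaiting the mock; with `dryrun=False` the mocked failure surfaces, which is the signal such a test relies on.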

To ensure that this test is correctly integrated and doesn't interfere with other tests, we can check for any related configurations or setup in the test directory:

Verification successful

Test Integration Verified Successfully

The new test function test_storage_noretrieve_dryrun is correctly integrated and does not interfere with other tests. No related configurations or modifications were found that could affect the test's behavior.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any related configurations or setup files
echo "Checking for related test configurations:"
find . -maxdepth 1 -name "test_storage_noretrieve_dryrun*"
echo "Checking for any modifications to test runner or configuration files:"
grep -n "test_storage_noretrieve_dryrun" *.py

Length of output: 366


Script:

#!/bin/bash
# Searching for related test configurations recursively
echo "Searching for related test configurations recursively:"
fd "test_storage_noretrieve_dryrun" . -t f --extension py

# Searching for any modifications to test runner or configuration files recursively
echo "Searching for modifications related to 'test_storage_noretrieve_dryrun' in Python files:"
rg -n "test_storage_noretrieve_dryrun" --type py

Length of output: 564

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@sonarqubecloud

@johanneskoester johanneskoester merged commit 151216a into main Sep 25, 2024
@johanneskoester johanneskoester deleted the fix/checkpoint-dryrun-no-storage-download branch September 25, 2024 08:01
johanneskoester pushed a commit that referenced this pull request Sep 25, 2024
🤖 I have created a release *beep* *boop*
---


## [8.20.5](v8.20.4...v8.20.5) (2024-09-25)


### Bug Fixes

* fixed check for remote conda env pinning and post-deploy files; fixed conda env cleanup ([#3103](#3103)) ([4d0a7e9](4d0a7e9))
* omit storage downloads during dryrun in workflows with checkpoints ([#3100](#3100)) ([151216a](151216a))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Not download file durring dry-run

1 participant