Skip to content

feat: more robust parameter and code change detection and transparent reporting of detected changes#3140

Merged
johanneskoester merged 13 commits intomainfrom
feat/params-change-detection-impr
Oct 14, 2024
Merged

feat: more robust parameter and code change detection and transparent reporting of detected changes#3140
johanneskoester merged 13 commits intomainfrom
feat/params-change-detection-impr

Conversation

@johanneskoester
Copy link
Copy Markdown
Contributor

@johanneskoester johanneskoester commented Oct 14, 2024

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • New Features

    • Enhanced tracking of job parameters within the Snakemake framework.
    • Introduced a new ParamsChange dataclass for improved clarity in parameter change tracking.
    • Added logging functions to track jobs with outdated or missing metadata.
    • Implemented automatic updates of Conda in the CI workflow.
  • Improvements

    • Simplified logic for job execution reasons, improving readability and maintainability.
    • Updated job parameter change reporting for more detailed insights during execution.
    • Enhanced visibility of job metadata status during workflow execution for better debugging.
    • Introduced new tests to validate parameter handling and metadata management.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Oct 14, 2024

Walkthrough

The changes made in the Snakemake framework involve enhancements to job parameter tracking and management across several files: dag.py, jobs.py, persistence.py, and workflow.py. The update_needrun method in dag.py has been refactored for improved logic and readability. The Reason class in jobs.py has been updated to better represent parameter changes, while a new ParamsChange dataclass has been introduced in persistence.py to encapsulate parameter modifications. Additionally, new logging functions have been added in workflow.py to improve visibility regarding job metadata status. Updates to the CI workflow and new tests have also been implemented.

Changes

File Path Change Summary
snakemake/dag.py Refactored update_needrun method in DAG class for improved logic and added checks for outdated metadata.
snakemake/jobs.py Updated Reason class to import NO_PARAMS_CHANGE, modified __init__ and __str__ methods for better parameter change representation.
snakemake/persistence.py Introduced ParamsChange dataclass and updated params_changed method to utilize this new class; added has_outdated_metadata method.
snakemake/workflow.py Added log_outdated_metadata_info, log_missing_metadata_info, and log_metadata_info methods to enhance logging during workflow execution.
.github/workflows/main.yml Updated CI workflow to include auto-update-conda: true for environment setup and adjusted formatting.
tests/test_params_outdated_code/.snakemake/metadata/c29tZWRpci90ZXN0Lm91dA== Introduced new JSON object defining a rule for testing.
tests/test_params_outdated_code/Snakefile Added a new Snakefile with tests for parameters syntax in a Snakemake workflow.
tests/tests.py Added test_params_outdated_code(mocker) function to test has_outdated_metadata functionality.

Possibly related PRs


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
snakemake/persistence.py (1)

444-454: Update the docstring for params_changed method

The params_changed method now returns a ParamsChange object instead of a boolean. Consider updating the method's docstring to reflect this change and provide guidance on how to interpret the returned value.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7e6e0f8 and 29598a1.

📒 Files selected for processing (3)
  • snakemake/dag.py (1 hunks)
  • snakemake/jobs.py (2 hunks)
  • snakemake/persistence.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/dag.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

snakemake/jobs.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

snakemake/persistence.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🔇 Additional comments (9)
snakemake/jobs.py (4)

1648-1649: LGTM: Import of NO_PARAMS_CHANGE

The import of NO_PARAMS_CHANGE from snakemake.persistence is a good addition. It suggests that the parameter change tracking has been refined to provide more detailed information.


1655-1655: Improved parameter change tracking

The initialization of self.params_changed with NO_PARAMS_CHANGE instead of a boolean value indicates a more nuanced approach to tracking parameter changes. This change allows for more detailed reporting of parameter modifications.


1783-1785: Enhanced parameter change reporting

The __str__ method now includes the actual value of self.params_changed in the output string. This provides more detailed information about the nature of parameter changes, improving the clarity of the reason for job re-execution.


Line range hint 1648-1785: Overall assessment: Improved parameter change handling

The changes in the Reason class enhance Snakemake's ability to track and report parameter changes more precisely. This improvement allows for more transparent reporting of detected changes, aligning well with the pull request's objectives. The modifications are well-implemented and do not introduce any apparent issues or inconsistencies.

snakemake/persistence.py (4)

7-7: Importing dataclass is necessary

The addition of dataclass from the dataclasses module is appropriate for defining the ParamsChange class later in the code.


20-20: Added typing imports for type annotations

The inclusion of Any, Optional, and Set from the typing module enhances type annotations and code readability.


26-26: Importing TBDString for parameter handling

The import of TBDString is necessary for handling unrepresentable parameters in the _serialize_param_builtin method.


515-535: Confirm the comparison with TBDString

In the _serialize_param_builtin method, you check param is not TBDString to exclude TBDString parameters. Ensure that this comparison correctly excludes all instances of TBDString. If TBDString can have multiple instances, consider using not isinstance(param, TBDString) instead.

snakemake/dag.py (1)

1307-1307: Direct assignment to reason.params_changed is appropriate

The direct assignment of reason.params_changed improves code clarity and is appropriate if self.workflow.persistence.params_changed(job) returns a boolean value.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
snakemake/persistence.py (1)

444-454: LGTM: Improved parameter change tracking

The refactoring of the params_changed method significantly improves its clarity and functionality. The use of the ParamsChange object provides a more structured way to track parameter changes across job executions.

Consider adding a comment explaining the purpose of the NO_PARAMS_CHANGE constant and its usage here. This would enhance the readability of the code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 29598a1 and 4f5907b.

📒 Files selected for processing (1)
  • snakemake/persistence.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/persistence.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🔇 Additional comments (4)
snakemake/persistence.py (4)

515-535: LGTM: Improved parameter serialization logic

The addition of TBDString to the condition in _serialize_param_builtin method ensures that it's treated as an unrepresentable parameter. This change aligns well with the new parameter serialization logic and improves the robustness of the method.


725-725: LGTM: Useful constant for no parameter changes

The addition of the NO_PARAMS_CHANGE constant provides a convenient and clear way to represent the absence of parameter changes. It's effectively used in the params_changed method as an initial value.


7-7: LGTM: Necessary imports added

The new imports (dataclass, field, Any, Optional, Set, and TBDString) have been correctly added to support the new ParamsChange dataclass, type hinting, and the updated _serialize_param_builtin method. These additions enhance the code's type safety and functionality.

Also applies to: 20-20, 26-26


Line range hint 1-725: Overall: Excellent enhancements to parameter change tracking and persistence

The changes in this file significantly improve the parameter change tracking and serialization logic in the Snakemake persistence module. The introduction of the ParamsChange dataclass and the refactoring of the params_changed method provide a more structured and robust approach to handling parameter changes across job executions.

Key improvements:

  1. Enhanced parameter serialization with TBDString handling
  2. Refactored params_changed method for better clarity and functionality
  3. Well-designed ParamsChange dataclass for encapsulating parameter changes
  4. Addition of the NO_PARAMS_CHANGE constant for convenience

These changes align well with the PR objective of enhancing the detection of parameter changes and ensuring more transparent reporting.

To ensure that the changes are consistent across the codebase, please run the following script to check for any other occurrences of params_changed that might need updating:

This will help verify that the new ParamsChange return value is handled correctly throughout the codebase.

✅ Verification successful

The occurrences of params_changed in snakemake/dag.py are consistent with the recent changes in persistence.py. Both instances correctly utilize the updated method signature:

  1. self.workflow.persistence.params_changed(job) – Uses the required job parameter.
  2. elif self.workflow.persistence.params_changed(job, file=f): – Appropriately provides both job and the optional file parameter.

These usages align with the enhancements made to the parameter change tracking and serialization logic, ensuring robust and clear handling of parameter changes across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other occurrences of params_changed that might need updating

# Search for params_changed method calls
rg "params_changed\(" --type python -g '!persistence.py'

Length of output: 89


Script:

#!/bin/bash
# Description: Check for other occurrences of params_changed that might need updating

# Search for params_changed method calls in Python files excluding persistence.py
rg "params_changed\(" --glob "*.py" --glob "!snakemake/persistence.py"

Length of output: 268


Script:

#!/bin/bash
# Description: Extract surrounding context for params_changed calls in snakemake/dag.py

# Extract lines around the params_changed calls for better context
rg "params_changed\(" --context 3 snakemake/dag.py

Length of output: 894

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

Please format your code with black: black snakemake tests/*.py.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)
snakemake/dag.py (2)

1309-1309: Consider evaluating parameter changes regardless of outdated metadata

Currently, parameter changes are checked only after verifying if the metadata is outdated. If parameters can change independently of metadata status, consider checking for parameter changes regardless of whether the metadata is outdated to ensure all relevant changes are detected.


Line range hint 1306-1317: Review the logical flow for rerun triggers

The code checks for outdated metadata first and then evaluates other rerun triggers like PARAMS, INPUT, CODE, and SOFTWARE_ENV. Ensure that this sequence correctly handles all scenarios where a job needs to be rerun, especially if some triggers might require action even when metadata is up-to-date.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4f5907b and 6298c9d.

📒 Files selected for processing (4)
  • snakemake/dag.py (1 hunks)
  • snakemake/jobs.py (3 hunks)
  • snakemake/persistence.py (11 hunks)
  • snakemake/workflow.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake/jobs.py
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/dag.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

snakemake/persistence.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

snakemake/workflow.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🔇 Additional comments (13)
snakemake/persistence.py (11)

7-7: Imports Added Correctly

The imports for dataclass, field, and the typing annotations (Any, Optional, Set) are appropriately added to support the new code changes.

Also applies to: 20-20


26-26: Import TBDString for Parameter Serialization

Importing TBDString ensures that parameters of this type are properly handled during serialization, preventing potential issues with unrepresentable objects.


36-36: Define RECORD_FORMAT_VERSION Constant

Defining the RECORD_FORMAT_VERSION as 3 establishes a clear versioning system for metadata records, facilitating future compatibility checks and migrations.


329-329: Include record_format_version in Metadata

Adding record_format_version to the metadata ensures that each record carries version information, which is essential for detecting outdated metadata and maintaining data integrity.


400-405: Add has_outdated_metadata Method

The has_outdated_metadata method correctly checks if any output files have metadata with a version older than the current RECORD_FORMAT_VERSION. This function enhances the system's ability to detect and handle outdated records.


427-429: Add record_format_version Method

The record_format_version method provides a convenient way to retrieve the metadata's format version for a given path, improving code readability and maintainability.


455-465: Refactor params_changed Method to Use ParamsChange

The params_changed method now utilizes the ParamsChange dataclass to accumulate parameter differences across output files. This refactoring enhances the clarity and robustness of parameter change detection.


477-480: Adjust _code_changed Method for Version Compatibility

By checking the record_format_version, the _code_changed method appropriately handles cases where reliable code comparisons are not possible due to version differences. This prevents false positives in code change detection.


508-512: Modify _code Method to Consider Only Shell Commands

Focusing on shell commands in the _code method is a sensible approach, as it avoids issues with pickling Python code across different versions. This change enhances the reliability of code tracking.


533-553: Update _serialize_param_builtin to Handle TBDString

By treating TBDString as unrepresentable, the _serialize_param_builtin method prevents potential serialization issues, ensuring that incomplete or placeholder strings do not cause errors during parameter serialization.


699-735: Implement ParamsChange Dataclass

The new ParamsChange dataclass is well-designed, with appropriate default initialization of only_old and only_new as empty sets. The methods __bool__, __or__, and __str__ are correctly implemented, providing useful functionality for parameter change detection and representation.

snakemake/dag.py (2)

1306-1307: Ensure has_outdated_metadata accurately detects outdated jobs

The new condition if self.workflow.persistence.has_outdated_metadata(job): introduces a check for outdated metadata. Confirm that the has_outdated_metadata method correctly identifies when a job's metadata is outdated to prevent unnecessary job executions or missing necessary updates.


Line range hint 1309-1317: Ensure comprehensive handling of all rerun triggers

Make sure that all specified rerun triggers (PARAMS, INPUT, CODE, SOFTWARE_ENV) are thoroughly checked and that reason is updated accordingly for each. This guarantees that any changes necessitating a job rerun are appropriately captured.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 6298c9d and 644d7d5.

📒 Files selected for processing (1)
  • snakemake/workflow.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/workflow.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (6)
snakemake/persistence.py (4)

400-405: LGTM: Added metadata version check

The has_outdated_metadata method is a valuable addition for identifying outdated metadata. This can be crucial for triggering updates or migrations when necessary.

Consider using any() with a generator expression for a slight optimization:

def has_outdated_metadata(self, job: Job) -> bool:
    return any(
        self.metadata(path).get("record_format_version", 0) < RECORD_FORMAT_VERSION
        for path in job.output
    )

This change would short-circuit the evaluation as soon as an outdated record is found.


455-465: LGTM: Improved parameter change detection

The refactoring of params_changed method significantly improves the clarity and functionality of parameter change detection. The use of the ParamsChange class allows for more detailed reporting of changes.

Consider using a set comprehension for a slight optimization:

new = {param for param in self._params(job)}

This change would create the set directly, potentially improving performance for larger parameter sets.


Line range hint 477-512: LGTM: Improved code change detection with version check

The changes to _code_changed method improve the reliability of code change detection by checking the record format version and focusing on shell commands. This avoids issues with Python version differences in code pickling.

Note: This change might miss code changes in pure Python rules. Consider adding a comment explaining this limitation and the rationale behind the decision.

🧰 Tools
🪛 Ruff

506-506: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)


514-514: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)


702-738: LGTM: Well-designed ParamsChange dataclass

The new ParamsChange dataclass is a well-designed addition that effectively encapsulates the concept of parameter changes. The methods for boolean evaluation, union operations, and string representation are comprehensive and well-implemented.

Consider adding type hints to the methods for improved clarity:

def __bool__(self) -> bool:
    return bool(self.only_old or self.only_new)

def __or__(self, other: 'ParamsChange') -> 'ParamsChange':
    # ... existing implementation ...

def __str__(self) -> str:
    # ... existing implementation ...

These type hints would enhance code readability and IDE support.

snakemake/workflow.py (2)

1322-1323: Good use of abstraction, but consider further refactoring.

The log_outdated_metadata_info method, like log_missing_metadata_info, effectively uses the generic log_metadata_info method. This approach is good for maintainability and follows the DRY principle.

However, given the similarity between log_missing_metadata_info and log_outdated_metadata_info, consider refactoring to reduce duplication further:

You could create a single method that handles both cases:

def log_metadata_status(self, status):
    self.log_metadata_info(f"{status}_metadata", status)

# Usage:
self.log_metadata_status("missing")
self.log_metadata_status("outdated")

This approach would eliminate the need for separate methods while maintaining flexibility.


1325-1349: Informative method, but consider refactoring for improved structure and consistency.

The log_provenance_info method provides valuable information about provenance-triggered jobs and guidance on modifying this behavior. However, there are a few suggestions for improvement:

  1. The method is quite long. Consider breaking it down into smaller, more focused methods for better readability and maintainability.

  2. String formatting is inconsistent. Stick to f-strings throughout for better readability and performance.

  3. The calls to log_missing_metadata_info and log_outdated_metadata_info at the end seem disconnected from the main purpose of this method. Consider moving these to a separate method or clarify their relationship to provenance info.

Here's a suggested refactoring:

def log_provenance_info(self):
    self._log_provenance_triggered_jobs()
    self._log_additional_metadata_info()

def _log_provenance_triggered_jobs(self):
    provenance_triggered_jobs = [
        job
        for job in self.dag.needrun_jobs(exclude_finished=False)
        if self.dag.reason(job).is_provenance_triggered()
    ]
    if provenance_triggered_jobs:
        logger.info(
            "Some jobs were triggered by provenance information, "
            "see 'reason' section in the rule displays above.\n"
            "If you prefer that only modification time is used to "
            "determine whether a job shall be executed, use the command "
            "line option '--rerun-triggers mtime' (also see --help).\n"
            "If you are sure that a change for a certain output file (say, <outfile>) won't "
            "change the result (e.g. because you just changed the formatting of a script "
            "or environment definition), you can also wipe its metadata to skip such a trigger via "
            "'snakemake --cleanup-metadata <outfile>'. "
        )
        logger.info(
            f"Rules with provenance triggered jobs: "
            f"{' '.join(jobs_to_rulenames(provenance_triggered_jobs))}"
        )
        logger.info("")

def _log_additional_metadata_info(self):
    self.log_missing_metadata_info()
    self.log_outdated_metadata_info()

This refactoring improves readability and separates concerns more clearly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 644d7d5 and ac491d0.

📒 Files selected for processing (2)
  • snakemake/persistence.py (11 hunks)
  • snakemake/workflow.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
snakemake/persistence.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

snakemake/workflow.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🔇 Additional comments (7)
snakemake/persistence.py (3)

7-7: LGTM: Improved imports and versioning

The new imports enhance type hinting, and the introduction of RECORD_FORMAT_VERSION suggests a good practice for managing data structure changes.

Also applies to: 20-20, 26-26, 31-31, 36-36


329-329: LGTM: Added record format versioning

Adding the record_format_version to the record is a good practice for maintaining backward compatibility and facilitating future data migrations.


427-429: LGTM: Added record format version getter

The record_format_version method provides a clean and convenient way to access the record format version for a specific path. This addition enhances the API for version-related operations.

snakemake/workflow.py (4)

1304-1317: Well-structured and efficient implementation.

The log_metadata_info method is a well-implemented helper function for logging metadata information about jobs. It uses efficient list comprehension for filtering jobs and modern f-strings for formatting the log message. The function is also generic, allowing it to handle different types of metadata attributes.


1319-1320: Good use of abstraction.

The log_missing_metadata_info method effectively uses the more generic log_metadata_info to handle the specific case of missing metadata. This approach follows the DRY principle and improves code maintainability.


Line range hint 1-1350: Summary: Valuable enhancements to metadata and provenance logging

The changes to snakemake/workflow.py introduce new logging capabilities for metadata and provenance information. These enhancements provide users with more detailed insights into job status and execution triggers. The implementation is generally well-structured and follows good coding practices.

Key points:

  1. New helper methods for logging metadata information are concise and reusable.
  2. The log_provenance_info method, while informative, could benefit from some refactoring for improved readability and maintainability.
  3. Integration of new logging calls in the execute method is appropriate, but performance impact should be considered for large workflows.

Overall, these changes should improve the user experience by providing more comprehensive information about the workflow execution process.


1247-1248: Improved logging with additional metadata and provenance information.

The new method calls self.log_missing_metadata_info(), self.log_outdated_metadata_info(), and self.log_provenance_info() enhance the logging capabilities of the workflow execution. They provide valuable information about the metadata status of jobs and provenance information.

The placement of these calls within the execute method is appropriate:

  1. Logging missing and outdated metadata when there's nothing to be done.
  2. Logging provenance information in the dryrun case.

Consider the performance impact of these new logging calls, especially for large workflows. It might be beneficial to add some basic performance metrics. You can use the following script to check the execution time of these new logging methods:

This will help ensure that the new logging doesn't introduce significant overhead, especially for large workflows.

Also applies to: 1263-1263

@johanneskoester johanneskoester changed the title feat: more robust parameter change detection and transparent reporting of detected changes feat: more robust parameter and code change detection and transparent reporting of detected changes Oct 14, 2024
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
tests/test_params_outdated_code/Snakefile (2)

10-13: LGTM: Well-structured rule with advanced Snakemake features.

This rule effectively demonstrates advanced Snakemake features:

  • Use of lambda function in params
  • Wildcard in output for dynamic file creation
  • Complex shell command for testing

The structure is correct and serves the purpose of testing params syntax well.

Minor suggestion: Consider adding a name to this rule for better readability and easier referencing in larger workflows.


1-13: Effective test file for Snakemake params syntax.

This Snakefile serves its purpose well as a test for params syntax in Snakemake:

  1. It demonstrates a simple rule structure with only input.
  2. It showcases a more complex rule with lambda functions in params, wildcards in output, and a multi-step shell command.

The file is concise yet covers different aspects of Snakemake syntax, making it a good unit test for the params feature. It aligns well with the PR objective of enhancing parameter change detection.

As the codebase grows, consider organizing these test Snakefiles into a structured test suite. This will help maintain test coverage and make it easier to run specific tests for different Snakemake features.

.github/workflows/main.yml (2)

67-67: LGTM! Consider version pinning for reproducibility.

The addition of auto-update-conda: true ensures that the CI environment uses the latest version of conda, which can include important bug fixes and improvements. This aligns well with the PR's objective of enhancing robustness.

For improved reproducibility, consider pinning the conda version to a specific release instead of always using the latest. This can be done by specifying a version in the miniforge-version parameter. For example:

- name: Setup snakemake environment
  uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: 4.10.3-10
    environment-file: test-environment-${{ matrix.py_ver }}.yml
    environment-name: snakemake
    auto-update-conda: false

This approach balances the benefits of using a recent, known-good version of conda with the stability and reproducibility of a fixed environment.


161-161: LGTM! Consider version pinning for reproducibility (Windows).

The addition of auto-update-conda: true for the Windows environment mirrors the change made for the Linux environment. This consistency is good for maintaining parity across different operating systems.

As suggested for the Linux environment, consider pinning the conda version for the Windows environment as well. This ensures consistency and reproducibility across both platforms. For example:

- name: Setup snakemake environment
  uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: 4.10.3-10
    environment-file: test-environment.yml
    environment-name: snakemake
    auto-update-conda: false

This approach would provide a consistent, reproducible environment for both Linux and Windows CI runs.

tests/tests.py (1)

267-271: New test function implemented correctly.

The test_params_outdated_code(mocker) function is well-implemented to test the behavior of the has_outdated_metadata method in the Persistence class. It correctly uses the mocker fixture to create a spy and runs the appropriate test case.

However, there's a minor improvement that can be made to the assertion:

Consider changing the assertion to use Python's built-in boolean comparison instead of comparing to True explicitly:

-    assert spy.spy_return == True
+    assert spy.spy_return

This change makes the code more idiomatic and slightly more efficient.

🧰 Tools
🪛 Ruff

271-271: Avoid equality comparisons to True; use if spy.spy_return: for truth checks

Replace with spy.spy_return

(E712)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between ac491d0 and d92f440.

⛔ Files ignored due to path filters (3)
  • tests/test_params_outdated_code/.snakemake/log/2024-10-14T123852.529917.snakemake.log is excluded by !**/*.log
  • tests/test_params_outdated_code/expected-results/somedir/test.out is excluded by !**/*.out
  • tests/test_params_outdated_code/somedir/test.out is excluded by !**/*.out
📒 Files selected for processing (4)
  • .github/workflows/main.yml (2 hunks)
  • tests/test_params_outdated_code/.snakemake/metadata/c29tZWRpci90ZXN0Lm91dA== (1 hunks)
  • tests/test_params_outdated_code/Snakefile (1 hunks)
  • tests/tests.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/tests.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff
tests/tests.py

271-271: Avoid equality comparisons to True; use if spy.spy_return: for truth checks

Replace with spy.spy_return

(E712)

🔇 Additional comments (7)
tests/test_params_outdated_code/Snakefile (2)

1-5: LGTM: Clear docstring and appropriate shell executable setting.

The docstring succinctly explains the purpose of the file, which is good for maintainability. Setting the shell executable to "bash" is a common and appropriate practice in Snakefiles.


7-8: Consider adding a name and output/action to the first rule.

While Snakemake allows unnamed rules, adding a name would improve readability and make it easier to reference this rule. Additionally, this rule only specifies an input, which is unusual. Typically, rules have at least an output or action. Consider adding these elements to make the rule's purpose clearer.

To ensure this rule is intentional and not missing any elements, you can run the following command:

This will help verify if this input file is used elsewhere in the workflow.

tests/test_params_outdated_code/.snakemake/metadata/c29tZWRpci90ZXN0Lm91dA== (4)

1-1: Verify timestamp accuracy

The starttime and endtime values (1728902332.536915 and 1728902332.541915) correspond to dates in 2024, which is in the future. This might indicate an issue with the system clock or how timestamps are being generated.

To check if this is an isolated incident or a pattern, run:

#!/bin/bash
# Search for timestamps in other metadata files
rg --type json '"starttime":\s*\d+\.\d+' .snakemake/metadata | head -n 10

1-1: Review empty input array and checksums

The input array and input_checksums object are both empty. While this might be intentional, it's worth verifying:

  1. If the rule truly doesn't require any input files.
  2. If input file integrity checks are being skipped unintentionally.

To check if this is common in other metadata files, run:

#!/bin/bash
# Count metadata files with empty input arrays and checksums
echo "Files with empty input:"
rg --type json '"input":\s*\[\]' .snakemake/metadata | wc -l
echo "Files with empty input_checksums:"
rg --type json '"input_checksums":\s*\{\}' .snakemake/metadata | wc -l

1-1: ⚠️ Potential issue

Review parameter formatting

The params array contains strings with single quotes: ["'-f'", "'somedir'"]. This unusual formatting might cause issues when these parameters are used.

Consider removing the single quotes:

-"params": ["'-f'", "'somedir'"],
+"params": ["-f", "somedir"],

To check if this is a consistent pattern in the codebase, run:

✅ Verification successful

Review Parameter Formatting Verified

The params array formatting issue is present in the specified file, and no similar patterns were found in other metadata files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar parameter formatting in other metadata files
rg --type json '"params":\s*\[.*'.*'.*\]' .snakemake/metadata

Length of output: 393


1-1: ⚠️ Potential issue

Caution: Potentially risky shell command

The shell command rm -r somedir; mkdir -p somedir; touch -f somedir/test.out involves removing and recreating a directory. This operation can be risky if not handled carefully, especially in a production environment.

Consider the following suggestions:

  1. Use safer alternatives like rm -rf to suppress errors if the directory doesn't exist.
  2. Add error checking or use set -e to ensure the script stops if any command fails.
  3. Consider using mktemp for creating temporary directories to avoid potential conflicts.

To verify the usage of this command across the codebase, run:

✅ Verification successful

Verification Completed: No Widespread Risk Identified

The search for rm -r commands revealed usage only in Dockerfile and Makefile files, which are appropriate contexts for such operations. No instances were found in Python files or other parts of the main codebase.

  • The rm -r command in the reviewed Snakemake metadata file appears to be an isolated case.
  • Ensure that such shell commands are used cautiously and are necessary for the workflow.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar rm -r commands in the codebase
rg --type python "rm\s+-r"

Length of output: 59


Script:

#!/bin/bash
# Search for similar rm -r commands in all files
rg "rm\s+-r"

# Alternatively, search specifically in Python files by extension
rg "rm\s+-r" --glob "*.py"

Length of output: 198

tests/tests.py (1)

16-16: Import statement added correctly.

The new import statement from snakemake.persistence import Persistence is correctly placed and necessary for the new test function.

@johanneskoester johanneskoester merged commit 576f588 into main Oct 14, 2024
@johanneskoester johanneskoester deleted the feat/params-change-detection-impr branch October 14, 2024 13:24
johanneskoester added a commit that referenced this pull request Oct 14, 2024
🤖 I have created a release *beep* *boop*
---


##
[8.23.0](v8.22.0...v8.23.0)
(2024-10-14)


### Features

* More robust parameter and code change detection and transparent
reporting of detected changes
([#3140](#3140)). For
setups using remote storage, this will lead to missing metadata records
because the key used for storing the metadata is now the remote storage
query instead of the path to the local copy. The reason is that the
latter can be user-specific and therefore previously could have let led
to e.g. different rerun behavior for different users.


### Documentation

* update installation docs to reflect changes in conda ecosystem
([#3141](#3141))
([e83b8aa](e83b8aa))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
@royjacobson
Copy link
Copy Markdown
Contributor

Hi Johannes! @johanneskoester

Is the change to _code permanent? Not having fine grained re-running of rules based on changed python code has really degraded my workflow, I just realized this commit is likely the reason for that. For some reason the mtime thing doesn't reliably work either, but even if it did I think it would just rerun all the rules in the same file which is unwanted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants