Skip to content

feat: add support for validation of polars dataframe and lazyframe#3262

Merged
johanneskoester merged 16 commits intosnakemake:mainfrom
fgvieira:polars_validate
Mar 11, 2025
Merged

feat: add support for validation of polars dataframe and lazyframe#3262
johanneskoester merged 16 commits intosnakemake:mainfrom
fgvieira:polars_validate

Conversation

@fgvieira
Copy link
Copy Markdown
Contributor

@fgvieira fgvieira commented Jan 22, 2025

Implement validation for polars DataFrame and LazyFrame.

Re-factoring the setting of default values.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • New Features

    • Enhanced sample data validation now supports both Pandas and Polars data frames for improved reliability and performance.
    • Introduced new methods for reading sample data using Polars, expanding data handling options.
    • Added support for executing Xonsh scripts within Snakemake workflows.
    • New rule added for running Python scripts with Conda environments.
    • New functionality for generating self-contained HTML reports, including default statistics and user-specified results.
    • New functions added for parsing input files and extracting checksums.
  • Bug Fixes

    • Improved error handling for validation failures, providing more specific error messages.
  • Documentation

    • Updated the sample metadata schema with new fields for replicate count and tissue origin, alongside a refined description for sample condition.
    • Clarified usage of conda environments and apptainer integration within Snakemake workflows.
    • Expanded guidance on generating, customizing, and sharing reports in Snakemake.
    • Added documentation for integrating Xonsh scripts into Snakemake rules.
    • Updated help text for the --keep-storage-local-copies argument to enhance clarity and usability.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 22, 2025

📝 Walkthrough

Walkthrough

The changes update test code and utility functions to support sample data handling using both Pandas and Polars libraries. In the tests (in the Snakefile), new methods for reading sample data into dictionaries and validating them against a schema are introduced. The utility module's validate function is enhanced to branch based on whether the data is a Pandas or Polars DataFrame, including refined error messages and record processing. The sample schema is updated by simplifying the 'condition' description and introducing new 'n' and 'tissue' properties.

Changes

Files Change Summary
tests/test_validate/[Snakefile, samples.schema.yaml] Added methods for reading sample data into dictionaries and validating them in the Snakefile; updated the schema by simplifying the 'condition' property and adding 'n' (integer) and 'tissue' (string) properties.
snakemake/utils.py Enhanced the validate function to support both Pandas and Polars DataFrames by introducing helper functions for validation, refining error handling, and adjusting control flow based on data type.
docs/snakefiles/[deployment.rst, reporting.rst, rules.rst] Updated documentation to clarify the usage of conda environments and apptainer integration, enhance report generation descriptions, and introduce Xonsh script integration in Snakemake rules.
snakemake/cli.py Modified the help text for the --keep-storage-local-copies argument to broaden its scope to include both input and output files.
snakemake/script/init.py Introduced a new class XonshScript for executing Xonsh scripts, updating language detection and error messages accordingly.
snakemake/workflow.py Modified the execute method to conditionally clean up storage objects based on the keep_storage_local setting.
tests/test_conda_run/[Snakefile, expected-results/test.txt, test_python_env.yaml, test_script_run.py] Introduced new rules and scripts for testing Conda deployment, including a new expected results file and environment configuration.
tests/test_script_xsh/[Snakefile, envs/xonsh.yaml, scripts/test.xsh] Created new files to define rules for workflows using Xonsh scripts and manage the associated Conda environment.
snakemake/ioutils/[init.py, input.py] Added new utility functions parse_input and extract_checksum to the input module for improved handling of input files and checksum extraction.
tests/test_ioutils/[Snakefile, expected-results/c/1.txt, expected-results/results/switchsomeswitch.columnsample.txt, samples.md5] Updated tests to verify the output of the extract_checksum function and introduced a new file for MD5 checksums associated with TSV files.

Possibly related PRs

  • test: extend validation test to modules #3343: The changes in the main PR enhance the functionality of the validate function in tests/test_validate/Snakefile to handle both polars and pandas, while the retrieved PR modifies the validate function in snakemake/utils.py to support similar functionality for both data frame types, indicating a direct relationship between the two.

Suggested reviewers

  • johanneskoester

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6ca342 and 82d5d50.

📒 Files selected for processing (19)
  • .github/workflows/main.yml (5 hunks)
  • docs/snakefiles/rules.rst (2 hunks)
  • snakemake/cli.py (4 hunks)
  • snakemake/workflow.py (4 hunks)
  • docs/snakefiles/rules.rst (1 hunks)
  • snakemake/cli.py (1 hunks)
  • snakemake/workflow.py (1 hunks)
  • .github/workflows/codespell.yml (1 hunks)
  • .github/workflows/docs.yml (2 hunks)
  • .github/workflows/main.yml (6 hunks)
  • docs/snakefiles/rules.rst (2 hunks)
  • snakemake/cli.py (1 hunks)
  • snakemake/ioutils/__init__.py (2 hunks)
  • snakemake/ioutils/input.py (1 hunks)
  • snakemake/workflow.py (1 hunks)
  • tests/test_ioutils/Snakefile (4 hunks)
  • tests/test_ioutils/expected-results/c/1.txt (1 hunks)
  • tests/test_ioutils/expected-results/results/switch~someswitch.column~sample.txt (1 hunks)
  • tests/test_ioutils/samples.md5 (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • tests/test_ioutils/expected-results/c/1.txt
  • tests/test_ioutils/expected-results/results/switchsomeswitch.columnsample.txt
  • tests/test_ioutils/samples.md5
🚧 Files skipped from review as they are similar to previous changes (5)
  • docs/snakefiles/rules.rst
  • snakemake/workflow.py
  • snakemake/workflow.py
  • docs/snakefiles/rules.rst
  • snakemake/cli.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake/ioutils/__init__.py
  • snakemake/cli.py
  • snakemake/ioutils/input.py
  • snakemake/workflow.py
🪛 Ruff (0.8.2)
snakemake/ioutils/input.py

16-16: Do not assign a lambda expression, use a def

Rewrite fix_file_name as a def

(E731)


30-30: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


30-30: Undefined name WorkflowError

(F821)

🪛 actionlint (1.7.4)
.github/workflows/main.yml

52-52: "exclude" section should not be empty

(syntax-check)


52-52: "exclude" section should not be empty

(syntax-check)

🪛 YAMLlint (1.35.1)
.github/workflows/main.yml

[error] 53-53: trailing spaces

(trailing-spaces)


[error] 218-218: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (9, 3.12, bash)
  • GitHub Check: testing (9, 3.11, bash)
  • GitHub Check: testing (8, 3.12, bash)
  • GitHub Check: testing (8, 3.11, bash)
  • GitHub Check: testing (7, 3.12, bash)
  • GitHub Check: testing (7, 3.11, bash)
  • GitHub Check: testing (6, 3.12, bash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (6, 3.11, bash)
  • GitHub Check: testing (5, 3.12, bash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (5, 3.11, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (4, 3.12, bash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (1)
  • GitHub Check: testing (1, 3.11, bash)
🔇 Additional comments (25)
snakemake/workflow.py (1)

1294-1295: Good adjustment to respect storage settings.

The added conditional check ensures that storage objects are only cleaned up when keep_storage_local is set to False. This aligns with user expectations when they explicitly want to keep storage objects locally.

snakemake/cli.py (4)

14-14: Clean import addition for caching module

The addition of this import statement is necessary to support the references to caching.LOCATION_ENVVAR used later in the help text.


48-48: Good fix: Uncommenting Batch import

The Batch class is already being used in the parse_batch function at line 243, so uncommented this import is an improvement to make the code more explicit about its dependencies.


363-365: Good fix for version string handling during testing

The added comments and the modified version string parsing correctly handle cases when versioneer returns version strings with a suffix during testing scenarios. Splitting at "+" ensures only the numeric part is considered.


481-481: Improved maintainability for cache environment variable reference

Replacing the hardcoded string $SNAKEMAKE_OUTPUT_CACHE with ${caching.LOCATION_ENVVAR} is a good practice as it centralizes the definition of the environment variable name, making it easier to maintain and update if needed.

snakemake/ioutils/__init__.py (2)

8-8: Adding new utility functions to module exports.

This line imports the new functions parse_input and extract_checksum from the newly created snakemake.ioutils.input module.


25-26: Making utility functions available globally.

These additions register the new utility functions in the global namespace, making them available throughout Snakemake workflows.

snakemake/ioutils/input.py (1)

1-9: Function implementation looks good.

The parse_input function is well-implemented, providing a clean way to parse input files with an optional custom parser function.

tests/test_ioutils/Snakefile (4)

11-13: Added test assertion for extract_checksum function.

The assertion verifies that extract_checksum correctly extracts the checksum for the specified file.


22-29: Configuration for using parse_input and extract_checksum in rule.

The rule now includes checksum file as input and uses parse_input with extract_checksum to process the file.


31-31: Updated shell command to output checksum.

The shell command now outputs the parsed checksum value instead of a static string.


58-58: Added trailing comma for consistent style.

Added a trailing comma to the input declaration, improving consistency in the code.

docs/snakefiles/rules.rst (2)

550-571: Added documentation for parse_input function.

The documentation clearly explains the purpose and usage of the parse_input function, including a practical example.


573-592: Added documentation for extract_checksum function.

The documentation explains how to use the extract_checksum function and includes a practical example showing how to extract checksums from a file.

.github/workflows/codespell.yml (1)

16-16: Conditional Execution: Validate the Conditional Expression

The added if condition ensures that the Codespell job will only run if the pull request is not merged or if the branch is not main. This aligns well with the objective of skipping checks for merged PRs targeting main. Please double-check that the logical OR is the intended behavior for your CI scenarios.

.github/workflows/docs.yml (2)

16-16: Docs Job Conditional Execution

The conditional statement on line 16 ensures that the documentation build for the docs job is skipped when the pull request is merged into main. This improves CI efficiency by avoiding redundant builds on merged PRs.


41-41: API Docs Job Conditional Execution

Similarly, the if condition on line 41 for the apidocs job reliably prevents the job from running when the pull request is merged into main. This maintains consistency across the workflow jobs.

.github/workflows/main.yml (8)

15-20: Formatting Job Conditional Check

The if condition on the formatting job (line 16) is consistent with the other workflow files. It ensures that formatting checks are skipped for merged PRs on the main branch. This looks correct and helps avoid unnecessary runs.


42-53: Matrix Configuration Adjustments

The updated testing matrix now clearly specifies multiple Python versions (3.11, 3.12) and shell environments (bash and dash), including an exclusion rule (using dash with Python 3.11). This setup is well structured. Please verify that the exclusion rule works as intended.

🧰 Tools
🪛 actionlint (1.7.4)

52-52: "exclude" section should not be empty

(syntax-check)

🪛 YAMLlint (1.35.1)

[error] 53-53: trailing spaces

(trailing-spaces)


88-91: APT Packages Installation Update

The installation command on line 91 now installs dash (and omits stress), which appears to align with your revised testing environments. Ensure that all package dependencies work as expected with this updated list.


100-106: MinIO Setup Integration

A new step to set up MinIO using comfuture/minio-action@v1 has been added. This integration step is clear and supports S3 bucket simulation well. Just verify that the access credentials and port configuration match your test requirements.


107-113: MinIO Testing Step

The MinIO test step correctly exports the necessary environment variables and attempts to create a bucket via the AWS CLI. It might be beneficial to consider cleanup after bucket creation if that’s required by your testing strategy.


121-123: Shell Adjustment Based on Matrix Value

The conditional shell change (lines 121–123) that switches the default shell to dash when specified in the matrix is correctly implemented. Ensure that this change operates reliably across all target environments.


156-165: Container Image Build Step

The job for building the container image (lines 156–165) includes the appropriate conditional checks and uses a simple Docker build command. Confirm that your Dockerfile and build context are correctly configured to match your project’s requirements.


211-218: Final Testing Job Confirmation

The testing-done job simply echoes a success message, which is appropriate for a final step ensuring all tests have passed. No issues here.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 218-218: no new line character at the end of file

(new-line-at-end-of-file)

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
tests/test_validate/Snakefile (1)

Line range hint 20-22: Critical: Fix incompatible DataFrame access in rule all.

The all rule uses samples.index which is pandas-specific syntax. This will fail with polars DataFrames which have a different API for accessing indices.

Consider either:

  1. Storing the pandas DataFrame in a separate variable, or
  2. Adapting the index access for polars:
rule all:
    input:
-        expand("test.{sample}.txt", sample=samples.index)
+        # Option 1: Keep pandas DataFrame for rule expansion
+        expand("test.{sample}.txt", sample=pd_samples.index)
+        # Option 2: Adapt for polars
+        expand("test.{sample}.txt", sample=samples["sample"].to_list())
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a75043 and 0735b19.

📒 Files selected for processing (1)
  • tests/test_validate/Snakefile (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: testing (10, 3.12)
  • GitHub Check: testing (10, 3.11)
  • GitHub Check: testing (9, 3.12)
  • GitHub Check: testing (9, 3.11)
  • GitHub Check: testing (8, 3.12)
  • GitHub Check: testing (8, 3.11)
  • GitHub Check: testing (7, 3.12)
  • GitHub Check: testing (7, 3.11)
  • GitHub Check: testing (6, 3.12)
  • GitHub Check: testing (6, 3.11)
  • GitHub Check: testing (5, 3.12)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (5, 3.11)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (4, 3.12)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (4, 3.11)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (3, 3.12)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (3, 3.11)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (2, 3.12)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (2, 3.11)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.12)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing (1, 3.11)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (2)
tests/test_validate/Snakefile (2)

4-4: LGTM! Import statement follows conventions.

The polars import is correctly placed and uses the conventional pl alias.


17-18: Verify LazyFrame validation support and consider performance implications.

The addition of LazyFrame validation raises several concerns:

  1. Verify that the validation utility properly supports LazyFrame objects, as they behave differently from eager DataFrames.
  2. Consider the performance impact of multiple validation calls on the same data.

Consider refactoring to:

  1. Add explicit tests for both DataFrame and LazyFrame validation
  2. Document the behavior differences between eager and lazy validation
-samples = pl.scan_csv(config["samples"])
-validate(samples, "samples.schema.yaml")
+# Test LazyFrame validation
+lazy_samples = pl.scan_csv(config["samples"])
+validate(lazy_samples, "samples.schema.yaml")

Let's verify the validation utility's LazyFrame support:

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed0ac9f and 58287f0.

📒 Files selected for processing (1)
  • snakemake/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)
snakemake/utils.py

138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

⏰ Context from checks skipped due to timeout of 90000ms (31)
  • GitHub Check: testing (10, 3.12)
  • GitHub Check: testing (10, 3.11)
  • GitHub Check: testing (9, 3.12)
  • GitHub Check: testing (9, 3.11)
  • GitHub Check: testing (8, 3.12)
  • GitHub Check: testing (8, 3.11)
  • GitHub Check: testing (7, 3.12)
  • GitHub Check: testing (7, 3.11)
  • GitHub Check: testing (6, 3.12)
  • GitHub Check: testing (6, 3.11)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (5, 3.12)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (5, 3.11)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (4, 3.12)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (4, 3.11)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (3, 3.12)
  • GitHub Check: testing (3, 3.11)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (2, 3.12)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (2, 3.11)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing (1, 3.12)
  • GitHub Check: testing (1, 3.11)
  • GitHub Check: testing-windows (1)
  • GitHub Check: apidocs

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58287f0 and bfc2fe2.

⛔ Files ignored due to path filters (1)
  • tests/test_validate/samples.tsv is excluded by !**/*.tsv
📒 Files selected for processing (4)
  • snakemake/utils.py (1 hunks)
  • tests/test_validate/Snakefile (1 hunks)
  • tests/test_validate/config.schema.yaml (1 hunks)
  • tests/test_validate/samples.schema.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_validate/config.schema.yaml
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)
snakemake/utils.py

141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


165-165: Comparison to None should be cond is None

Replace with cond is None

(E711)

⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: testing (10, 3.12)
  • GitHub Check: testing (10, 3.11)
  • GitHub Check: testing (9, 3.12)
  • GitHub Check: testing (9, 3.11)
  • GitHub Check: testing (8, 3.12)
  • GitHub Check: testing (8, 3.11)
  • GitHub Check: testing (7, 3.12)
  • GitHub Check: testing (7, 3.11)
  • GitHub Check: testing (5, 3.12)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (5, 3.11)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (3, 3.12)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing-windows (2)
🔇 Additional comments (9)
tests/test_validate/samples.schema.yaml (1)

9-17: LGTM! Schema changes look good.

The schema changes are well-structured:

  • Simplified description for condition property improves clarity
  • New properties n and tissue have appropriate types and default values
tests/test_validate/Snakefile (5)

4-4: LGTM! Import statement is correct.

The polars import is properly added.


14-24: LGTM! Polars DataFrame validation is well implemented.

The implementation:

  • Uses correct separator for TSV files
  • Defines appropriate schema with correct data types
  • Includes comprehensive assertions

26-34: LGTM! Polars LazyFrame validation is properly implemented.

The implementation:

  • Maintains consistency with DataFrame validation
  • Correctly handles lazy evaluation
  • Uses set_default=False appropriately as LazyFrames don't support default values

36-41: LGTM! Pandas DataFrame validations are consistent.

Both indexed and non-indexed DataFrame validations:

  • Maintain consistent assertions with polars implementations
  • Properly test the new schema properties

Also applies to: 54-59


43-52: LGTM! Dict validation provides good coverage.

The implementation:

  • Tests both complete and partial dictionaries
  • Properly handles null values
  • Validates schema defaults
snakemake/utils.py (3)

114-116: LGTM! Import and initialization are correct.

The polars import and records initialization are properly implemented.


117-128: LGTM! DataFrame type handling is well implemented.

Good implementation choices:

  • Clear logging messages for each DataFrame type
  • Efficient handling of LazyFrame by limiting to first 1000 rows
  • Proper type checking order

130-141: 🛠️ Refactor suggestion

Use exception chaining for better error tracking.

The error handling should use exception chaining to preserve the original traceback.

Apply this diff:

-                    raise WorkflowError(f"Error validating row {i} of data frame.", e)
+                    raise WorkflowError(f"Error validating row {i} of data frame.") from e

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
snakemake/utils.py (2)

114-128: LGTM! Good optimization for LazyFrame processing.

The implementation correctly handles different DataFrame types and optimizes LazyFrame processing by limiting it to the first 1000 rows.

However, consider making the error message more descriptive:

-                raise WorkflowError("Unsupported data type for validation.")
+                raise WorkflowError(f"Unsupported data type for validation. Expected pandas DataFrame, polars DataFrame, or polars LazyFrame, got {type(data)}.")

171-172: Improve error message for DataFrame validation.

The error message could be more descriptive to help with debugging.

Apply this diff:

-        raise WorkflowError("Error validating data frame.")
+        raise WorkflowError("Error validating data frame. Ensure the data is a pandas DataFrame, polars DataFrame, or polars LazyFrame.")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfc2fe2 and 1a5250c.

📒 Files selected for processing (1)
  • snakemake/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/utils.py (1)

Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

🪛 Ruff (0.8.2)
snakemake/utils.py

141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

⏰ Context from checks skipped due to timeout of 90000ms (31)
  • GitHub Check: testing (10, 3.12)
  • GitHub Check: testing (10, 3.11)
  • GitHub Check: testing (9, 3.12)
  • GitHub Check: testing (9, 3.11)
  • GitHub Check: testing (8, 3.12)
  • GitHub Check: testing (8, 3.11)
  • GitHub Check: testing (7, 3.12)
  • GitHub Check: testing (7, 3.11)
  • GitHub Check: testing (6, 3.12)
  • GitHub Check: testing (6, 3.11)
  • GitHub Check: testing (5, 3.12)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (5, 3.11)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (4, 3.12)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (4, 3.11)
  • GitHub Check: testing (3, 3.12)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (3, 3.11)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (2, 3.12)
  • GitHub Check: testing (2, 3.11)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.12)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing (1, 3.11)
  • GitHub Check: testing-windows (1)
  • GitHub Check: apidocs
🔇 Additional comments (2)
snakemake/utils.py (2)

143-167: 🛠️ Refactor suggestion

Optimize polars DataFrame updates using vectorized operations.

The current implementation uses nested loops for updating polars DataFrame values, which is inefficient. Consider using vectorized operations instead.

Apply this diff to improve performance:

-                    # Fill in None values with values from newdata
-                    for i in range(data.shape[0]):
-                        for j in range(data.shape[1]):
-                            if data[i, j] is None:
-                                data[i, j] = newdata[i, j]
+                    # Fill in None values with values from newdata using vectorized operations
+                    data = data.with_columns([
+                        pl.when(pl.col(col).is_null())
+                        .then(newdata[col])
+                        .otherwise(pl.col(col))
+                        .alias(col)
+                        for col in data.columns
+                    ])

Likely invalid or redundant comment.


130-141: 🛠️ Refactor suggestion

Add exception chaining to improve error traceability.

The exception handling should use proper chaining to preserve the original traceback.

Apply this diff:

-                    raise WorkflowError(f"Error validating row {i} of data frame.", e)
+                    raise WorkflowError(f"Error validating row {i} of data frame.") from e

Also, consider using polars.is_null() for polars DataFrames instead of pandas.isnull():

-                record = {k: v for k, v in record.items() if not pd.isnull(v)}
+                record = {k: v for k, v in record.items() if not (pd.isnull(v) if isinstance(data, pd.DataFrame) else pl.is_null(v))}

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

141-141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

@fgvieira fgvieira moved this from Ready to In progress in Snakemake Hackathon March 2025 Mar 10, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_validate/Snakefile (1)

50-57: Final overshadowing of ‘samples’ before using it in the rule input.
While this final assignment is needed to access samples.index, it’s best practice to store the earlier Polars-based objects under different names. This ensures clarity when referencing the final samples DataFrame for the rule all input.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a5250c and 89d0a9f.

📒 Files selected for processing (1)
  • tests/test_validate/Snakefile (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: testing (10, 3.12, dash)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (9, 3.12, dash)
  • GitHub Check: testing (9, 3.12, bash)
  • GitHub Check: testing (9, 3.11, bash)
  • GitHub Check: testing (8, 3.12, dash)
  • GitHub Check: testing (8, 3.12, bash)
  • GitHub Check: testing (8, 3.11, bash)
  • GitHub Check: testing (7, 3.12, dash)
  • GitHub Check: testing (7, 3.12, bash)
  • GitHub Check: testing (7, 3.11, bash)
  • GitHub Check: testing (6, 3.12, dash)
  • GitHub Check: testing (6, 3.12, bash)
  • GitHub Check: testing (6, 3.11, bash)
  • GitHub Check: testing (5, 3.12, dash)
  • GitHub Check: testing (5, 3.12, bash)
  • GitHub Check: testing (5, 3.11, bash)
  • GitHub Check: testing (4, 3.12, dash)
  • GitHub Check: testing (4, 3.12, bash)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (3, 3.12, dash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (2, 3.12, dash)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (1, 3.12, dash)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.11, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (5)
tests/test_validate/Snakefile (5)

4-4: All good on the import statement.
This import is necessary for the Polars functionality below. No concerns here.


10-20: Repeated overshadowing of the 'samples' variable and schema verification needed.
We’ve previously noted that reassigning samples to different data types (Polars DF, Polars LazyFrame, Pandas DF, dict) can be confusing and prone to errors, particularly if you accidentally rely on old references. Additionally, the code checks for 'tissue' but only 'sample', 'condition', and 'n' appear in the Polars schema. Ensure your YAML schema includes 'tissue' or confirm that Polars gracefully infers it.
[duplicate_comment, request_verification]


22-31: Variable overshadowing repeated with Polars LazyFrame.
You are again overwriting samples, making it difficult to track which DataFrame or LazyFrame is the active one. Consider using separate variables (e.g. polars_df, lazy_df, etc.) to enhance readability and avoid pitfalls.


32-37: Overwriting the ‘samples’ variable again for Pandas usage.
Reusing the ‘samples’ name makes the code more difficult to maintain and debug. A more descriptive variable name (e.g. pandas_df_noindex) can help distinguish it from the Polars data structures above.


39-48: Assigning a dict to ‘samples’ continues the overshadowing pattern.
Switching ‘samples’ from a DataFrame to a dict within the same file can be confusing. Keep each data structure in its own variable to minimize accidental mix-ups.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
snakemake/utils.py (3)

138-138: 🛠️ Refactor suggestion

Use exception chaining for better traceback information.

When re-raising an exception, use from to preserve the original traceback information.

-            raise WorkflowError(f"Error validating row {i} of data frame.", e)
+            raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


156-156: 🛠️ Refactor suggestion

Use exception chaining for better traceback information.

When re-raising an exception, use from to preserve the original traceback information.

-            raise WorkflowError(f"Error validating row {i} of data frame.", e)
+            raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

156-156: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


184-184: 🛠️ Refactor suggestion

Use exception chaining for better traceback information.

When re-raising an exception, use from to preserve the original traceback information.

-            raise WorkflowError(f"Error validating row {i} of data frame.", e)
+            raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

184-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🧹 Nitpick comments (1)
snakemake/utils.py (1)

158-174: Consider optimizing Polars DataFrame updates with vectorized operations.

The current implementation iterates through all cells in the DataFrame, which could be inefficient for large datasets. Consider using Polars' vectorized operations for better performance.

    if set_default:
        newdata = pl.DataFrame(recordlist)
        # Add missing columns
        newcol = [col for col in newdata.columns if col not in data.columns]
        [
            data.insert_column(
                len(data.columns),
                pl.lit(None, newdata[col].dtype).alias(col),
            )
            for col in newcol
        ]
        # Fill in None values with values from newdata
-        for i in range(data.shape[0]):
-            for j in range(data.shape[1]):
-                if data[i, j] is None:
-                    data[i, j] = newdata[i, j]
+        # Use vectorized operations to update null values
+        for col in data.columns:
+            if col in newdata.columns:
+                data = data.with_columns([
+                    pl.when(pl.col(col).is_null())
+                    .then(newdata[col])
+                    .otherwise(pl.col(col))
+                    .alias(col)
+                ])
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89d0a9f and 1fa617b.

📒 Files selected for processing (1)
  • snakemake/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py

128-128: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


156-156: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


184-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: testing (10, 3.12, dash)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (9, 3.12, dash)
  • GitHub Check: testing (9, 3.12, bash)
  • GitHub Check: testing (9, 3.11, bash)
  • GitHub Check: testing (8, 3.12, dash)
  • GitHub Check: testing (8, 3.12, bash)
  • GitHub Check: testing (8, 3.11, bash)
  • GitHub Check: testing (7, 3.12, dash)
  • GitHub Check: testing (7, 3.12, bash)
  • GitHub Check: testing (7, 3.11, bash)
  • GitHub Check: testing (6, 3.12, dash)
  • GitHub Check: testing (6, 3.12, bash)
  • GitHub Check: testing (6, 3.11, bash)
  • GitHub Check: testing (5, 3.12, dash)
  • GitHub Check: testing (5, 3.12, bash)
  • GitHub Check: testing (5, 3.11, bash)
  • GitHub Check: testing (4, 3.12, dash)
  • GitHub Check: testing (4, 3.12, bash)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (3, 3.12, dash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (2, 3.12, dash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing (1, 3.12, dash)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.11, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (3)
snakemake/utils.py (3)

114-122: Good refactoring of validation logic!

The extraction of common validation logic into a helper function is a good design choice, making the code more maintainable and reducing duplication.


148-157: Good implementation of Polars DataFrame validation!

The validation logic for Polars DataFrames follows a similar pattern to the pandas implementation, using appropriate Polars-specific methods like iter_rows(named=True).

🧰 Tools
🪛 Ruff (0.8.2)

156-156: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


175-188: Good implementation of LazyFrame validation with reasonable limitations.

Limiting validation to the first 1000 rows is a sensible approach for potentially large LazyFrames. The warning about not supporting default values is appropriate since LazyFrames are immutable.

🧰 Tools
🪛 Ruff (0.8.2)

184-184: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
snakemake/utils.py (3)

148-161: Use exception chaining when validating polars DataFrames

Continue the pattern of using exception chaining for better error diagnostics.

Apply this diff to improve error handling:

-                raise WorkflowError(f"Error validating row {i} of data frame.", e)
+                raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

160-160: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


162-178: Consider using vectorized operations for polars DataFrame updates

The current implementation uses nested loops to update the polars DataFrame, which could be inefficient for large datasets. Polars is designed for vectorized operations.

Consider this alternative approach using vectorized operations:

if set_default:
    newdata = pl.DataFrame(recordlist)
    # Add missing columns
    for col in [c for c in newdata.columns if c not in data.columns]:
        data = data.with_columns([
            pl.lit(None, newdata[col].dtype).alias(col)
        ])
    # Update existing values
    data = data.with_columns([
        pl.when(pl.col(col).is_null())
        .then(pl.col(col).fill_null(newdata[col]))
        .otherwise(pl.col(col))
        .alias(col)
        for col in data.columns
    ])

179-193: Use exception chaining for LazyFrame validation

Consistent with previous comments, use exception chaining here as well.

Apply this diff:

-                raise WorkflowError(f"Error validating row {i} of data frame.", e)
+                raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

192-192: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d10cf0b and 8267350.

📒 Files selected for processing (1)
  • snakemake/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py

126-126: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


160-160: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


192-192: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

⏰ Context from checks skipped due to timeout of 90000ms (23)
  • GitHub Check: testing (10, 3.12, dash)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing (3, 3.12, dash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (2, 3.12, dash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (1, 3.12, dash)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.11, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (6)
snakemake/utils.py (6)

111-113: Ensure polars is correctly imported

The import statement now includes both pandas and polars, which is necessary for handling both types of DataFrames. The code follows past review comments by correctly importing polars as pl.


114-120: Good refactoring with new helper function

Extracting the validation logic into _validate_record helper function improves code maintainability by reducing duplication across different data types.


121-126: Use exception chaining for better error diagnostics

When re-raising exceptions, it's best practice to use exception chaining to preserve the original traceback.

Apply this diff to improve error handling:

-        raise WorkflowError("Error validating config file.", e)
+        raise WorkflowError("Error validating config file.") from e
🧰 Tools
🪛 Ruff (0.8.2)

126-126: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


128-138: Use exception chaining when validating pandas DataFrames

Similar to the previous comment, use exception chaining for better error diagnostics.

Apply this diff to improve error handling:

-            raise WorkflowError(f"Error validating row {i} of data frame.", e)
+            raise WorkflowError(f"Error validating row {i} of data frame.") from e
🧰 Tools
🪛 Ruff (0.8.2)

138-138: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


194-196: Good handling of LazyFrame limitations

The code correctly warns that LazyFrames don't support setting default values, which is a clear limitation for users.


197-198: Appropriate error handling for unsupported data types

The error message clearly communicates that the provided data type is not supported for validation.

@fgvieira fgvieira moved this from In progress to In review in Snakemake Hackathon March 2025 Mar 10, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/test_validate/Snakefile (1)

14-25: Consider renaming the variable to avoid confusion.
Repeatedly overwriting samples with different DataFrame types (Polars DataFrame here, then LazyFrame, then Pandas) can be confusing and error-prone. If you need to switch between data types, consider storing them in separate variables (e.g., pl_samples, pl_lazy_samples, pd_samples) before validation.

-# Polars DataFrame
-samples = pl.read_csv(
+pl_samples = pl.read_csv(
    config["samples"],
    separator="\t",
    ...
)
-validate(samples, "samples.schema.yaml")
+validate(pl_samples, "samples.schema.yaml")
snakemake/utils.py (3)

118-149: Use exception chaining in _validate_pandas().
In lines 132-134, re-raising the ValidationError without chaining hides the original traceback. Prefer raise WorkflowError(...) from e to preserve context.

except jsonschema.exceptions.ValidationError as e:
-    raise WorkflowError(
-        f"Error validating row {i} of data frame.", e
-    )
+    raise WorkflowError(
+        f"Error validating row {i} of data frame."
+    ) from e
🧰 Tools
🪛 Ruff (0.8.2)

132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


150-218: Several improvements recommended for _validate_polars().

  1. Use exception chaining when re-raising at lines 168-170 and 206-208.
  2. Consider vectorized updates for setting default values instead of row-by-row loops at lines 172-188, which can be slow for large data.
# Example for exception chaining:
except jsonschema.exceptions.ValidationError as e:
-    raise WorkflowError(f"Error validating row {i} of data frame.", e)
+    raise WorkflowError(f"Error validating row {i} of data frame.") from e
# Example for vectorized filling of missing values:
- for i in range(data.shape[0]):
-     for j in range(data.shape[1]):
-         if data[i, j] is None:
-             data[i, j] = newdata[i, j]
+ # Instead, construct a mask of null rows and columns, then fill them with newdata columns
+ # or consider using pl.when/pl.col for a more efficient approach.
🧰 Tools
🪛 Ruff (0.8.2)

168-170: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


206-208: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


219-226: Use exception chaining in the dictionary validation block as well.
At line 224, re-raising the exception with raise WorkflowError(...) from e is recommended to preserve the original traceback.

except jsonschema.exceptions.ValidationError as e:
-    raise WorkflowError("Error validating config file.", e)
+    raise WorkflowError("Error validating config file.") from e
🧰 Tools
🪛 Ruff (0.8.2)

224-224: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8267350 and 669ed52.

📒 Files selected for processing (13)
  • tests/test_validate/Snakefile (2 hunks)
  • tests/test_validate/Snakefile (1 hunks)
  • snakemake/utils.py (1 hunks)
  • snakemake/utils.py (2 hunks)
  • tests/test_validate/Snakefile (1 hunks)
  • snakemake/utils.py (1 hunks)
  • tests/test_validate/Snakefile (2 hunks)
  • snakemake/utils.py (1 hunks)
  • snakemake/utils.py (2 hunks)
  • snakemake/utils.py (5 hunks)
  • snakemake/utils.py (2 hunks)
  • snakemake/utils.py (1 hunks)
  • tests/test_validate/Snakefile (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • snakemake/utils.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • snakemake/utils.py
  • tests/test_validate/Snakefile
  • snakemake/utils.py
  • tests/test_validate/Snakefile
  • tests/test_validate/Snakefile
  • snakemake/utils.py
  • snakemake/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake/utils.py
🪛 Ruff (0.8.2)
snakemake/utils.py

132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


168-170: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


132-134: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


168-170: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


206-208: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


224-224: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: testing (10, 3.12, dash)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (9, 3.12, dash)
  • GitHub Check: testing (9, 3.12, bash)
  • GitHub Check: testing (9, 3.11, bash)
  • GitHub Check: testing (8, 3.12, dash)
  • GitHub Check: testing (8, 3.12, bash)
  • GitHub Check: testing (8, 3.11, bash)
  • GitHub Check: testing (7, 3.12, dash)
  • GitHub Check: testing (7, 3.12, bash)
  • GitHub Check: testing (7, 3.11, bash)
  • GitHub Check: testing (6, 3.12, dash)
  • GitHub Check: testing (6, 3.12, bash)
  • GitHub Check: testing (6, 3.11, bash)
  • GitHub Check: testing (5, 3.12, dash)
  • GitHub Check: testing (5, 3.12, bash)
  • GitHub Check: testing (5, 3.11, bash)
  • GitHub Check: testing (4, 3.12, dash)
  • GitHub Check: testing (4, 3.12, bash)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing (3, 3.12, dash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (2, 3.12, dash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing (1, 3.12, dash)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.11, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (5)
tests/test_validate/Snakefile (5)

26-35: Validate large LazyFrames with caution.
Reading 1000 rows from a large LazyFrame is reasonable for quick checks. If you need more thorough validation, consider an additional mechanism to sample or partition the data to avoid excessive memory usage.


36-41: Polars to Pandas transition looks good.
No issues spotted. The validation flow for the Pandas DataFrame without an index is clear and straightforward.


43-52: Dictionary validation approach is effective.
Validating individual rows as dictionaries is a good complement to DataFrame-level checks.


54-59: Retaining the index in Pandas DataFrame is consistent.
The logic to retain and validate the “sample” column by setting it as the DataFrame index works properly. No concerns here.


64-64:

Details

❓ Verification inconclusive

Confirm final assignment of samples.
Even though you reassign samples multiple times, you ultimately rely on samples.index here. Please verify that this references the Pandas DataFrame with index from line 55 and not an unintended earlier version.


🏁 Script executed:

#!/bin/bash
# Ensure references to `samples.index` come after the final assignment to Pandas.
rg -A 3 'samples.index'

Length of output: 231


Action Required: Verify that the final assignment of samples is correctly used

I didn’t find any occurrence of an outdated samples.index reference in the test file—the only instance appears in the documentation at docs/snakefiles/configuration.rst. Please double-check that the final assignment in tests/test_validate/Snakefile (particularly the DataFrame setup on line 55) is the one that’s ultimately used (i.e. that later references to .index come from this final, correct assignment rather than an earlier version).

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
tests/tests_using_conda.py (1)

332-345: ⚠️ Potential issue

Remove duplicated test functions

The test functions test_get_log_both, test_get_log_stderr, test_get_log_stdout, and test_get_log_complex are duplicated in this file. The same functions appear earlier at lines 318-331. These duplicates should be removed to prevent confusion and potential testing issues.

-def test_get_log_both():
-    run(dpath("test_get_log_both"))
-
-
-def test_get_log_stderr():
-    run(dpath("test_get_log_stderr"))
-
-
-def test_get_log_stdout():
-    run(dpath("test_get_log_stdout"))
-
-
-def test_get_log_complex():
-    run(dpath("test_get_log_complex"))
🧹 Nitpick comments (3)
snakemake/script/__init__.py (1)

56-102: Well-implemented ReportHref class for report hyperlink management.

The class provides a clean interface for creating and manipulating report hyperlinks, with methods for child paths, URL arguments, and anchors. The implementation properly handles the hierarchical relationships and URL encoding.

The code in lines 97-100 can be simplified using a ternary operator:

-        if self._anchor:
-            anchor = f"#{urllib.parse.quote(self._anchor)}"
-        else:
-            anchor = ""
+        anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""
🧰 Tools
🪛 Ruff (0.8.2)

97-100: Use ternary operator anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else "" instead of if-else-block

Replace if-else-block with anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""

(SIM108)

snakemake/workflow.py (1)

127-127: Fix unused import in module.

The api module is imported but not used anywhere in the file.

-from snakemake import api, caching, sourcecache
+from snakemake import caching, sourcecache
🧰 Tools
🪛 Ruff (0.8.2)

127-127: snakemake.api imported but unused

Remove unused import: snakemake.api

(F401)

docs/snakefiles/reporting.rst (1)

137-140: Dynamic Metadata for Report Items:
Allowing dynamic determination of category, subcategory, and labels through functions adds useful flexibility to report configuration. To further benefit users, consider adding a short code example demonstrating a function that returns these values, along with any type expectations.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 669ed52 and f60b8fd.

⛔ Files ignored due to path filters (1)
  • tests/test_script_xsh/expected-results/test.out is excluded by !**/*.out
📒 Files selected for processing (23)
  • docs/snakefiles/deployment.rst (4 hunks)
  • docs/snakefiles/reporting.rst (4 hunks)
  • docs/snakefiles/rules.rst (2 hunks)
  • snakemake/cli.py (4 hunks)
  • snakemake/script/__init__.py (4 hunks)
  • snakemake/workflow.py (4 hunks)
  • tests/tests_using_conda.py (1 hunks)
  • docs/snakefiles/deployment.rst (2 hunks)
  • docs/snakefiles/reporting.rst (5 hunks)
  • docs/snakefiles/rules.rst (1 hunks)
  • snakemake/cli.py (1 hunks)
  • snakemake/script/__init__.py (3 hunks)
  • snakemake/workflow.py (1 hunks)
  • tests/test_conda_python_3_7_script/Snakefile (1 hunks)
  • tests/test_conda_python_3_7_script/test_script_python_3_7.py (1 hunks)
  • tests/test_conda_run/Snakefile (1 hunks)
  • tests/test_conda_run/expected-results/test.txt (1 hunks)
  • tests/test_conda_run/test_python_env.yaml (1 hunks)
  • tests/test_conda_run/test_script_run.py (1 hunks)
  • tests/test_script_xsh/Snakefile (1 hunks)
  • tests/test_script_xsh/envs/xonsh.yaml (1 hunks)
  • tests/test_script_xsh/scripts/test.xsh (1 hunks)
  • tests/tests_using_conda.py (1 hunks)
✅ Files skipped from review due to trivial changes (6)
  • tests/test_conda_run/expected-results/test.txt
  • tests/test_conda_python_3_7_script/test_script_python_3_7.py
  • tests/test_script_xsh/scripts/test.xsh
  • tests/test_conda_run/test_script_run.py
  • tests/test_conda_run/test_python_env.yaml
  • tests/test_script_xsh/envs/xonsh.yaml
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • tests/tests_using_conda.py
  • snakemake/cli.py
  • snakemake/workflow.py
  • snakemake/script/__init__.py
🪛 Ruff (0.8.2)
snakemake/workflow.py

1812-1812: Avoid inequality comparisons to False; use if ruleinfo.container_img: for truth checks

Replace with ruleinfo.container_img

(E712)


127-127: snakemake.api imported but unused

Remove unused import: snakemake.api

(F401)

snakemake/script/__init__.py

97-100: Use ternary operator anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else "" instead of if-else-block

Replace if-else-block with anchor = f"#{urllib.parse.quote(self._anchor)}" if self._anchor else ""

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: testing (10, 3.12, dash)
  • GitHub Check: testing (10, 3.12, bash)
  • GitHub Check: testing (10, 3.11, bash)
  • GitHub Check: testing (9, 3.12, dash)
  • GitHub Check: testing (9, 3.12, bash)
  • GitHub Check: testing (9, 3.11, bash)
  • GitHub Check: testing (8, 3.12, dash)
  • GitHub Check: testing (8, 3.12, bash)
  • GitHub Check: testing (8, 3.11, bash)
  • GitHub Check: testing (7, 3.12, dash)
  • GitHub Check: testing (7, 3.12, bash)
  • GitHub Check: testing (7, 3.11, bash)
  • GitHub Check: testing (6, 3.12, dash)
  • GitHub Check: testing (6, 3.12, bash)
  • GitHub Check: testing (6, 3.11, bash)
  • GitHub Check: testing (5, 3.12, dash)
  • GitHub Check: testing (5, 3.12, bash)
  • GitHub Check: testing (5, 3.11, bash)
  • GitHub Check: testing (4, 3.12, dash)
  • GitHub Check: testing (4, 3.12, bash)
  • GitHub Check: testing (4, 3.11, bash)
  • GitHub Check: testing-windows (10)
  • GitHub Check: testing (3, 3.12, dash)
  • GitHub Check: testing-windows (9)
  • GitHub Check: testing (3, 3.12, bash)
  • GitHub Check: testing (3, 3.11, bash)
  • GitHub Check: testing-windows (8)
  • GitHub Check: testing (2, 3.12, dash)
  • GitHub Check: testing-windows (7)
  • GitHub Check: testing (2, 3.12, bash)
  • GitHub Check: testing-windows (6)
  • GitHub Check: testing (2, 3.11, bash)
  • GitHub Check: testing-windows (5)
  • GitHub Check: testing-windows (4)
  • GitHub Check: testing (1, 3.12, dash)
  • GitHub Check: testing-windows (3)
  • GitHub Check: testing (1, 3.12, bash)
  • GitHub Check: testing-windows (2)
  • GitHub Check: testing (1, 3.11, bash)
  • GitHub Check: testing-windows (1)
🔇 Additional comments (25)
tests/test_script_xsh/Snakefile (1)

1-13: Looks good!

This Snakefile is well-structured and follows standard Snakemake conventions. It defines a simple workflow with two rules: all which specifies the final output, and test_xonsh which uses a conda environment and executes a script.

tests/test_conda_python_3_7_script/Snakefile (1)

7-7: Appropriate script file name update.

The change from "test_script.py" to "test_script_python_3_7.py" clearly indicates the Python version to be used, which improves clarity.

tests/test_conda_run/Snakefile (1)

1-14: Well-structured test rule with helpful comments.

The rule is properly defined with appropriate output, conda, params and run directives. The comments clearly explain that this pattern is for testing purposes only and provides guidance about what should be used in real workflows.

docs/snakefiles/deployment.rst (3)

425-425: Correct capitalization of Docker.

The change properly capitalizes "Docker" as it's a proper noun/brand name.


451-454: Valuable clarification about container isolation behavior.

This addition provides important information about how environment variable handling differs between container engines (Docker vs Apptainer/Singularity), which will help users understand and troubleshoot container behavior.


640-645: Enhanced documentation about global workflow dependencies.

The additions to this section improve the documentation about global dependencies by including more examples and clarifying the relationship with storage plugins.

Also applies to: 662-667

snakemake/script/__init__.py (4)

23-26: Appropriate imports added for new ReportHref functionality.

The new imports are correctly added to support the ReportHref class implementation: Dict for type hints, urllib.parse for URL encoding, and TypeVar for creating a self-reference type for Python <3.11 compatibility.


42-42: Import of get_report_id is appropriately added.

This import is necessary for the report href functionality to generate unique identifiers for reports.


52-54: Good backwards compatibility approach with TypeVar.

Using a TypeVar bound to the ReportHref class provides compatibility with Python versions before 3.11 where typing.Self isn't available, ensuring the class works across different Python versions.


133-141: Well-documented method for generating report links.

The report_href method provides a convenient way to create hyperlinks to paths within the report context, with good documentation explaining its purpose and the possibility of extending to child paths.

snakemake/workflow.py (2)

57-57: Appropriate addition of MissingOutputFileCachePathException to imports.

This exception is now properly imported to handle cases where cache path is missing in the _prepare_dag method.


689-708: Improved error handling for output file cache initialization.

This change enhances the robustness of the code by gracefully handling missing cache path scenarios. Instead of failing with an exception, it now catches the MissingOutputFileCachePathException and provides a helpful warning message to the user with clear instructions on how to resolve the issue.

The message includes the specific environment variable that needs to be set (caching.LOCATION_ENVVAR), showing good user experience considerations.

docs/snakefiles/rules.rst (2)

923-927: Enhanced GPU Resource Definitions:
The new definitions for gpu, gpu_manufacturer, and gpu_model significantly clarify how GPU resources should be specified. Please double-check that the renaming from “gpus” to “gpu” is propagated consistently across all examples and explanatory text in the document.


972-973: Updated Resource Scopes Example:
The revised example using
  $ snakemake --set-resource-scopes gpu=global disk_mb=local
improves clarity regarding resource categorization. Please verify that this example aligns correctly with the new GPU resource naming conventions throughout the documentation.

docs/snakefiles/reporting.rst (10)

7-10: Updated Report Generation Overview:
The introduction now clearly distinguishes between self-contained HTML reports and ZIP archives, outlining their purposes and benefits since Snakemake 5.1. This update provides a succinct overview of report capabilities. Consider adding links to more detailed examples or further documentation if available.


13-16: New "Including Results in a Report" Section:
The addition of this section is valuable—it outlines how output files must be marked with the report flag, specifying captions and categories to guide report assembly. This extra guidance will help users better annotate their workflows for reporting purposes.


112-127: Enhanced Example with File Labels:
In the modified example for rule b, the inclusion of a labels dictionary adds a higher level of abstraction by enabling descriptive names instead of raw filenames. Please ensure that the placeholder (e.g. {model}) is rendered correctly at runtime and consider including a brief note on expected value types if needed.


142-147: Linking from Captions Clarification:
The new subsection on linking from captions provides users with a clear procedure to reference various panels and report items directly from their reStructuredText documents. Including a brief warning about common pitfalls (such as ensuring proper hyperlink targets) might further advise users.


153-160: Linking Within Report Results:
Explaining the usage of snakemake.report_href() to generate internal report links is an excellent enhancement. The added details, including how to extend links with child paths, URL arguments, or anchors, make this feature more accessible.


247-253: Report Rendering Details:
The "Rendering reports" section now explains that all metadata (e.g. runtime statistics) are automatically collected from the .snakemake directory. This clarification improves transparency on how the report content is generated.


255-268: Self-Contained HTML Report Instructions:
The instructions for creating a self-contained HTML report (using snakemake --report and specifying a custom name with --report my_report.html) are clear and straightforward. Please confirm that the reference (snakefiles-self_contained_html_file) is correctly implemented elsewhere and that any file size or embedding limitations are documented if necessary.


275-290: Self-Contained ZIP Archive Option:
The new section on generating a self-contained ZIP archive report is comprehensive—it explains the use case for more complex reports and how to generate them by specifying a .zip file. This will be especially useful for large workflows with many samples. Just ensure that instructions regarding unpacking and identifying the report.html entry point are fully accurate.


292-302: Partial Reports Feature:
Allowing users to generate reports for only specific targets (e.g. by running snakemake fig1.svg --report report-short.html) is a practical addition for iterative development and debugging. It might be helpful to briefly mention any restrictions or interactions with other reporting options.


306-318: Custom Report Layout:
The section on custom layouts explains how to apply a custom stylesheet to a report for branding or aesthetic purposes. The provided example and references to downloadable source files are especially valuable. This addition offers significant flexibility for advanced users who wish to tailor the report’s appearance.

snakemake/cli.py (1)

957-966: Well-crafted documentation improvement

The updated help text for the --report argument provides much clearer guidance on how to use this feature. It effectively explains the distinction between using .html files for smaller datasets versus .zip files for larger ones, and includes helpful instructions for accessing the report after unpacking the archive.

Copy link
Copy Markdown
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that is missing here is to mention that both polars and pandas work for validate here: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🔭 Outside diff range comments (2)
.github/workflows/main.yml (2)

94-98: ⚠️ Potential issue

Duplicate and Malformed AppArmor Restriction Step

There are two steps named "Disable apparmor namespace restrictions for apptainer," and the second instance (lines 96–98) contains a redirection (> /etc/sysctl.d/90-disable-userns-restrictions.conf') without a preceding command. This looks like a syntax error and duplication. Please consolidate these steps and ensure the intended command is complete. For example, you might want to create or overwrite the configuration file with valid shell syntax.


147-154: ⚠️ Potential issue

Duplicate 'with:' Block in Artifact Upload Step

In the "Upload report" step, there are two separate with: blocks. This duplication can lead to YAML parsing issues. Remove the redundant block and merge the parameters so that only one with: block is present. For example:

-        with:
-          name: report-${{ matrix.py_ver }}.zip
-        with:
-          name: report-${{ matrix.py_ver }}.zip
-          path: tests/test_report/report.zip
+        with:
+          name: report-${{ matrix.py_ver }}.zip
+          path: tests/test_report/report.zip
🧹 Nitpick comments (2)
.github/workflows/main.yml (2)

127-133: Duplicate Test File Entries in Pytest Command

The pytest command lists tests/test_io.py and tests/test_schema.py twice. Consolidating these duplicate entries would streamline test execution and prevent redundancy.


184-190: Repeated In-Place File Editing in Windows Testing

Within the Windows testing steps (lines 184–190), the script processes the file test-environment.yml twice using near-identical loops. If this repetition is unintentional, consolidating the loops will reduce redundant file operations. If it is intentional, consider adding a comment to clarify the purpose.

🛑 Comments failed to post (1)
snakemake/ioutils/input.py (1)

12-30: ⚠️ Potential issue

Fix issues with lambda, error handling, and undefined WorkflowError.

There are a few issues that need addressing:

  1. Using a lambda for fix_file_name is discouraged (E731)
  2. WorkflowError is undefined (F821)
  3. The exception isn't properly chained (B904)
-def extract_checksum(infile, **kwargs):
+from snakemake.exceptions import WorkflowError
+
+def extract_checksum(infile, **kwargs):
     try:
         import pandas as pd
 
-        fix_file_name = lambda x: x.removeprefix("./")
+        def fix_file_name(x):
+            return x.removeprefix("./")
+        
         return (
             pd.read_csv(
                 infile,
                 sep="  ",
                 header=None,
                 engine="python",
                 converters={1: fix_file_name},
             )
             .set_index(1)
             .loc[fix_file_name(kwargs.get("file"))]
             .item()
         )
     except ImportError:
-        raise WorkflowError("Pandas is required to extract checksum from file.")
+        raise WorkflowError("Pandas is required to extract checksum from file.") from None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from snakemake.exceptions import WorkflowError

def extract_checksum(infile, **kwargs):
    try:
        import pandas as pd

        def fix_file_name(x):
            return x.removeprefix("./")
        
        return (
            pd.read_csv(
                infile,
                sep="  ",
                header=None,
                engine="python",
                converters={1: fix_file_name},
            )
            .set_index(1)
            .loc[fix_file_name(kwargs.get("file"))]
            .item()
        )
    except ImportError:
        raise WorkflowError("Pandas is required to extract checksum from file.") from None
🧰 Tools
🪛 Ruff (0.8.2)

16-16: Do not assign a lambda expression, use a def

Rewrite fix_file_name as a def

(E731)


30-30: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


30-30: Undefined name WorkflowError

(F821)

@johanneskoester johanneskoester merged commit c7473a6 into snakemake:main Mar 11, 2025
41 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Snakemake Hackathon March 2025 Mar 11, 2025
@fgvieira fgvieira deleted the polars_validate branch March 11, 2025 15:46
johanneskoester pushed a commit that referenced this pull request Mar 14, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.0.0](v8.30.0...v9.0.0)
(2025-03-14)


### ⚠ BREAKING CHANGES

* Logging refactor & add LoggerPluginInterface
([#3107](#3107))

### Features

* [#3412](#3412) - keep
shadow folder of failed job if --keep-incomplete flag is set.
([#3430](#3430))
([22978c3](22978c3))
* add flag --report-after-run to automatically generate the report after
a successfull workflow run
([#3428](#3428))
([b0a7f03](b0a7f03))
* add flatten function to IO utils
([#3424](#3424))
([67fa392](67fa392))
* add helper functions to parse input files
([#2918](#2918))
([63e45a7](63e45a7))
* Add option to print redacted file names
([#3089](#3089))
([ba4d264](ba4d264))
* add support for validation of polars dataframe and lazyframe
([#3262](#3262))
([c7473a6](c7473a6))
* added support for rendering dag with mermaid js
([#3409](#3409))
([7bf8381](7bf8381))
* adding --replace-workflow-config to fully replace workflow configs
(from config: directive) with --configfile, instead of merging them
([#3381](#3381))
([47504a0](47504a0))
* Dynamic module name
([#3401](#3401))
([024dc32](024dc32))
* Enable saving and reloading IOCache object
([#3386](#3386))
([c935953](c935953))
* files added in rule params with workflow.source_path will be available
in used containers
([#3385](#3385))
([a6e45bf](a6e45bf))
* Fix keep_local in storage directive and more freedom over remote
retrieval behaviour
([#3410](#3410))
([67b4739](67b4739))
* inherit parameters of use rule and extend/replace individual items
them when using 'with' directive
([#3365](#3365))
([93e4b92](93e4b92))
* Logging refactor & add LoggerPluginInterface
([#3107](#3107))
([86f1d6e](86f1d6e))
* Maximal file size for checksums
([#3368](#3368))
([b039f8a](b039f8a))
* Modernize package configuration using Pixi
([#3369](#3369))
([77992d8](77992d8))
* multiext support for named input/output
([#3372](#3372))
([05e1378](05e1378))
* optionally auto-group jobs via temp files in case of remote execution
([#3378](#3378))
([cc9bba2](cc9bba2))


### Bug Fixes

* `--delete-all-output` ignores `--dry-run`
([#3265](#3265))
([23fef82](23fef82))
* 3342 faster touch runs and warning messages for non-existing files
([#3398](#3398))
([cd9c3c3](cd9c3c3))
* add default value to max-jobs-per-timespan
([#3043](#3043))
([2959abe](2959abe))
* checkpoints inside modules are overwritten
([#3359](#3359))
([fba3ac7](fba3ac7))
* Convert Path to IOFile
([#3405](#3405))
([c58684c](c58684c))
* Do not perform storage object cleanup with --keep-storage-local-copies
set ([#3358](#3358))
([9a6d14b](9a6d14b))
* edgecases of source deployment in case of remote execution
([#3396](#3396))
([5da13be](5da13be))
* enhance error message formatting for strict DAG-building mode
([#3376](#3376))
([a1c39ee](a1c39ee))
* fix bug in checkpoint handling that led to exceptions in case
checkpoint output was missing upon rerun
([#3423](#3423))
([8cf4a2f](8cf4a2f))
* force check all required outputs
([#3341](#3341))
([495a4e7](495a4e7))
* group job formatting
([#3442](#3442))
([f0b10a3](f0b10a3))
* in remote jobs, upload storage in topological order such that
modification dates are preserved (e.g. in case of group jobs)
([#3377](#3377))
([eace08f](eace08f))
* only skip eval when resource depends on input
([#3374](#3374))
([4574c92](4574c92))
* Prevent execution of conda in apptainer when not explicitly requested
in software deployment method
([#3388](#3388))
([c43c5c0](c43c5c0))
* print filenames with quotes around them in RuleException
([#3269](#3269))
([6baeda5](6baeda5))
* Re-evaluation of free resources
([#3399](#3399))
([6371293](6371293))
* ReadTheDocs layout issue due to src directory change
([#3419](#3419))
([695b127](695b127))
* robustly escaping quotes in generated bash scripts (v2)
([#3297](#3297))
([#3389](#3389))
([58720bd](58720bd))
* Show apptainer image URL in snakemake report
([#3407](#3407))
([45f0450](45f0450))
* Update ReadTheDocs configuration for documentation build to use Pixi
([#3433](#3433))
([3f227a6](3f227a6))


### Documentation

* Add pixi setup instructions to general use tutorial
([#3382](#3382))
([115e81b](115e81b))
* fix contribution section heading levels, fix docs testing setup order
([#3360](#3360))
([051dc53](051dc53))
* fix link to github.com/snakemake/poetry-snakemake-plugin
([#3436](#3436))
([ec6d97c](ec6d97c))
* fix quoting
([#3394](#3394))
([b40f599](b40f599))
* fix rerun-triggers default
([#3403](#3403))
([4430e23](4430e23))
* fix typo 'safe' -&gt; 'save'
([#3384](#3384))
([7755861](7755861))
* mention code formatting in the contribution section
([#3431](#3431))
([e8682b7](e8682b7))
* remove duplicated 'functions'.
([#3356](#3356))
([7c595db](7c595db))
* update broken links documentation
([#3437](#3437))
([e3d0d88](e3d0d88))
* Updating contributing guidelines with new pixi dev setup
([#3415](#3415))
([8e95a12](8e95a12))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>
kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
…nakemake#3262)

<!--Add a description of your PR here-->

Implement validation for polars DataFrame and LazyFrame.

Re-factoring the setting of default values.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced sample data validation now supports both Pandas and Polars
data frames for improved reliability and performance.
- Introduced new methods for reading sample data using Polars, expanding
data handling options.
- Added support for executing Xonsh scripts within Snakemake workflows.
  - New rule added for running Python scripts with Conda environments.
- New functionality for generating self-contained HTML reports,
including default statistics and user-specified results.
- New functions added for parsing input files and extracting checksums.

- **Bug Fixes**
- Improved error handling for validation failures, providing more
specific error messages.

- **Documentation**
- Updated the sample metadata schema with new fields for replicate count
and tissue origin, alongside a refined description for sample condition.
- Clarified usage of conda environments and apptainer integration within
Snakemake workflows.
- Expanded guidance on generating, customizing, and sharing reports in
Snakemake.
- Added documentation for integrating Xonsh scripts into Snakemake
rules.
- Updated help text for the `--keep-storage-local-copies` argument to
enhance clarity and usability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.0.0](snakemake/snakemake@v8.30.0...v9.0.0)
(2025-03-14)


### ⚠ BREAKING CHANGES

* Logging refactor & add LoggerPluginInterface
([snakemake#3107](snakemake#3107))

### Features

* [snakemake#3412](snakemake#3412) - keep
shadow folder of failed job if --keep-incomplete flag is set.
([snakemake#3430](snakemake#3430))
([22978c3](snakemake@22978c3))
* add flag --report-after-run to automatically generate the report after
a successfull workflow run
([snakemake#3428](snakemake#3428))
([b0a7f03](snakemake@b0a7f03))
* add flatten function to IO utils
([snakemake#3424](snakemake#3424))
([67fa392](snakemake@67fa392))
* add helper functions to parse input files
([snakemake#2918](snakemake#2918))
([63e45a7](snakemake@63e45a7))
* Add option to print redacted file names
([snakemake#3089](snakemake#3089))
([ba4d264](snakemake@ba4d264))
* add support for validation of polars dataframe and lazyframe
([snakemake#3262](snakemake#3262))
([c7473a6](snakemake@c7473a6))
* added support for rendering dag with mermaid js
([snakemake#3409](snakemake#3409))
([7bf8381](snakemake@7bf8381))
* adding --replace-workflow-config to fully replace workflow configs
(from config: directive) with --configfile, instead of merging them
([snakemake#3381](snakemake#3381))
([47504a0](snakemake@47504a0))
* Dynamic module name
([snakemake#3401](snakemake#3401))
([024dc32](snakemake@024dc32))
* Enable saving and reloading IOCache object
([snakemake#3386](snakemake#3386))
([c935953](snakemake@c935953))
* files added in rule params with workflow.source_path will be available
in used containers
([snakemake#3385](snakemake#3385))
([a6e45bf](snakemake@a6e45bf))
* Fix keep_local in storage directive and more freedom over remote
retrieval behaviour
([snakemake#3410](snakemake#3410))
([67b4739](snakemake@67b4739))
* inherit parameters of use rule and extend/replace individual items
them when using 'with' directive
([snakemake#3365](snakemake#3365))
([93e4b92](snakemake@93e4b92))
* Logging refactor & add LoggerPluginInterface
([snakemake#3107](snakemake#3107))
([86f1d6e](snakemake@86f1d6e))
* Maximal file size for checksums
([snakemake#3368](snakemake#3368))
([b039f8a](snakemake@b039f8a))
* Modernize package configuration using Pixi
([snakemake#3369](snakemake#3369))
([77992d8](snakemake@77992d8))
* multiext support for named input/output
([snakemake#3372](snakemake#3372))
([05e1378](snakemake@05e1378))
* optionally auto-group jobs via temp files in case of remote execution
([snakemake#3378](snakemake#3378))
([cc9bba2](snakemake@cc9bba2))


### Bug Fixes

* `--delete-all-output` ignores `--dry-run`
([snakemake#3265](snakemake#3265))
([23fef82](snakemake@23fef82))
* 3342 faster touch runs and warning messages for non-existing files
([snakemake#3398](snakemake#3398))
([cd9c3c3](snakemake@cd9c3c3))
* add default value to max-jobs-per-timespan
([snakemake#3043](snakemake#3043))
([2959abe](snakemake@2959abe))
* checkpoints inside modules are overwritten
([snakemake#3359](snakemake#3359))
([fba3ac7](snakemake@fba3ac7))
* Convert Path to IOFile
([snakemake#3405](snakemake#3405))
([c58684c](snakemake@c58684c))
* Do not perform storage object cleanup with --keep-storage-local-copies
set ([snakemake#3358](snakemake#3358))
([9a6d14b](snakemake@9a6d14b))
* edgecases of source deployment in case of remote execution
([snakemake#3396](snakemake#3396))
([5da13be](snakemake@5da13be))
* enhance error message formatting for strict DAG-building mode
([snakemake#3376](snakemake#3376))
([a1c39ee](snakemake@a1c39ee))
* fix bug in checkpoint handling that led to exceptions in case
checkpoint output was missing upon rerun
([snakemake#3423](snakemake#3423))
([8cf4a2f](snakemake@8cf4a2f))
* force check all required outputs
([snakemake#3341](snakemake#3341))
([495a4e7](snakemake@495a4e7))
* group job formatting
([snakemake#3442](snakemake#3442))
([f0b10a3](snakemake@f0b10a3))
* in remote jobs, upload storage in topological order such that
modification dates are preserved (e.g. in case of group jobs)
([snakemake#3377](snakemake#3377))
([eace08f](snakemake@eace08f))
* only skip eval when resource depends on input
([snakemake#3374](snakemake#3374))
([4574c92](snakemake@4574c92))
* Prevent execution of conda in apptainer when not explicitly requested
in software deployment method
([snakemake#3388](snakemake#3388))
([c43c5c0](snakemake@c43c5c0))
* print filenames with quotes around them in RuleException
([snakemake#3269](snakemake#3269))
([6baeda5](snakemake@6baeda5))
* Re-evaluation of free resources
([snakemake#3399](snakemake#3399))
([6371293](snakemake@6371293))
* ReadTheDocs layout issue due to src directory change
([snakemake#3419](snakemake#3419))
([695b127](snakemake@695b127))
* robustly escaping quotes in generated bash scripts (v2)
([snakemake#3297](snakemake#3297))
([snakemake#3389](snakemake#3389))
([58720bd](snakemake@58720bd))
* Show apptainer image URL in snakemake report
([snakemake#3407](snakemake#3407))
([45f0450](snakemake@45f0450))
* Update ReadTheDocs configuration for documentation build to use Pixi
([snakemake#3433](snakemake#3433))
([3f227a6](snakemake@3f227a6))


### Documentation

* Add pixi setup instructions to general use tutorial
([snakemake#3382](snakemake#3382))
([115e81b](snakemake@115e81b))
* fix contribution section heading levels, fix docs testing setup order
([snakemake#3360](snakemake#3360))
([051dc53](snakemake@051dc53))
* fix link to github.com/snakemake/poetry-snakemake-plugin
([snakemake#3436](snakemake#3436))
([ec6d97c](snakemake@ec6d97c))
* fix quoting
([snakemake#3394](snakemake#3394))
([b40f599](snakemake@b40f599))
* fix rerun-triggers default
([snakemake#3403](snakemake#3403))
([4430e23](snakemake@4430e23))
* fix typo 'safe' -&gt; 'save'
([snakemake#3384](snakemake#3384))
([7755861](snakemake@7755861))
* mention code formatting in the contribution section
([snakemake#3431](snakemake#3431))
([e8682b7](snakemake@e8682b7))
* remove duplicated 'functions'.
([snakemake#3356](snakemake#3356))
([7c595db](snakemake@7c595db))
* update broken links documentation
([snakemake#3437](snakemake#3437))
([e3d0d88](snakemake@e3d0d88))
* Updating contributing guidelines with new pixi dev setup
([snakemake#3415](snakemake#3415))
([8e95a12](snakemake@8e95a12))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: snakemake-bot <snakemake-bot-admin@googlegroups.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants