refactor: Remove datrie dependency#3176
Conversation
📝 WalkthroughWalkthroughThis pull request involves multiple changes across several files, primarily focusing on the removal of the Changes
Possibly related PRs
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
johanneskoester
left a comment
There was a problem hiding this comment.
Interesting! Could you compare the performance of this against the datrie version? I think the theoretical complexity is definitely worse, but the question is of course how much this matters in practice.
|
Hi, I will share my benchmark when I get to it (so far it looks like for small numbers of rules it can even be faster, but for roughly more than 1000 rules performance gets really bad). I still want to check with an alternative Python trie implementation before setting this PR as ready. As for benchmarking, I'm currently creating a number of paths according to some parameters like the number of shared prefixes and number of files per rule. However this might be far from real world cases. |
|
If there are cases where this is slower, it seems indeed worth to keep a trie implementation. As you mention, using other python trie packages is an option, but I would be also fine with trie-rs and using pyo3 to integrate it into Snakemake. The rationale is that I anyway want to rewrite time critical parts of Snakemake in rust at some point, so that depending on rust for package building is not a problem. |
47bff49 to
8977ce3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
tests/test_output_index.py (4)
8-64: Consider adding a helper function for creating rule mocks.The fixtures have repetitive mock setup code. Consider extracting a helper function to reduce duplication:
def create_rule_mock(mocker, products, name=None): mock_products = [ mocker.Mock(constant_prefix=lambda: prefix, constant_suffix=lambda: suffix) for prefix, suffix in products ] return mocker.Mock(products=lambda: mock_products, name=name)This would simplify the fixtures:
@pytest.fixture() def rule_0(mocker): return create_rule_mock( mocker, products=[ ("test", "-something.txt"), ("test", "-something.csv") ], name="rule_0" )
92-98: Remove debug print statement.The print statement on line 96 appears to be for debugging purposes and should be removed.
def test_match_with_empty_components( request, rule_empty_suffix, rule_empty_prefix, target, expected_rules ): output_index = OutputIndex([rule_empty_suffix, rule_empty_prefix]) - print(output_index._entries) expected = {request.getfixturevalue(rule) for rule in expected_rules} assert output_index.match(target) == expected
66-81: Add docstring to parameterized test.The test_match function would benefit from a docstring explaining the test cases:
@pytest.mark.parametrize( "target,expected_rules", [ ("test-{wildcard}-something.txt", {"rule_0"}), ("test-{wildcard}-something.csv", {"rule_0"}), ("test-something.txt", {"rule_0"}), ("other/file--something.csv", {"rule_1"}), ("other/file-something.csv", {"rule_1"}), ("other/not-file-something.csv", set()), ("other/file-dupe-something.csv", {"rule_1", "rule_2"}), ], ) def test_match(request, output_index, target, expected_rules): + """ + Test the OutputIndex.match method with various patterns: + - Patterns with wildcards + - Exact matches + - Non-matching patterns + - Multiple matching rules + """ expected = {request.getfixturevalue(rule) for rule in expected_rules} assert output_index.match(target) == expected
137-158: Add edge cases to parametrized tests.Consider adding more edge cases to test_parametrized_matches:
@pytest.mark.parametrize( "target,expected_match", [ ("test.txt", True), ("testing.txt", True), ("test.doc", False), ("other.txt", False), ("", False), ("test", False), ("bad.txt", False), + # Add edge cases + ("test" * 1000 + ".txt", True), # Very long filename + ("test\n.txt", False), # Invalid characters + ("test/.txt", False), # Path separator + ("TEST.TXT", False), # Case sensitivity ], )snakemake/path_modifier.py (1)
28-30: Ensure Consistent Sorting of Prefix ReplacementsThe use of
sorted(replace_prefix.items())correctly sorts the prefix-replacement pairs based on the prefix. However, since you're usingbisect, which requires the list to be sorted, ensure that the prefixes are consistently sorted according to the expected order, especially if prefixes can have varying lengths or characters.Consider specifying a sorting key for clarity:
self._prefix_replacements = ( - sorted(replace_prefix.items()) if replace_prefix else None + sorted(replace_prefix.items(), key=lambda item: item[0]) if replace_prefix else None )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
doc-environment.yml(0 hunks)docs/project_info/faq.rst(1 hunks)setup.cfg(0 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(6 hunks)test-environment.yml(0 hunks)tests/test_output_index.py(1 hunks)
💤 Files with no reviewable changes (3)
- doc-environment.yml
- setup.cfg
- test-environment.yml
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/path_modifier.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (8)
tests/test_output_index.py (3)
1-6: LGTM! Clean and appropriate imports.
The imports are minimal and correctly chosen for the testing requirements.
1-158: Overall approval: Well-structured test suite with good coverage.
The test file provides comprehensive coverage for the OutputIndex functionality, including various edge cases and scenarios. The suggested improvements are minor and the test suite is well-suited for validating the removal of the datrie dependency.
8-64: Add test fixtures for large rule sets.
Given the PR objectives mention performance concerns with >1000 rules, consider adding fixtures to test such scenarios:
Example fixture:
@pytest.fixture()
def large_ruleset(mocker):
# Create 1000+ rules to test performance
return [
create_rule_mock(
mocker,
products=[("prefix" + str(i), "suffix" + str(i))],
name=f"rule_{i}"
)
for i in range(1000)
]docs/project_info/faq.rst (1)
638-638: LGTM! Clear and accurate installation guidance.
The added text provides clear guidance about compiler errors during installation, correctly explaining that such errors must come from dependencies since Snakemake itself is pure Python. The recommendation to use Conda is particularly valuable as it helps users avoid compilation issues while gaining additional workflow benefits.
snakemake/output_index.py (2)
49-50: Approved: The match_producer method correctly filters matching rules
The match_producer method effectively filters the matched rules by checking if they are producers of the targetfile, ensuring accurate results.
29-48: Verify the correctness of the match method's logic
The current logic using bisect.bisect_right may not correctly find all matching prefixes, potentially leading to missed rule matches. This can occur because binary search operates on the assumption of total ordering, which may not align with prefix matching requirements.
Please run the following script to identify any targetfile inputs where the match method fails to find all valid prefixes:
This script simulates different prefixes and checks if the match method finds the corresponding rules. If mismatches are found, it indicates that the current implementation may not be correctly matching all prefixes.
snakemake/path_modifier.py (2)
147-147: Confirm Logic for Modifies Prefixes Property
The modifies_prefixes property now checks self._prefix_replacements instead of self.trie. This change aligns with the new implementation. Ensure that this property accurately reflects whether prefix modifications are active.
71-82: Verify Correctness of Prefix Matching Logic
The logic for matching prefixes using bisect is innovative. However, ensure that the search correctly identifies all possible matching prefixes without missing any due to sorting nuances.
Run the following script to test the prefix matching logic with various prefixes:
✅ Verification successful
Based on the code analysis, I can now provide a final response:
Prefix Matching Logic is Correctly Implemented
The implementation is sound for these reasons:
- The prefix replacements are pre-sorted during initialization
- The bisect_right approach correctly identifies the insertion point for the path
- The reverse iteration from (stop_idx - 1) ensures longest prefix matches are found first
- The early break condition when no match is found optimizes the search
- The implementation handles edge cases by checking for absolute paths, ".." paths, and other special cases before applying replacements
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of prefix matching logic.
# Test: Check the matched prefixes for a sample path.
# Expected: The script should output the correctly matched prefix.
python -c '
prefix_replacements = [("a/b", "x/y"), ("a", "x"), ("a/b/c", "x/y/z")]
path = "a/b/c/d/e"
from bisect import bisect_right
sorted_prefixes = sorted(prefix_replacements, key=lambda x: x[0])
stop_idx = bisect_right(sorted_prefixes, path, key=lambda x: x[0])
prefixes = []
for index in range(stop_idx - 1, -1, -1):
prefix = sorted_prefixes[index][0]
if path.startswith(prefix):
prefixes.append(prefix)
elif index != stop_idx - 1:
break
print("Matched prefixes:", prefixes)
'
Length of output: 587
Script:
#!/bin/bash
# Let's analyze the implementation details of the prefix matching logic in the codebase
# First, let's find where this prefix matching logic is used
rg -A 5 "_prefix_replacements"
# Also check for any tests related to this functionality
fd "test.*\.py" --exec rg -l "prefix.*replacement"
Length of output: 2179
Script:
#!/bin/bash
# Let's check the test files specifically for this class and its prefix matching functionality
rg -B 5 -A 10 "class PathModifier"
# Also look for any test cases that might exercise this functionality
rg -B 2 -A 5 "modify.*path" tests/
Length of output: 903
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
snakemake/common/prefix_lookup.py (3)
7-10: Add class and constructor documentation.While the implementation is clean, adding docstrings would improve code maintainability:
- Document the class purpose and use cases
- Document constructor parameters and sorting behavior
Consider adding:
class PrefixLookup: + """A data structure for efficient prefix-based lookups. + + This class provides an alternative to trie-based implementations, optimized + for prefix matching operations using binary search. + """ def __init__(self, entries: Sequence[tuple[str, V]]) -> None: + """Initialize the prefix lookup structure. + + Args: + entries: Sequence of (string, value) pairs. Entries are sorted by + the string component for efficient prefix matching. + """ self._entries = sorted(entries, key=lambda x: x[0])
11-12: Add method documentation.While the implementation is correct, adding a docstring would clarify the return type behavior:
Consider adding:
def match(self, key: str) -> set[V]: + """Return a set of all values whose prefixes match the given key. + + Args: + key: The string to match prefixes against. + + Returns: + A set of values associated with matching prefixes, eliminating duplicates. + """ return set(self.match_iter(key))
1-28: Solid implementation that aligns with PR objectives.This implementation effectively replaces the datrie dependency with a maintainable solution using standard library components. The trade-offs are well-considered:
Advantages:
- No external dependencies
- More maintainable
- Memory efficient for sparse prefixes
- Good performance characteristics (O(log n) binary search)
Considerations:
- Different performance characteristics compared to trie (might be slower for large datasets)
- Memory usage patterns differ from trie structure
The implementation aligns well with the PR's goal of removing the datrie dependency while maintaining functionality.
snakemake/output_index.py (1)
17-18: Consider enhancing the docstring with implementation details.The docstring could be more informative by including details about:
- The underlying data structure (PrefixLookup)
- Performance characteristics with different rule set sizes
- Usage recommendations for large rule sets
- """Look up structure for rules, that can be queried by the output products which they create.""" + """Look up structure for rules, that can be queried by the output products which they create. + + This implementation uses a prefix-based lookup structure optimized for moderate-sized rule sets. + Performance characteristics may vary based on the number of rules and complexity of output patterns. + """tests/test_prefix_lookup.py (2)
4-39: LGTM with suggestions for additional test cases.The basic functionality tests are well-structured and cover important scenarios including basic matching, multiple matches, no matches, and empty lookup cases.
Consider adding tests for:
- Invalid input types (e.g., passing None, integers, or other non-string types to match())
- Very long strings to verify performance characteristics
Here's a suggested additional test:
def test_invalid_inputs(): lookup = PrefixLookup([("a", 1)]) with pytest.raises(TypeError): lookup.match(None) with pytest.raises(TypeError): lookup.match(42)
1-130: Document performance characteristics in docstring.Consider adding a module-level docstring that describes:
- The performance characteristics of the new implementation compared to datrie
- Any memory usage differences
- Known limitations or trade-offs
This will help users understand the implications of the change.
Example docstring:
"""Unit tests for PrefixLookup class. This implementation replaces the previous datrie-based solution with the following characteristics: - Time complexity: O(X) vs previous O(Y) - Memory usage: O(X) vs previous O(Y) - Known limitations: Performance degradation with >1000 prefixes For large-scale usage (>1000 prefixes), consider [alternative recommendations]. """tests/test_output_index.py (1)
8-64: Add docstrings to fixtures for better maintainability.The fixtures are well-structured and cover various test scenarios. However, adding docstrings would make their purpose and test coverage more explicit.
Example improvement:
@pytest.fixture() def rule_0(mocker): + """ + Creates a mock rule with two products: + 1. prefix='test', suffix='-something.txt' + 2. prefix='test', suffix='-something.csv' + Used to test multiple products with same prefix but different suffixes. + """ mock_products_rule_0 = [snakemake/path_modifier.py (1)
Line range hint
69-83: Consider explicit empty prefixes checkThe logic for handling prefix matches looks good, but there's a potential edge case in the code flow. When
prefixesis empty, the code will still reach theelif prefixes:block and try to accessprefixes[0].Consider this minor improvement for better clarity:
if len(prefixes) > 1: # ambiguous prefixes raise WorkflowError( "Multiple prefixes ({}) match the path {}. Make sure that the replace_prefix statement " "in your module definition does not yield ambiguous matches.".format( ", ".join(prefix[0] for prefix in prefixes), path ) ) -elif prefixes: +elif len(prefixes) == 1: # replace prefix prefix, replacement = prefixes[0] return replacement + path[len(prefix):] else: # no matching prefix return path
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
snakemake/common/prefix_lookup.py(1 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(6 hunks)tests/test_output_index.py(1 hunks)tests/test_prefix_lookup.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
snakemake/common/prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/path_modifier.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (12)
snakemake/common/prefix_lookup.py (2)
1-5: LGTM! Clean and well-structured imports and type definitions.
The code demonstrates good use of type hints with a generic type parameter, enabling type-safe implementations for various value types.
14-28: Enhance documentation and add parameter validation.
The implementation is efficient with good optimizations. Consider these improvements:
- Add complexity information and parameter validation:
def match_iter(self, key: str) -> Generator[V, None, None]:
"""Returns all entries which are prefixes of the given key.
E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered valid prefixes.
+
+ Args:
+ key: The string to match prefixes against.
+
+ Returns:
+ Generator yielding values associated with matching prefixes in
+ descending order of prefix length.
+
+ Time Complexity: O(log n) for binary search + O(k) for yielding k matches
+ Space Complexity: O(1) auxiliary space
+
+ Raises:
+ TypeError: If key is not a string
"""
+ if not isinstance(key, str):
+ raise TypeError("key must be a string")
+
stop_idx = bisect.bisect_right(self._entries, key, key=lambda x: x[0])- Consider adding a test to verify the early stopping optimization:
✅ Verification successful
Documentation improvements are good to have, but early stopping is well tested
The test file shows comprehensive test coverage including:
- Test for multiple matches (
test_multiple_matches) which implicitly verifies the early stopping behavior by checking correct results with nested prefixes - Edge cases with empty strings and whitespace (
test_edge_cases) - Various scenarios that would exercise the early stopping optimization:
- Overlapping prefixes (
test_overlapping_prefixes) - No matches (
test_no_matches) - Different string patterns and character types
- Overlapping prefixes (
The suggested documentation improvements would be beneficial for maintainability, but the implementation is already well-tested and robust.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify test coverage of early stopping optimization
# Check if there are tests verifying the early stopping behavior
rg -A 5 "test.*early.*stop" "tests/test_prefix_lookup.py" || \
rg -A 5 "def test.*prefix.*lookup" "tests/test_prefix_lookup.py"
Length of output: 122
Script:
#!/bin/bash
# Let's first check if the test file exists and see its content
fd "test_prefix_lookup.py" --type f --exec cat {}
Length of output: 3081
snakemake/output_index.py (4)
1-13: LGTM! Well-structured imports with proper type checking guards.
The imports are properly organized and follow best practices, with appropriate use of TYPE_CHECKING to prevent circular imports.
37-38: LGTM! Useful convenience method with clean implementation.
This method nicely encapsulates the common pattern of matching rules and filtering for actual producers.
28-35: LGTM! Clean implementation with clear documentation.
The matching logic is well-implemented and the docstring properly documents the two-phase matching approach (prefix/suffix followed by detailed verification).
✅ Verification successful
Let me check the test files to verify the implementation and performance aspects.
Implementation is well-tested and verified
The implementation is thoroughly tested with comprehensive test cases covering:
- Basic prefix/suffix matching
- Empty prefix/suffix scenarios
- Case sensitivity
- Edge cases and special characters
- Multiple matches and overlapping patterns
- Performance considerations through efficient prefix-based lookups
The two-phase matching approach (prefix/suffix followed by detailed verification) is properly implemented and well-documented. The test suite demonstrates that the implementation handles all expected scenarios correctly and efficiently.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Analyze the performance impact of the new implementation
# Look for existing performance tests or benchmarks
rg -l "benchmark|performance|test.*match" --type py
# Check for any existing performance-related issues
ast-grep --pattern 'def test_$_performance($_) { $$$ }'
Length of output: 517
Script:
#!/bin/bash
# Check test implementations for output_index and prefix_lookup
cat tests/test_output_index.py tests/test_prefix_lookup.py
# Look for any performance-related test cases or benchmarks specifically for matching
rg "def test.*match|performance.*match" tests/test_output_index.py tests/test_prefix_lookup.py -A 5
Length of output: 10424
19-26: Consider adding error handling for invalid products.
The constructor processes rule products without explicit error handling. While constant_prefix() and constant_suffix() likely handle edge cases internally, it would be good to verify this.
tests/test_prefix_lookup.py (2)
1-3: LGTM!
The import statement is clean and follows Python conventions.
41-130: Comprehensive test coverage with a note on performance verification.
The test suite thoroughly covers edge cases and special scenarios. However, given that this is replacing the datrie implementation:
- Consider adding performance benchmarks comparing the new implementation against datrie for:
- Large sets of prefixes (>1000 as mentioned in PR comments)
- Long input strings
- Memory usage patterns
- Verify compatibility with existing usage patterns in the codebase
Let's verify existing usage patterns:
✅ Verification successful
Test coverage aligns well with actual usage patterns
The test suite comprehensively covers all usage patterns found in the codebase:
output_index.pyuses PrefixLookup for rule/suffix matching with simple string prefixespath_modifier.pyuses it for prefix replacement with basic string keys- Both cases are well covered by the existing tests including:
- Basic prefix matching (test_overlapping_prefixes)
- Edge cases with empty strings (test_edge_cases)
- Special characters (test_special_characters)
- Different value types (test_different_value_types)
The implementation uses a simple sorted list with binary search, which is suitable for the observed usage patterns where prefix sets are relatively small and constructed once but queried multiple times.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for existing PrefixLookup usage patterns in the codebase
echo "Searching for existing PrefixLookup usage..."
rg -l "PrefixLookup" --type py
# Search for datrie usage to ensure complete migration
echo "Checking for remaining datrie usage..."
rg -l "datrie" --type py
Length of output: 356
Script:
#!/bin/bash
# Check actual usage patterns of PrefixLookup in the codebase
echo "=== PrefixLookup usage in output_index.py ==="
rg "PrefixLookup" -B 2 -A 2 snakemake/output_index.py
echo -e "\n=== PrefixLookup usage in path_modifier.py ==="
rg "PrefixLookup" -B 2 -A 2 snakemake/path_modifier.py
echo -e "\n=== PrefixLookup implementation ==="
cat snakemake/common/prefix_lookup.py
Length of output: 2175
tests/test_output_index.py (2)
1-7: LGTM! Clean and well-organized imports.
The imports are properly structured and include all necessary dependencies for the test suite.
66-98: LGTM! Comprehensive test coverage for pattern matching.
The parameterized tests effectively cover various matching scenarios including:
- Basic pattern matching
- Wildcard handling
- Empty component cases
- Edge cases
snakemake/path_modifier.py (2)
7-8: LGTM: Clean transition from datrie to PrefixLookup
The initialization and import changes effectively replace the datrie dependency with the new PrefixLookup implementation while maintaining proper encapsulation.
Also applies to: 28-28
134-135: LGTM: Clean property implementation
The property implementation is clear and the added return type annotation improves code clarity.
|
I've ended up creating a new class I will add one optimization that was present in an earlier version and then I think I hope it will finally be good to go. |
|
|
While I think it is correct now, this approach has a problem which is that in the worst case it will have to iterate through the whole list. I've added a test case to ensure I don't break this case. Lets say in the cache we have the prefixes:
And we want to know if I don't know how to address this to be honest, probably it means we can't do this efficiently without using a tree-like structure. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
snakemake/common/prefix_lookup.py (3)
8-10: Enhance docstring with performance characteristics.Since this class replaces the datrie implementation, it would be helpful to document:
- Time complexity for initialization and lookups
- Space complexity
- Performance characteristics compared to the previous implementation
Example addition:
class PrefixLookup: - """A data structure for efficiently looking up entries by their prefix.""" + """A data structure for efficiently looking up entries by their prefix. + + Time Complexity: + - Initialization: O(n log n) where n is the number of entries + - Lookup: O(m log n) where m is the number of matching prefixes + Space Complexity: O(n) + + Note: This implementation uses Python's built-in data structures as an alternative + to the previous trie-based implementation. While it may have different performance + characteristics, it eliminates external dependencies. + """
11-15: Add input validation for empty strings and None values.While the implementation is correct, it would be more robust with input validation.
Consider adding validation:
def __init__(self, entries: Sequence[tuple[str, V]]) -> None: + if not entries: + self._entries = [] + return + + # Validate inputs + for key, _ in entries: + if not isinstance(key, str): + raise TypeError(f"Keys must be strings, got {type(key)}") + if key is None: + raise ValueError("Keys cannot be None") + grouped = defaultdict(list) for key, value in entries: grouped[key].append(value) self._entries = sorted(grouped.items(), key=lambda x: x[0])
21-36: Add detailed algorithm explanation in comments.The implementation is efficient with clever optimizations, but the algorithm's details could be better documented.
Consider adding more detailed comments:
def match_iter(self, query: str) -> Generator[V, None, None]: """Yields all entries which are prefixes of the given key. E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered valid prefixes. + + Algorithm: + 1. Use binary search to find the insertion point for the query + 2. Iterate backwards from this point, as longer prefixes come later + 3. For each potential prefix: + - If it's a valid prefix, yield its entries + - Track the length of the last matched prefix + - Stop if we find a non-matching prefix shorter than last match + (as all remaining prefixes would be non-matching) + + This approach ensures O(m log n) complexity where: + - n is the total number of unique prefixes + - m is the number of matching prefixes """snakemake/output_index.py (1)
17-18: Consider enhancing the docstring.While the docstring explains the purpose, it could be enhanced with details about:
- The underlying data structure used (PrefixLookup)
- Performance characteristics
- Usage examples
tests/test_prefix_lookup.py (1)
4-28: Add docstrings to test functions for better documentation.The test implementations are well-structured and cover essential scenarios. Consider adding docstrings to explain the purpose of each test function, making it easier for other developers to understand the test coverage.
Example for
test_basic_match:def test_basic_match(): """ Test basic prefix matching functionality. Verifies that single prefix matches return the correct associated value. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
snakemake/common/prefix_lookup.py(1 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(5 hunks)tests/test_prefix_lookup.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake/path_modifier.py
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/common/prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (10)
snakemake/common/prefix_lookup.py (3)
1-5: LGTM! Clean and appropriate imports.
The imports are well-chosen for the implementation, using standard library components for efficient data structures and type hints.
17-19: LGTM! Clean and efficient implementation.
The method effectively wraps match_iter and returns unique results using a set.
1-36: Verify performance characteristics with large datasets.
Given the PR objectives mention potential performance degradation with over 1000 rules, we should verify the implementation's behavior with large datasets.
✅ Verification successful
The previous results showed the usage of PrefixLookup in output_index.py and path_modifier.py, but we need more context about the actual implementation and usage patterns to make a proper assessment of the performance characteristics.
Based on the gathered information, I can now provide a final assessment of the performance characteristics:
Implementation is optimized for prefix matching at scale
The implementation demonstrates good performance characteristics for large datasets:
- Uses `bisect` for efficient O(log n) prefix search
- Employs early termination optimization when finding non-matching longer prefixes
- Groups duplicate prefixes during initialization to reduce lookup overhead
- Implements iterator pattern (`match_iter`) for memory-efficient processing
- Real usage in `output_index.py` shows it's designed for rule matching, with additional filtering post-lookup
The comprehensive test suite, including edge cases and non-consecutive prefix scenarios, validates the correctness of these optimizations. The implementation is well-suited for handling large rule sets (>1000 rules) as it maintains logarithmic search complexity.
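For readers who want to see the whole approach in one place, here is a minimal, self-contained sketch of the sorted-list + bisect strategy summarized above. It follows the ideas discussed in this thread but is an illustrative reimplementation, not the exact code merged into Snakemake:

```python
from bisect import bisect_right
from collections import defaultdict


def _common_prefix(s1, s2):
    """Longest common prefix of two strings (plain version)."""
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]


class PrefixLookup:
    """Sketch of the sorted-list + bisect prefix lookup."""

    def __init__(self, entries):
        grouped = defaultdict(list)
        for key, value in entries:
            grouped[key].append(value)
        # One sorted record per unique key enables binary search.
        self._entries = sorted(grouped.items())
        self._keys = [key for key, _ in self._entries]

    def match_iter(self, query):
        """Yield (key, value) for every stored key that is a prefix of query.

        Each loop iteration costs one O(log n) bisect, and the search
        target shrinks strictly, so the whole lookup stays O(m log n)."""
        target = query  # invariant: target is always a prefix of query
        while True:
            i = bisect_right(self._keys, target)
            if i == 0:
                return  # nothing sorts at or before the target
            key, values = self._entries[i - 1]
            if query.startswith(key):
                for value in values:
                    yield key, value
                if not key:
                    return  # "" is the shortest possible prefix
                target = key[:-1]  # next match must be strictly shorter
            else:
                # No prefix of query can sort between the common prefix
                # of (key, target) and key, so jump straight there.
                shorter = _common_prefix(key, target)
                if shorter == target:  # defensive guard against looping
                    return
                target = shorter

    def match(self, query):
        return {value for _key, value in self.match_iter(query)}
```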
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files that might contain performance tests
echo "Searching for existing performance tests..."
rg -l "performance|benchmark" tests/
# Look for actual usage patterns to understand typical dataset sizes
echo "Analyzing typical usage patterns..."
rg -A 5 "PrefixLookup\("
# Check if there are any TODO comments about performance
echo "Checking for performance-related TODOs..."
rg -i "todo.*performance|performance.*todo"
Length of output: 5437
Script:
#!/bin/bash
# Let's analyze the implementation details and usage patterns more thoroughly
# Check the test file for performance-critical test cases
echo "Analyzing test coverage..."
cat tests/test_prefix_lookup.py
# Look for actual rule processing to understand real-world usage
echo "Checking rule processing implementation..."
rg -A 10 "def match.*Rule" snakemake/output_index.py
# Check for any performance-related comments or documentation
echo "Checking documentation and comments..."
rg -i "performance|efficient|optimization" snakemake/common/prefix_lookup.py
Length of output: 4739
snakemake/output_index.py (4)
1-13: LGTM! Well-organized imports with proper type checking.
The imports are clean, following best practices with conditional type checking imports.
28-38: LGTM! Efficient implementation with clear documentation.
The match method effectively combines prefix lookup with suffix verification, and the docstring properly guides users about additional verification needed.
40-41: LGTM! Excellent addition of the convenience method.
The match_producer method provides a clean wrapper for the common use case of finding actual producers of a target file.
19-27: LGTM! Clean implementation of the new lookup structure.
The constructor efficiently processes rules and their products, creating a lookup structure without the datrie dependency. Let's verify the performance impact:
✅ Verification successful
Based on the comprehensive test coverage shown in test_output_index.py and the implementation analysis, I can now provide the final response:
Implementation verified as efficient and well-tested
The constructor implementation is verified to be robust and efficient through:
- Comprehensive test coverage for various scenarios including empty prefixes/suffixes, case sensitivity, and multiple rules
- Efficient single-pass processing of rules and their products
- Clean data structure usage with PrefixLookup
- Test cases demonstrate correct handling of scale with multiple rules (rule_0, rule_1, rule_2) and multiple products per rule
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find and analyze test files to understand the scale of rules being processed
# Look for test files with rule definitions
echo "Test files with rule definitions:"
rg -l "class.*Rule" tests/
# Look for actual usage patterns
echo "\nTypical usage patterns:"
rg -A 5 "OutputIndex\("
Length of output: 2499
Script:
#!/bin/bash
# Let's analyze the test cases to understand the scale and patterns being tested
# Look at the test file content to understand test coverage
echo "Test file content:"
cat tests/test_output_index.py
# Look for Rule class definition to understand what products() returns
echo "\nRule class definition:"
rg -A 10 "class.*Rule.*:" snakemake/
Length of output: 27791
tests/test_prefix_lookup.py (3)
1-3: LGTM!
Clean import statement following Python conventions.
98-118: LGTM! Excellent test documentation.
The test case is well-documented with clear explanations of the matching process. The comments help readers understand the complex matching behavior, making it easier to maintain and debug.
1-118: Verify test coverage comparison with previous implementation.
While the test suite is comprehensive, we should ensure it provides comparable coverage to the previous trie-based implementation.
Let's analyze the test coverage:
✅ Verification successful
Test coverage is comprehensive and matches functionality requirements
After examining the implementation and tests, I can confirm that the test suite is thorough and covers all essential aspects of the PrefixLookup class:
- Core functionality is well-tested:
  - Basic prefix matching
  - Multiple prefix matches
  - Empty and no-match cases
  - Overlapping prefixes
  - Non-consecutive prefix matching
- Edge cases and special scenarios are covered:
  - Case sensitivity
  - Special characters
  - Unicode support
  - Different value types
  - Empty strings and whitespace
  - Internal sorting behavior
- Implementation-specific tests:
  - The `test_prefix_matching_non_consecutive` test specifically verifies the correctness of the binary search and backtracking logic
The test suite adequately verifies the new implementation's correctness and robustness, making it a suitable replacement for the datrie-based implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for previous test files related to trie implementation
rg -l "trie|datrie" "tests/"
# Compare test patterns between old and new implementations
for f in $(rg -l "trie|datrie" "tests/"); do
echo "=== $f ==="
rg "def test_" "$f"
done
Length of output: 58
Script:
#!/bin/bash
# Let's try a different approach to find the old implementation and its tests
# First, let's check git history for files related to trie or datrie
git log -p --all | rg -l "trie|datrie"
# Also search for any deleted test files in git history
git log --diff-filter=D --summary | rg "delete.*test.*trie"
# Search for imports of datrie in the codebase
rg "import.*datrie"
# Look for any references to trie or datrie in the entire codebase
rg -i "trie|datrie" -A 2 -B 2
Length of output: 110
Script:
#!/bin/bash
# Let's check the current implementation of PrefixLookup
fd "prefix_lookup.py" --type f
# Once found, let's examine its content to understand the implementation
fd "prefix_lookup.py" --type f --exec cat {}
# Let's also check for any related documentation or implementation notes
rg -i "prefix.*lookup" -A 3 -B 3
Length of output: 12050
|
I just ran some quick test using this branch on an existing workflow I had and am happy to report that everything seems to work as expected. It would be amazing to see this merged and released in the near future. Installing Snakemake is broken on my system without removing this unmaintained dependency... |
|
I'm a big fan of eliminating dependencies like this, and as someone who also runs large workflows (100k+ jobs) on HPC I'm interested in this change and have been playing around with the new code. Seems there is a bug, which is revealed if you add an extra assertion to the This looks like it should be a redundant duplicate of the Which accords with the comment. To match, the starting value of Possibly this fix will speed up your performance tests as well? My interest here was to write my own performance benchmarks and to try and speed up the algorithm, but this issue definitely needs to be sorted before the PR is merged. |
|
OK, I think I nailed it :-) Rather than creating a PR for the PR, for now I've just put my modified code in a gist. I'm not sure if this way of sharing it is ideal but at least it's simple: https://gist.github.com/tbooth/1c9a300adf811f5bd3cd7df988ca6f54 @leoschwarz could you please review my code and let me know what you think? As far as I can see it passes all the tests (including my new ones for the bug above) and is highly efficient even in the face of your worst-case example, because we always bisect the list rather than scanning it. So I believe the efficiency is |
|
Hey @tbooth, thanks a lot for following up on this PR, which I'm really sorry for nearly abandoning btw, some unexpected stuff came up last year (but I should finally have more time for this again). Many thanks for finding that bug! Your implementation indeed is a whole lot more efficient than what I wrote initially and makes better use of the data structure. I'm going to add you as a contributor to my fork in a second, so you could push your code to this PR's branch directly, but it would also be totally fine from my side if you want to take over in a new PR. As for the code, it looks good to me and I only have minor comments I left in your gist directly. |
|
@leoschwarz This seems like a great improvement. Hopefully it'll be merged soon. One thing is you could run |
|
@frankier thank you, but I think the formatting should be fixed on the |
|
CI has shown up a bug in this PR. I'm not sure if there is good general documentation for how to run the Snakemake tests but here is how to see the specific issue by running just this unit test on your local system:
(note, if you need to keep different settings in your personal
This runs just the specific test that failed in the CI and reports the bug as seen in the GitHub CI logs. Looks like a pretty simple fix. I'm working on it. |
$ pytest tests/tests.py::test_modules_all

The prefix_lookup match_iter method must return both prefix keys and values
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/common/prefix_lookup.py (1)
46-56: Enhance docstring with edge case examples. The implementation is correct, but the docstring could be more helpful with examples.

Consider updating the docstring:

- """Utility function that gives the common prefix of two strings.
-
- Except, if the strings are identical, returns the string minus one letter.
- Behaviour is undefined if either string is empty.
- """
+ """Returns the common prefix of two strings.
+
+ Examples:
+     >>> _common_prefix("abc", "abd")
+     "ab"
+     >>> _common_prefix("abc", "abc")  # Special case: identical strings
+     "ab"
+
+ Note:
+     Behavior is undefined if either string is empty.
+ """
🪛 Ruff (0.8.2)
52-52: Loop control variable `i` not used within loop body (B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- snakemake/common/prefix_lookup.py (1 hunks)
- snakemake/output_index.py (1 hunks)
- snakemake/path_modifier.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake/path_modifier.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.

snakemake/output_index.py
snakemake/common/prefix_lookup.py
🧠 Learnings (1)
snakemake/output_index.py (1)
Learnt from: leoschwarz
PR: snakemake/snakemake#3176
File: snakemake/output_index.py:20-28
Timestamp: 2024-11-12T11:27:07.572Z
Learning: In `snakemake/output_index.py`, the current matching logic in `OutputIndex.match` with binary search and backward iteration is acceptable, and suggestions to replace it with a simple iteration over all entries are not necessary.
🪛 Ruff (0.8.2)
snakemake/common/prefix_lookup.py
52-52: Loop control variable i not used within loop body
(B007)
🔇 Additional comments (6)
snakemake/output_index.py (3)
19-26: LGTM! Well-structured initialization logic. The constructor efficiently processes rules and their products to create entries for PrefixLookup, with proper type annotations and clear data organization.
28-38: LGTM! Efficient and well-documented matching logic. The function effectively combines prefix matching from PrefixLookup with suffix validation, and includes clear documentation about the need for additional verification using `Rule.is_producer`.
40-42: LGTM! Clean producer-specific matching implementation. The function elegantly extends the base matching functionality by filtering for actual producers.
snakemake/common/prefix_lookup.py (3)
11-15: LGTM! Efficient initialization with proper grouping and sorting. The constructor effectively organizes entries for optimal lookup performance.
17-19: LGTM! Clean and concise implementation. The function provides a convenient set-based interface to the underlying iterator.
21-44: LGTM! Robust implementation with efficient search strategy. The function effectively implements prefix matching using binary search, with proper handling of edge cases and clear documentation.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
snakemake/common/prefix_lookup.py (3)
17-19: Consider clarifying that duplicates are removed in the return value. The method returns a set, which automatically removes duplicates. This might be worth mentioning in the docstring for clarity.

- def match(self, query: str) -> set[V]:
-     """Returns a set of all entry values which are attached to prefixes of the given key."""
+ def match(self, query: str) -> set[V]:
+     """Returns a set of all entry values which are attached to prefixes of the given key.
+
+     Note: Duplicate values are automatically removed as this returns a set.
+     """
21-28: Docstring should clarify what's actually yielded. There's a discrepancy between the docstring and the implementation. The docstring states that it "Yields entries as (key, value) tuples" but the implementation in line 38 yields `(k, entry)` pairs.

- """Yields all entries which are prefixes of the given key.
-
- E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered
- valid prefixes.
-
- Yields entries as (key, value) tuples as supplied to __init__()
+ """Yields all entries which are prefixes of the given key.
+
+ E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered
+ valid prefixes.
+
+ Yields tuples of (prefix, value) where prefix is the matching prefix string
+ and value is the associated value from the original entries.
46-56: Handle edge cases in the _common_prefix method. The current implementation doesn't handle the case where strings are identical, despite the docstring mentioning it should return the string minus one letter. Also, it doesn't explicitly handle empty strings.

def _common_prefix(self, s1: str, s2: str) -> str:
    """Utility function that gives the common prefix of two strings.

    Except, if the strings are identical, returns the string minus one letter.
    Behaviour is undefined if either string is empty.
    """
+   # Handle identical strings case
+   if s1 == s2:
+       return s1[:-1] if s1 else ""
+
    for i, (l1, l2) in enumerate(zip(s1, s2)):
        if l1 != l2:
            break
    return s1[:i]

The static analysis warning about the loop variable `i` not being used within the loop body can be ignored, as it's correctly used after the loop.

🧰 Tools
🪛 Ruff (0.8.2)
52-52: Loop control variable `i` not used within loop body (B007)
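The suggested special case can also be expressed as a small standalone helper. This is a hedged sketch of the documented contract (identical strings return the string minus its last letter); the free-function form and name are illustrative, not the exact method on the class:

```python
def common_prefix(s1, s2):
    """Common prefix of two strings; for identical strings, return the
    string minus its last letter (the helper's documented contract).

    Unlike the original, this sketch also behaves sanely for empty
    strings: ""[:-1] is "" and an empty zip() yields no pairs.
    """
    if s1 == s2:
        return s1[:-1]
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]
```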
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/common/prefix_lookup.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
snakemake/common/prefix_lookup.py
🪛 Ruff (0.8.2)
snakemake/common/prefix_lookup.py
52-52: Loop control variable i not used within loop body
(B007)
🔇 Additional comments (3)
snakemake/common/prefix_lookup.py (3)
7-9: Good implementation of a utility class for prefix matching. This `PrefixLookup` class provides a clean replacement for the datrie dependency mentioned in the PR objectives, with appropriate documentation of its purpose.
11-15: Well-designed initialization method. The implementation efficiently groups and sorts entries for later binary search operations. Using `defaultdict` here simplifies the code and makes it more maintainable.
29-44: Well-implemented binary search approach for prefix matching. The implementation efficiently uses binary search via the `bisect` module to find matching prefixes. The early break condition prevents infinite loops when dealing with empty strings or when common prefixes are exhausted.
johanneskoester
left a comment
There was a problem hiding this comment.
Thanks a lot, great work! @tbooth and @leoschwarz can you say something about the performance? Is this as fast as datrie for all realistic scenarios in practice?
|
I think @leoschwarz may have some timing numbers on performance with some sample DAGs. Profiling the code vs. datrie with cProfile it does come off a little slower, but we're talking 100ms for 10000 searches on a million-entry list. I can't imagine this would ever be noticeable in real usage. More importantly, I will reiterate that the search algorithm here is a binary search (with bisect) which is performed at most n times, where n is the number of valid prefixes being returned for that search term. Therefore there are no scaling issues, and no awkward corner cases where the performance will unexpectedly degrade. |
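To make the scaling claim reproducible, here is a hedged micro-benchmark sketch on synthetic keys (the data and sizes are illustrative and will not reproduce the exact numbers quoted above). Each lookup performs one `bisect` per step and the search target shrinks strictly, so runtime grows with the number of matches rather than the number of stored keys:

```python
import random
import string
import timeit
from bisect import bisect_right

# Synthetic keys only -- illustrative, not the workflow data behind the
# numbers quoted in this thread.
random.seed(0)
keys = sorted(
    {
        "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 10)))
        for _ in range(100_000)
    }
)


def _common_prefix(s1, s2):
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]


def count_prefix_matches(query):
    """Count stored keys that are prefixes of query: one bisect per step."""
    matches = 0
    target = query  # invariant: target is always a prefix of query
    while True:
        i = bisect_right(keys, target)
        if i == 0:
            return matches
        key = keys[i - 1]
        if query.startswith(key):
            matches += 1
            if not key:
                return matches
            target = key[:-1]  # next match must be strictly shorter
        else:
            # Jump to the common prefix; nothing in between can match.
            target = _common_prefix(key, target)


queries = ["".join(random.choices(string.ascii_lowercase, k=15)) for _ in range(1_000)]
elapsed = timeit.timeit(lambda: [count_prefix_matches(q) for q in queries], number=1)
print(f"{len(queries)} lookups over {len(keys):,} keys took {elapsed * 1000:.1f} ms")
```

The naive alternative, scanning all stored keys per query, degrades linearly with the key count, which is exactly the regime where the earlier backward-scan prototype struggled.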
|
Time to merge then! We can always further optimize this at some later point. For now, this sounds good given that datrie is a bigger issue for obvious reasons. |
|
Thanks a lot @tbooth and @leoschwarz! This is really some great work! |
|
Hi, thank you for reviewing and merging @johanneskoester ! And thank you @tbooth for fixing and optimizing my original implementation in many ways, this was quite cool. As for what I've benchmarked, it is linked in #2970 with code in https://github.com/leoschwarz/benchmark_stuff/tree/main/2024/202411_snakemake_trie but comes with some caveats:
I think it would be more useful to have a suite of snakemake workflows for which the performance evolution can be monitored continually, if this is desired. But it would need a bit more thought than what I quickly put together in that benchmark. |
|
Hi @johanneskoester, I just noticed that this didn't show up in the changelog for |
…in-xrootd to 0.4.1 for Python >=3.11 - Update Snakemake to 9.1.0 for Python >=3.11 to remove the unmaintained datrie dependency, fixing installation issues with modern GCC. - Update snakemake-storage-plugin-xrootd to 0.4.1 for compatibility with Snakemake 9.x. - Keep Snakemake 7.32.4 and pulp for Python <3.11 for backward compatibility. - All tests pass locally (`186 passed, 4 xfailed`), matching CI results. References: - pytries/datrie#101 - pytries/datrie#99 - snakemake/snakemake#3176 - snakemake/snakemake#3105 - snakemake/snakemake#2970 - https://github.com/snakemake/snakemake/releases/tag/v8.29.1 Closes reanahub#494
This PR proposes to replace the datrie dependency, which is causing installation problems now (due to native compilation failing on some newer platforms). The logic which used tries before has been extracted into a class `PrefixLookup` which allows storing (string) key and (arbitrary) value pairs allowing multiple values for the same key. It provides functionality to obtain all values for which their corresponding key is a prefix of the specified query. This lookup is performed in `O(m log(n))` (thanks again @tbooth) where m is the number of matching results, and n is the length of the entries list being searched. When constructing the `PrefixLookup` index initially the cost depends on the number of unique keys, as they will be sorted. The PR also adds some unit tests for this functionality. Fixes snakemake#3105 and Fixes snakemake#2970

### QC
* [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Introduced a new `PrefixLookup` class for efficient prefix matching.
- **Bug Fixes**
  - Removed the `datrie` dependency across various configuration files, simplifying installation requirements.
- **Documentation**
  - Updated the FAQ to remove redundant phrases related to compiler errors.
- **Tests**
  - Introduced comprehensive unit tests for the `OutputIndex` class, covering various matching scenarios and edge cases.
  - Added unit tests for the new `PrefixLookup` class to validate its functionality.
  - Updated tests to ensure robustness in matching functionalities under various conditions.

Co-authored-by: Tim Booth <tim.booth@ed.ac.uk>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: Johannes Koester <johannes.koester@uni-due.de>



This PR proposes to replace the datrie dependency, which is causing installation problems now (due to native compilation failing on some newer platforms).
The logic which used tries before has been extracted into a class `PrefixLookup`, which allows storing (string) key and (arbitrary) value pairs, allowing multiple values for the same key. It provides functionality to obtain all values for which their corresponding key is a prefix of the specified query.

This lookup is performed in `O(m log(n))` (thanks again @tbooth), where m is the number of matching results and n is the length of the entries list being searched. When constructing the `PrefixLookup` index initially, the cost depends on the number of unique keys, as they will be sorted.

The PR also adds some unit tests for this functionality.
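To make the described contract concrete, here is a usage sketch against a deliberately naive linear-scan reference with the same `match()` semantics. The rule names are hypothetical, and the merged class achieves the stated complexity with a sorted list and `bisect` rather than the scan shown here:

```python
class NaivePrefixLookup:
    """Reference semantics only: O(n) linear scan per query.
    Stores (key, value) pairs, allowing multiple values per key."""

    def __init__(self, entries):
        self._entries = list(entries)

    def match(self, query):
        # All values whose key is a prefix of the query.
        return {v for k, v in self._entries if query.startswith(k)}


rules = NaivePrefixLookup(
    [
        ("results/", "rule_all"),
        ("results/plots/", "rule_plot"),
        ("results/plots/", "rule_plot_alt"),  # multiple values for one key
        ("logs/", "rule_log"),
    ]
)
print(sorted(rules.match("results/plots/fig1.svg")))
# → ['rule_all', 'rule_plot', 'rule_plot_alt']
```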
Fixes #3105 and
Fixes #2970
QC
- [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
- [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new `PrefixLookup` class for efficient prefix matching.

Bug Fixes
- Removed the `datrie` dependency across various configuration files, simplifying installation requirements.

Documentation
- Updated the FAQ to remove redundant phrases related to compiler errors.

Tests
- Introduced comprehensive unit tests for the `OutputIndex` class, covering various matching scenarios and edge cases.
- Added unit tests for the new `PrefixLookup` class to validate its functionality.
- Updated tests to ensure robustness in matching functionalities under various conditions.