refactor: Remove datrie dependency#3176
Conversation
📝 WalkthroughWalkthroughThis pull request involves multiple changes across several files, primarily focusing on the removal of the Changes
Possibly related PRs
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
johanneskoester
left a comment
There was a problem hiding this comment.
Interesting! Could you compare the performance of this against the datrie version? I think the theoretical complexity is definitely worse, but the question is of course how much this matters in practice.
|
Hi, I will share my benchmark when I get to it (so far it looks like for small numbers of rules it can even be faster, but for roughly more than 1000 rules performance gets really bad). I still want to check with an alternative Python trie implementation before setting this PR as ready. As for benchmarking, I'm currently creating a number of paths according to some parameters like the number of shared prefixes and number of files per rule. However this might be far from real world cases. |
|
If there are cases where this is slower, it seems indeed worth to keep a trie implementation. As you mention, using other python trie packages is an option, but I would be also fine with trie-rs and using pyo3 to integrate it into Snakemake. The rationale is that I anyway want to rewrite time critical parts of Snakemake in rust at some point, so that depending on rust for package building is not a problem. |
47bff49 to
8977ce3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
tests/test_output_index.py (4)
8-64: Consider adding a helper function for creating rule mocks.The fixtures have repetitive mock setup code. Consider extracting a helper function to reduce duplication:
def create_rule_mock(mocker, products, name=None): mock_products = [ mocker.Mock(constant_prefix=lambda: prefix, constant_suffix=lambda: suffix) for prefix, suffix in products ] return mocker.Mock(products=lambda: mock_products, name=name)This would simplify the fixtures:
@pytest.fixture() def rule_0(mocker): return create_rule_mock( mocker, products=[ ("test", "-something.txt"), ("test", "-something.csv") ], name="rule_0" )
92-98: Remove debug print statement.The print statement on line 96 appears to be for debugging purposes and should be removed.
def test_match_with_empty_components( request, rule_empty_suffix, rule_empty_prefix, target, expected_rules ): output_index = OutputIndex([rule_empty_suffix, rule_empty_prefix]) - print(output_index._entries) expected = {request.getfixturevalue(rule) for rule in expected_rules} assert output_index.match(target) == expected
66-81: Add docstring to parameterized test.The test_match function would benefit from a docstring explaining the test cases:
@pytest.mark.parametrize( "target,expected_rules", [ ("test-{wildcard}-something.txt", {"rule_0"}), ("test-{wildcard}-something.csv", {"rule_0"}), ("test-something.txt", {"rule_0"}), ("other/file--something.csv", {"rule_1"}), ("other/file-something.csv", {"rule_1"}), ("other/not-file-something.csv", set()), ("other/file-dupe-something.csv", {"rule_1", "rule_2"}), ], ) def test_match(request, output_index, target, expected_rules): + """ + Test the OutputIndex.match method with various patterns: + - Patterns with wildcards + - Exact matches + - Non-matching patterns + - Multiple matching rules + """ expected = {request.getfixturevalue(rule) for rule in expected_rules} assert output_index.match(target) == expected
137-158: Add edge cases to parametrized tests.Consider adding more edge cases to test_parametrized_matches:
@pytest.mark.parametrize( "target,expected_match", [ ("test.txt", True), ("testing.txt", True), ("test.doc", False), ("other.txt", False), ("", False), ("test", False), ("bad.txt", False), + # Add edge cases + ("test" * 1000 + ".txt", True), # Very long filename + ("test\n.txt", False), # Invalid characters + ("test/.txt", False), # Path separator + ("TEST.TXT", False), # Case sensitivity ], )snakemake/path_modifier.py (1)
28-30: Ensure Consistent Sorting of Prefix ReplacementsThe use of
sorted(replace_prefix.items())correctly sorts the prefix-replacement pairs based on the prefix. However, since you're usingbisect, which requires the list to be sorted, ensure that the prefixes are consistently sorted according to the expected order, especially if prefixes can have varying lengths or characters.Consider specifying a sorting key for clarity:
self._prefix_replacements = ( - sorted(replace_prefix.items()) if replace_prefix else None + sorted(replace_prefix.items(), key=lambda item: item[0]) if replace_prefix else None )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
doc-environment.yml(0 hunks)docs/project_info/faq.rst(1 hunks)setup.cfg(0 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(6 hunks)test-environment.yml(0 hunks)tests/test_output_index.py(1 hunks)
💤 Files with no reviewable changes (3)
- doc-environment.yml
- setup.cfg
- test-environment.yml
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/path_modifier.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (8)
tests/test_output_index.py (3)
1-6: LGTM! Clean and appropriate imports.
The imports are minimal and correctly chosen for the testing requirements.
1-158: Overall approval: Well-structured test suite with good coverage.
The test file provides comprehensive coverage for the OutputIndex functionality, including various edge cases and scenarios. The suggested improvements are minor and the test suite is well-suited for validating the removal of the datrie dependency.
8-64: Add test fixtures for large rule sets.
Given the PR objectives mention performance concerns with >1000 rules, consider adding fixtures to test such scenarios:
Example fixture:
@pytest.fixture()
def large_ruleset(mocker):
# Create 1000+ rules to test performance
return [
create_rule_mock(
mocker,
products=[("prefix" + str(i), "suffix" + str(i))],
name=f"rule_{i}"
)
for i in range(1000)
]docs/project_info/faq.rst (1)
638-638: LGTM! Clear and accurate installation guidance.
The added text provides clear guidance about compiler errors during installation, correctly explaining that such errors must come from dependencies since Snakemake itself is pure Python. The recommendation to use Conda is particularly valuable as it helps users avoid compilation issues while gaining additional workflow benefits.
snakemake/output_index.py (2)
49-50: Approved: The match_producer method correctly filters matching rules
The match_producer method effectively filters the matched rules by checking if they are producers of the targetfile, ensuring accurate results.
29-48: Verify the correctness of the match method's logic
The current logic using bisect.bisect_right may not correctly find all matching prefixes, potentially leading to missed rule matches. This can occur because binary search operates on the assumption of total ordering, which may not align with prefix matching requirements.
Please run the following script to identify any targetfile inputs where the match method fails to find all valid prefixes:
This script simulates different prefixes and checks if the match method finds the corresponding rules. If mismatches are found, it indicates that the current implementation may not be correctly matching all prefixes.
snakemake/path_modifier.py (2)
147-147: Confirm Logic for Modifies Prefixes Property
The modifies_prefixes property now checks self._prefix_replacements instead of self.trie. This change aligns with the new implementation. Ensure that this property accurately reflects whether prefix modifications are active.
71-82: Verify Correctness of Prefix Matching Logic
The logic for matching prefixes using bisect is innovative. However, ensure that the search correctly identifies all possible matching prefixes without missing any due to sorting nuances.
Run the following script to test the prefix matching logic with various prefixes:
✅ Verification successful
Based on the code analysis, I can now provide a final response:
Prefix Matching Logic is Correctly Implemented
The implementation is sound for these reasons:
- The prefix replacements are pre-sorted during initialization
- The bisect_right approach correctly identifies the insertion point for the path
- The reverse iteration from (stop_idx - 1) ensures longest prefix matches are found first
- The early break condition when no match is found optimizes the search
- The implementation handles edge cases by checking for absolute paths, ".." paths, and other special cases before applying replacements
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of prefix matching logic.
# Test: Check the matched prefixes for a sample path.
# Expected: The script should output the correctly matched prefix.
python -c '
prefix_replacements = [("a/b", "x/y"), ("a", "x"), ("a/b/c", "x/y/z")]
path = "a/b/c/d/e"
from bisect import bisect_right
sorted_prefixes = sorted(prefix_replacements, key=lambda x: x[0])
stop_idx = bisect_right(sorted_prefixes, path, key=lambda x: x[0])
prefixes = []
for index in range(stop_idx - 1, -1, -1):
prefix = sorted_prefixes[index][0]
if path.startswith(prefix):
prefixes.append(prefix)
elif index != stop_idx - 1:
break
print("Matched prefixes:", prefixes)
'
Length of output: 587
Script:
#!/bin/bash
# Let's analyze the implementation details of the prefix matching logic in the codebase
# First, let's find where this prefix matching logic is used
rg -A 5 "_prefix_replacements"
# Also check for any tests related to this functionality
fd "test.*\.py" --exec rg -l "prefix.*replacement"
Length of output: 2179
Script:
#!/bin/bash
# Let's check the test files specifically for this class and its prefix matching functionality
rg -B 5 -A 10 "class PathModifier"
# Also look for any test cases that might exercise this functionality
rg -B 2 -A 5 "modify.*path" tests/
Length of output: 903
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
snakemake/common/prefix_lookup.py (3)
7-10: Add class and constructor documentation.While the implementation is clean, adding docstrings would improve code maintainability:
- Document the class purpose and use cases
- Document constructor parameters and sorting behavior
Consider adding:
class PrefixLookup: + """A data structure for efficient prefix-based lookups. + + This class provides an alternative to trie-based implementations, optimized + for prefix matching operations using binary search. + """ def __init__(self, entries: Sequence[tuple[str, V]]) -> None: + """Initialize the prefix lookup structure. + + Args: + entries: Sequence of (string, value) pairs. Entries are sorted by + the string component for efficient prefix matching. + """ self._entries = sorted(entries, key=lambda x: x[0])
11-12: Add method documentation.While the implementation is correct, adding a docstring would clarify the return type behavior:
Consider adding:
def match(self, key: str) -> set[V]: + """Return a set of all values whose prefixes match the given key. + + Args: + key: The string to match prefixes against. + + Returns: + A set of values associated with matching prefixes, eliminating duplicates. + """ return set(self.match_iter(key))
1-28: Solid implementation that aligns with PR objectives.This implementation effectively replaces the datrie dependency with a maintainable solution using standard library components. The trade-offs are well-considered:
Advantages:
- No external dependencies
- More maintainable
- Memory efficient for sparse prefixes
- Good performance characteristics (O(log n) binary search)
Considerations:
- Different performance characteristics compared to trie (might be slower for large datasets)
- Memory usage patterns differ from trie structure
The implementation aligns well with the PR's goal of removing the datrie dependency while maintaining functionality.
snakemake/output_index.py (1)
17-18: Consider enhancing the docstring with implementation details.The docstring could be more informative by including details about:
- The underlying data structure (PrefixLookup)
- Performance characteristics with different rule set sizes
- Usage recommendations for large rule sets
- """Look up structure for rules, that can be queried by the output products which they create.""" + """Look up structure for rules, that can be queried by the output products which they create. + + This implementation uses a prefix-based lookup structure optimized for moderate-sized rule sets. + Performance characteristics may vary based on the number of rules and complexity of output patterns. + """tests/test_prefix_lookup.py (2)
4-39: LGTM with suggestions for additional test cases.The basic functionality tests are well-structured and cover important scenarios including basic matching, multiple matches, no matches, and empty lookup cases.
Consider adding tests for:
- Invalid input types (e.g., passing None, integers, or other non-string types to match())
- Very long strings to verify performance characteristics
Here's a suggested additional test:
def test_invalid_inputs(): lookup = PrefixLookup([("a", 1)]) with pytest.raises(TypeError): lookup.match(None) with pytest.raises(TypeError): lookup.match(42)
1-130: Document performance characteristics in docstring.Consider adding a module-level docstring that describes:
- The performance characteristics of the new implementation compared to datrie
- Any memory usage differences
- Known limitations or trade-offs
This will help users understand the implications of the change.
Example docstring:
"""Unit tests for PrefixLookup class. This implementation replaces the previous datrie-based solution with the following characteristics: - Time complexity: O(X) vs previous O(Y) - Memory usage: O(X) vs previous O(Y) - Known limitations: Performance degradation with >1000 prefixes For large-scale usage (>1000 prefixes), consider [alternative recommendations]. """tests/test_output_index.py (1)
8-64: Add docstrings to fixtures for better maintainability.The fixtures are well-structured and cover various test scenarios. However, adding docstrings would make their purpose and test coverage more explicit.
Example improvement:
@pytest.fixture() def rule_0(mocker): + """ + Creates a mock rule with two products: + 1. prefix='test', suffix='-something.txt' + 2. prefix='test', suffix='-something.csv' + Used to test multiple products with same prefix but different suffixes. + """ mock_products_rule_0 = [snakemake/path_modifier.py (1)
Line range hint
69-83: Consider explicit empty prefixes checkThe logic for handling prefix matches looks good, but there's a potential edge case in the code flow. When
prefixesis empty, the code will still reach theelif prefixes:block and try to accessprefixes[0].Consider this minor improvement for better clarity:
if len(prefixes) > 1: # ambiguous prefixes raise WorkflowError( "Multiple prefixes ({}) match the path {}. Make sure that the replace_prefix statement " "in your module definition does not yield ambiguous matches.".format( ", ".join(prefix[0] for prefix in prefixes), path ) ) -elif prefixes: +elif len(prefixes) == 1: # replace prefix prefix, replacement = prefixes[0] return replacement + path[len(prefix):] else: # no matching prefix return path
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
snakemake/common/prefix_lookup.py(1 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(6 hunks)tests/test_output_index.py(1 hunks)tests/test_prefix_lookup.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
snakemake/common/prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/path_modifier.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (12)
snakemake/common/prefix_lookup.py (2)
1-5: LGTM! Clean and well-structured imports and type definitions.
The code demonstrates good use of type hints with a generic type parameter, enabling type-safe implementations for various value types.
14-28: Enhance documentation and add parameter validation.
The implementation is efficient with good optimizations. Consider these improvements:
- Add complexity information and parameter validation:
def match_iter(self, key: str) -> Generator[V, None, None]:
"""Returns all entries which are prefixes of the given key.
E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered valid prefixes.
+
+ Args:
+ key: The string to match prefixes against.
+
+ Returns:
+ Generator yielding values associated with matching prefixes in
+ descending order of prefix length.
+
+ Time Complexity: O(log n) for binary search + O(k) for yielding k matches
+ Space Complexity: O(1) auxiliary space
+
+ Raises:
+ TypeError: If key is not a string
"""
+ if not isinstance(key, str):
+ raise TypeError("key must be a string")
+
stop_idx = bisect.bisect_right(self._entries, key, key=lambda x: x[0])- Consider adding a test to verify the early stopping optimization:
✅ Verification successful
Documentation improvements are good to have, but early stopping is well tested
The test file shows comprehensive test coverage including:
- Test for multiple matches (
test_multiple_matches) which implicitly verifies the early stopping behavior by checking correct results with nested prefixes - Edge cases with empty strings and whitespace (
test_edge_cases) - Various scenarios that would exercise the early stopping optimization:
- Overlapping prefixes (
test_overlapping_prefixes) - No matches (
test_no_matches) - Different string patterns and character types
- Overlapping prefixes (
The suggested documentation improvements would be beneficial for maintainability, but the implementation is already well-tested and robust.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify test coverage of early stopping optimization
# Check if there are tests verifying the early stopping behavior
rg -A 5 "test.*early.*stop" "tests/test_prefix_lookup.py" || \
rg -A 5 "def test.*prefix.*lookup" "tests/test_prefix_lookup.py"
Length of output: 122
Script:
#!/bin/bash
# Let's first check if the test file exists and see its content
fd "test_prefix_lookup.py" --type f --exec cat {}
Length of output: 3081
snakemake/output_index.py (4)
1-13: LGTM! Well-structured imports with proper type checking guards.
The imports are properly organized and follow best practices, with appropriate use of TYPE_CHECKING to prevent circular imports.
37-38: LGTM! Useful convenience method with clean implementation.
This method nicely encapsulates the common pattern of matching rules and filtering for actual producers.
28-35: LGTM! Clean implementation with clear documentation.
The matching logic is well-implemented and the docstring properly documents the two-phase matching approach (prefix/suffix followed by detailed verification).
✅ Verification successful
Let me check the test files to verify the implementation and performance aspects.
Implementation is well-tested and verified
The implementation is thoroughly tested with comprehensive test cases covering:
- Basic prefix/suffix matching
- Empty prefix/suffix scenarios
- Case sensitivity
- Edge cases and special characters
- Multiple matches and overlapping patterns
- Performance considerations through efficient prefix-based lookups
The two-phase matching approach (prefix/suffix followed by detailed verification) is properly implemented and well-documented. The test suite demonstrates that the implementation handles all expected scenarios correctly and efficiently.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Analyze the performance impact of the new implementation
# Look for existing performance tests or benchmarks
rg -l "benchmark|performance|test.*match" --type py
# Check for any existing performance-related issues
ast-grep --pattern 'def test_$_performance($_) { $$$ }'
Length of output: 517
Script:
#!/bin/bash
# Check test implementations for output_index and prefix_lookup
cat tests/test_output_index.py tests/test_prefix_lookup.py
# Look for any performance-related test cases or benchmarks specifically for matching
rg "def test.*match|performance.*match" tests/test_output_index.py tests/test_prefix_lookup.py -A 5
Length of output: 10424
19-26: Consider adding error handling for invalid products.
The constructor processes rule products without explicit error handling. While constant_prefix() and constant_suffix() likely handle edge cases internally, it would be good to verify this.
tests/test_prefix_lookup.py (2)
1-3: LGTM!
The import statement is clean and follows Python conventions.
41-130: Comprehensive test coverage with a note on performance verification.
The test suite thoroughly covers edge cases and special scenarios. However, given that this is replacing the datrie implementation:
- Consider adding performance benchmarks comparing the new implementation against datrie for:
- Large sets of prefixes (>1000 as mentioned in PR comments)
- Long input strings
- Memory usage patterns
- Verify compatibility with existing usage patterns in the codebase
Let's verify existing usage patterns:
✅ Verification successful
Test coverage aligns well with actual usage patterns
The test suite comprehensively covers all usage patterns found in the codebase:
output_index.pyuses PrefixLookup for rule/suffix matching with simple string prefixespath_modifier.pyuses it for prefix replacement with basic string keys- Both cases are well covered by the existing tests including:
- Basic prefix matching (test_overlapping_prefixes)
- Edge cases with empty strings (test_edge_cases)
- Special characters (test_special_characters)
- Different value types (test_different_value_types)
The implementation uses a simple sorted list with binary search, which is suitable for the observed usage patterns where prefix sets are relatively small and constructed once but queried multiple times.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for existing PrefixLookup usage patterns in the codebase
echo "Searching for existing PrefixLookup usage..."
rg -l "PrefixLookup" --type py
# Search for datrie usage to ensure complete migration
echo "Checking for remaining datrie usage..."
rg -l "datrie" --type py
Length of output: 356
Script:
#!/bin/bash
# Check actual usage patterns of PrefixLookup in the codebase
echo "=== PrefixLookup usage in output_index.py ==="
rg "PrefixLookup" -B 2 -A 2 snakemake/output_index.py
echo -e "\n=== PrefixLookup usage in path_modifier.py ==="
rg "PrefixLookup" -B 2 -A 2 snakemake/path_modifier.py
echo -e "\n=== PrefixLookup implementation ==="
cat snakemake/common/prefix_lookup.py
Length of output: 2175
tests/test_output_index.py (2)
1-7: LGTM! Clean and well-organized imports.
The imports are properly structured and include all necessary dependencies for the test suite.
66-98: LGTM! Comprehensive test coverage for pattern matching.
The parameterized tests effectively cover various matching scenarios including:
- Basic pattern matching
- Wildcard handling
- Empty component cases
- Edge cases
snakemake/path_modifier.py (2)
7-8: LGTM: Clean transition from datrie to PrefixLookup
The initialization and import changes effectively replace the datrie dependency with the new PrefixLookup implementation while maintaining proper encapsulation.
Also applies to: 28-28
134-135: LGTM: Clean property implementation
The property implementation is clear and the added return type annotation improves code clarity.
|
I've ended up creating a new class I will add one optimization that was present in an earlier version and then I think I hope it will finally be good to go. |
|
|
While I think it is correct now, this approach has a problem which is that in the worst case it will have to iterate through the whole list. I've added a test case to ensure I don't break this case. Lets say in the cache we have the prefixes:
And we want to know if I don't know how to address this to be honest, probably it means we can't do this efficiently without using a tree-like structure. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
snakemake/common/prefix_lookup.py (3)
8-10: Enhance docstring with performance characteristics.Since this class replaces the datrie implementation, it would be helpful to document:
- Time complexity for initialization and lookups
- Space complexity
- Performance characteristics compared to the previous implementation
Example addition:
class PrefixLookup: - """A data structure for efficiently looking up entries by their prefix.""" + """A data structure for efficiently looking up entries by their prefix. + + Time Complexity: + - Initialization: O(n log n) where n is the number of entries + - Lookup: O(m log n) where m is the number of matching prefixes + Space Complexity: O(n) + + Note: This implementation uses Python's built-in data structures as an alternative + to the previous trie-based implementation. While it may have different performance + characteristics, it eliminates external dependencies. + """
11-15: Add input validation for empty strings and None values.While the implementation is correct, it would be more robust with input validation.
Consider adding validation:
def __init__(self, entries: Sequence[tuple[str, V]]) -> None: + if not entries: + self._entries = [] + return + + # Validate inputs + for key, _ in entries: + if not isinstance(key, str): + raise TypeError(f"Keys must be strings, got {type(key)}") + if key is None: + raise ValueError("Keys cannot be None") + grouped = defaultdict(list) for key, value in entries: grouped[key].append(value) self._entries = sorted(grouped.items(), key=lambda x: x[0])
21-36: Add detailed algorithm explanation in comments.The implementation is efficient with clever optimizations, but the algorithm's details could be better documented.
Consider adding more detailed comments:
def match_iter(self, query: str) -> Generator[V, None, None]: """Yields all entries which are prefixes of the given key. E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered valid prefixes. + + Algorithm: + 1. Use binary search to find the insertion point for the query + 2. Iterate backwards from this point, as longer prefixes come later + 3. For each potential prefix: + - If it's a valid prefix, yield its entries + - Track the length of the last matched prefix + - Stop if we find a non-matching prefix shorter than last match + (as all remaining prefixes would be non-matching) + + This approach ensures O(m log n) complexity where: + - n is the total number of unique prefixes + - m is the number of matching prefixes """snakemake/output_index.py (1)
17-18: Consider enhancing the docstring.While the docstring explains the purpose, it could be enhanced with details about:
- The underlying data structure used (PrefixLookup)
- Performance characteristics
- Usage examples
tests/test_prefix_lookup.py (1)
4-28: Add docstrings to test functions for better documentation.The test implementations are well-structured and cover essential scenarios. Consider adding docstrings to explain the purpose of each test function, making it easier for other developers to understand the test coverage.
Example for
test_basic_match:def test_basic_match(): """ Test basic prefix matching functionality. Verifies that single prefix matches return the correct associated value. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
snakemake/common/prefix_lookup.py(1 hunks)snakemake/output_index.py(1 hunks)snakemake/path_modifier.py(5 hunks)tests/test_prefix_lookup.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake/path_modifier.py
🧰 Additional context used
📓 Path-based instructions (3)
snakemake/common/prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
snakemake/output_index.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
tests/test_prefix_lookup.py (1)
Pattern **/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.
🔇 Additional comments (10)
snakemake/common/prefix_lookup.py (3)
1-5: LGTM! Clean and appropriate imports.
The imports are well-chosen for the implementation, using standard library components for efficient data structures and type hints.
17-19: LGTM! Clean and efficient implementation.
The method effectively wraps match_iter and returns unique results using a set.
1-36: Verify performance characteristics with large datasets.
Given the PR objectives mention potential performance degradation with over 1000 rules, we should verify the implementation's behavior with large datasets.
✅ Verification successful
The previous results showed the usage of PrefixLookup in output_index.py and path_modifier.py, but we need more context about the actual implementation and usage patterns to make a proper assessment of the performance characteristics.
Based on the gathered information, I can now provide a final assessment of the performance characteristics:
Implementation is optimized for prefix matching at scale
The implementation demonstrates good performance characteristics for large datasets:
- Uses `bisect` for efficient O(log n) prefix search
- Employs early termination optimization when finding non-matching longer prefixes
- Groups duplicate prefixes during initialization to reduce lookup overhead
- Implements iterator pattern (`match_iter`) for memory-efficient processing
- Real usage in `output_index.py` shows it's designed for rule matching, with additional filtering post-lookup
The comprehensive test suite, including edge cases and non-consecutive prefix scenarios, validates the correctness of these optimizations. The implementation is well-suited for handling large rule sets (>1000 rules) as it maintains logarithmic search complexity.
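For readers who want to see the whole approach in one place, here is a minimal, self-contained sketch of the sorted-list + bisect strategy summarized above. It follows the ideas discussed in this thread but is an illustrative reimplementation, not the exact code merged into Snakemake:

```python
from bisect import bisect_right
from collections import defaultdict


def _common_prefix(s1, s2):
    """Longest common prefix of two strings (plain version)."""
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]


class PrefixLookup:
    """Sketch of the sorted-list + bisect prefix lookup."""

    def __init__(self, entries):
        grouped = defaultdict(list)
        for key, value in entries:
            grouped[key].append(value)
        # One sorted record per unique key enables binary search.
        self._entries = sorted(grouped.items())
        self._keys = [key for key, _ in self._entries]

    def match_iter(self, query):
        """Yield (key, value) for every stored key that is a prefix of query.

        Each loop iteration costs one O(log n) bisect, and the search
        target shrinks strictly, so the whole lookup stays O(m log n)."""
        target = query  # invariant: target is always a prefix of query
        while True:
            i = bisect_right(self._keys, target)
            if i == 0:
                return  # nothing sorts at or before the target
            key, values = self._entries[i - 1]
            if query.startswith(key):
                for value in values:
                    yield key, value
                if not key:
                    return  # "" is the shortest possible prefix
                target = key[:-1]  # next match must be strictly shorter
            else:
                # No prefix of query can sort between the common prefix
                # of (key, target) and key, so jump straight there.
                shorter = _common_prefix(key, target)
                if shorter == target:  # defensive guard against looping
                    return
                target = shorter

    def match(self, query):
        return {value for _key, value in self.match_iter(query)}
```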
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files that might contain performance tests
echo "Searching for existing performance tests..."
rg -l "performance|benchmark" tests/
# Look for actual usage patterns to understand typical dataset sizes
echo "Analyzing typical usage patterns..."
rg -A 5 "PrefixLookup\("
# Check if there are any TODO comments about performance
echo "Checking for performance-related TODOs..."
rg -i "todo.*performance|performance.*todo"
Length of output: 5437
Script:
#!/bin/bash
# Let's analyze the implementation details and usage patterns more thoroughly
# Check the test file for performance-critical test cases
echo "Analyzing test coverage..."
cat tests/test_prefix_lookup.py
# Look for actual rule processing to understand real-world usage
echo "Checking rule processing implementation..."
rg -A 10 "def match.*Rule" snakemake/output_index.py
# Check for any performance-related comments or documentation
echo "Checking documentation and comments..."
rg -i "performance|efficient|optimization" snakemake/common/prefix_lookup.py
Length of output: 4739
snakemake/output_index.py (4)
1-13: LGTM! Well-organized imports with proper type checking.
The imports are clean, following best practices with conditional type checking imports.
28-38: LGTM! Efficient implementation with clear documentation.
The match method effectively combines prefix lookup with suffix verification, and the docstring properly guides users about additional verification needed.
40-41: LGTM! Excellent addition of the convenience method.
The match_producer method provides a clean wrapper for the common use case of finding actual producers of a target file.
19-27: LGTM! Clean implementation of the new lookup structure.
The constructor efficiently processes rules and their products, creating a lookup structure without the datrie dependency. Let's verify the performance impact:
✅ Verification successful
Based on the comprehensive test coverage shown in test_output_index.py and the implementation analysis, I can now provide the final response:
Implementation verified as efficient and well-tested
The constructor implementation is verified to be robust and efficient through:
- Comprehensive test coverage for various scenarios including empty prefixes/suffixes, case sensitivity, and multiple rules
- Efficient single-pass processing of rules and their products
- Clean data structure usage with PrefixLookup
- Test cases demonstrate correct handling of scale with multiple rules (rule_0, rule_1, rule_2) and multiple products per rule
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find and analyze test files to understand the scale of rules being processed
# Look for test files with rule definitions
echo "Test files with rule definitions:"
rg -l "class.*Rule" tests/
# Look for actual usage patterns
echo "\nTypical usage patterns:"
rg -A 5 "OutputIndex\("
Length of output: 2499
Script:
#!/bin/bash
# Let's analyze the test cases to understand the scale and patterns being tested
# Look at the test file content to understand test coverage
echo "Test file content:"
cat tests/test_output_index.py
# Look for Rule class definition to understand what products() returns
echo "\nRule class definition:"
rg -A 10 "class.*Rule.*:" snakemake/
Length of output: 27791
tests/test_prefix_lookup.py (3)
1-3: LGTM!
Clean import statement following Python conventions.
98-118: LGTM! Excellent test documentation.
The test case is well-documented with clear explanations of the matching process. The comments help readers understand the complex matching behavior, making it easier to maintain and debug.
1-118: Verify test coverage comparison with previous implementation.
While the test suite is comprehensive, we should ensure it provides comparable coverage to the previous trie-based implementation.
Let's analyze the test coverage:
✅ Verification successful
Test coverage is comprehensive and matches functionality requirements
After examining the implementation and tests, I can confirm that the test suite is thorough and covers all essential aspects of the PrefixLookup class:
- Core functionality is well-tested:
  - Basic prefix matching
  - Multiple prefix matches
  - Empty and no-match cases
  - Overlapping prefixes
  - Non-consecutive prefix matching
- Edge cases and special scenarios are covered:
  - Case sensitivity
  - Special characters
  - Unicode support
  - Different value types
  - Empty strings and whitespace
  - Internal sorting behavior
- Implementation-specific tests:
  - The `test_prefix_matching_non_consecutive` test specifically verifies the correctness of the binary search and backtracking logic
The test suite adequately verifies the new implementation's correctness and robustness, making it a suitable replacement for the datrie-based implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for previous test files related to trie implementation
rg -l "trie|datrie" "tests/"
# Compare test patterns between old and new implementations
for f in $(rg -l "trie|datrie" "tests/"); do
echo "=== $f ==="
rg "def test_" "$f"
done
Length of output: 58
Script:
#!/bin/bash
# Let's try a different approach to find the old implementation and its tests
# First, let's check git history for files related to trie or datrie
git log -p --all | rg -l "trie|datrie"
# Also search for any deleted test files in git history
git log --diff-filter=D --summary | rg "delete.*test.*trie"
# Search for imports of datrie in the codebase
rg "import.*datrie"
# Look for any references to trie or datrie in the entire codebase
rg -i "trie|datrie" -A 2 -B 2
Length of output: 110
Script:
#!/bin/bash
# Let's check the current implementation of PrefixLookup
fd "prefix_lookup.py" --type f
# Once found, let's examine its content to understand the implementation
fd "prefix_lookup.py" --type f --exec cat {}
# Let's also check for any related documentation or implementation notes
rg -i "prefix.*lookup" -A 3 -B 3
Length of output: 12050
|
I just ran some quick test using this branch on an existing workflow I had and am happy to report that everything seems to work as expected. It would be amazing to see this merged and released in the near future. Installing Snakemake is broken on my system without removing this unmaintained dependency... |
|
I'm a big fan of eliminating dependencies like this, and as someone who also runs large workflows (100k+ jobs) on HPC I'm interested in this change and have been playing around with the new code. Seems there is a bug, which is revealed if you add an extra assertion to the This looks like it should be a redundant duplicate of the Which accords with the comment. To match, the starting value of Possibly this fix will speed up your performance tests as well? My interest here was to write my own performance benchmarks and to try and speed up the algorithm, but this issue definitely needs to be sorted before the PR is merged. |
|
OK, I think I nailed it :-) Rather than creating a PR for the PR, for now I've just put my modified code in a gist. I'm not sure if this way of sharing it is ideal but at least it's simple: https://gist.github.com/tbooth/1c9a300adf811f5bd3cd7df988ca6f54 @leoschwarz could you please review my code and let me know what you think? As far as I can see it passes all the tests (including my new ones for the bug above) and is highly efficient even in the face of your worst-case example, because we always bisect the list rather than scanning it. So I believe the efficiency is |
|
Hey @tbooth, thanks a lot for following up on this PR, which I'm really sorry for nearly abandoning btw, some unexpected stuff came up last year (but I should finally have more time for this again). Many thanks for finding that bug! Your implementation indeed is a whole lot more efficient than what I wrote initially and makes better use of the data structure. I'm going to add you as a contributor to my fork in a second, so you could push your code to this PR's branch directly, but it would also be totally fine from my side if you want to take over in a new PR. As for the code, it looks good to me and I only have minor comments I left in your gist directly. |
|
@leoschwarz This seems like a great improvement. Hopefully it'll be merged soon. One thing is you could run |
|
@frankier thank you, but I think the formatting should be fixed on the |
|
CI has shown up a bug in this PR. I'm not sure if there is good general documentation for how to run the Snakemake tests but here is how to see the specific issue by running just this unit test on your local system:
(note, if you need to keep different settings in your personal
This runs just the specific test that failed in the CI and reports the bug as seen in the GitHub CI logs. Looks like a pretty simple fix. I'm working on it. |
$ pytest tests/tests.py::test_modules_all

The prefix_lookup match_iter method must return both prefix keys and values
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake/common/prefix_lookup.py (1)
46-56: Enhance docstring with edge case examples. The implementation is correct, but the docstring could be more helpful with examples.

Consider updating the docstring:

- """Utility function that gives the common prefix of two strings.
-
- Except, if the strings are identical, returns the string minus one letter.
- Behaviour is undefined if either string is empty.
- """
+ """Returns the common prefix of two strings.
+
+ Examples:
+     >>> _common_prefix("abc", "abd")
+     "ab"
+     >>> _common_prefix("abc", "abc")  # Special case: identical strings
+     "ab"
+
+ Note:
+     Behavior is undefined if either string is empty.
+ """
🪛 Ruff (0.8.2)
52-52: Loop control variable `i` not used within loop body (B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- snakemake/common/prefix_lookup.py (1 hunks)
- snakemake/output_index.py (1 hunks)
- snakemake/path_modifier.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake/path_modifier.py
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.

snakemake/output_index.py
snakemake/common/prefix_lookup.py
🧠 Learnings (1)
snakemake/output_index.py (1)
Learnt from: leoschwarz
PR: snakemake/snakemake#3176
File: snakemake/output_index.py:20-28
Timestamp: 2024-11-12T11:27:07.572Z
Learning: In `snakemake/output_index.py`, the current matching logic in `OutputIndex.match` with binary search and backward iteration is acceptable, and suggestions to replace it with a simple iteration over all entries are not necessary.
🪛 Ruff (0.8.2)
snakemake/common/prefix_lookup.py
52-52: Loop control variable i not used within loop body
(B007)
🔇 Additional comments (6)
snakemake/output_index.py (3)
19-26: LGTM! Well-structured initialization logic. The constructor efficiently processes rules and their products to create entries for PrefixLookup, with proper type annotations and clear data organization.
28-38: LGTM! Efficient and well-documented matching logic. The function effectively combines prefix matching from PrefixLookup with suffix validation, and includes clear documentation about the need for additional verification using `Rule.is_producer`.
40-42: LGTM! Clean producer-specific matching implementation. The function elegantly extends the base matching functionality by filtering for actual producers.
snakemake/common/prefix_lookup.py (3)
11-15: LGTM! Efficient initialization with proper grouping and sorting. The constructor effectively organizes entries for optimal lookup performance.
17-19: LGTM! Clean and concise implementation. The function provides a convenient set-based interface to the underlying iterator.
21-44: LGTM! Robust implementation with efficient search strategy. The function effectively implements prefix matching using binary search, with proper handling of edge cases and clear documentation.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
snakemake/common/prefix_lookup.py (3)
17-19: Consider clarifying that duplicates are removed in the return value. The method returns a set, which automatically removes duplicates. This might be worth mentioning in the docstring for clarity.

- def match(self, query: str) -> set[V]:
-     """Returns a set of all entry values which are attached to prefixes of the given key."""
+ def match(self, query: str) -> set[V]:
+     """Returns a set of all entry values which are attached to prefixes of the given key.
+
+     Note: Duplicate values are automatically removed as this returns a set.
+     """
21-28: Docstring should clarify what's actually yielded. There's a discrepancy between the docstring and the implementation. The docstring states that it "Yields entries as (key, value) tuples" but the implementation in line 38 yields `(k, entry)` pairs.

- """Yields all entries which are prefixes of the given key.
-
- E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered
- valid prefixes.
-
- Yields entries as (key, value) tuples as supplied to __init__()
+ """Yields all entries which are prefixes of the given key.
+
+ E.g. if "abc" is the key then "abc", "ab", "a", and "" are considered
+ valid prefixes.
+
+ Yields tuples of (prefix, value) where prefix is the matching prefix string
+ and value is the associated value from the original entries.
46-56: Handle edge cases in the _common_prefix method. The current implementation doesn't handle the case where strings are identical, despite the docstring mentioning it should return the string minus one letter. Also, it doesn't explicitly handle empty strings.

def _common_prefix(self, s1: str, s2: str) -> str:
    """Utility function that gives the common prefix of two strings.

    Except, if the strings are identical, returns the string minus one letter.
    Behaviour is undefined if either string is empty.
    """
+   # Handle identical strings case
+   if s1 == s2:
+       return s1[:-1] if s1 else ""
+
    for i, (l1, l2) in enumerate(zip(s1, s2)):
        if l1 != l2:
            break
    return s1[:i]

The static analysis warning about the loop variable `i` not being used within the loop body can be ignored, as it's correctly used after the loop.

🧰 Tools
🪛 Ruff (0.8.2)
52-52: Loop control variable `i` not used within loop body (B007)
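The suggested special case can also be expressed as a small standalone helper. This is a hedged sketch of the documented contract (identical strings return the string minus its last letter); the free-function form and name are illustrative, not the exact method on the class:

```python
def common_prefix(s1, s2):
    """Common prefix of two strings; for identical strings, return the
    string minus its last letter (the helper's documented contract).

    Unlike the original, this sketch also behaves sanely for empty
    strings: ""[:-1] is "" and an empty zip() yields no pairs.
    """
    if s1 == s2:
        return s1[:-1]
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]
```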
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake/common/prefix_lookup.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
snakemake/common/prefix_lookup.py
🪛 Ruff (0.8.2)
snakemake/common/prefix_lookup.py
52-52: Loop control variable i not used within loop body
(B007)
🔇 Additional comments (3)
snakemake/common/prefix_lookup.py (3)
7-9: Good implementation of a utility class for prefix matching. This `PrefixLookup` class provides a clean replacement for the datrie dependency mentioned in the PR objectives, with appropriate documentation of its purpose.
11-15: Well-designed initialization method. The implementation efficiently groups and sorts entries for later binary search operations. Using `defaultdict` here simplifies the code and makes it more maintainable.
29-44: Well-implemented binary search approach for prefix matching. The implementation efficiently uses binary search via the `bisect` module to find matching prefixes. The early break condition prevents infinite loops when dealing with empty strings or when common prefixes are exhausted.
johanneskoester
left a comment
There was a problem hiding this comment.
Thanks a lot, great work! @tbooth and @leoschwarz can you say something about the performance? Is this as fast as datrie for all realistic scenarios in practice?
|
I think @leoschwarz may have some timing numbers on performance with some sample DAGs. Profiling the code vs. datrie with cProfile it does come off a little slower, but we're talking 100ms for 10000 searches on a million-entry list. I can't imagine this would ever be noticeable in real usage. More importantly, I will reiterate that the search algorithm here is a binary search (with bisect) which is performed at most n times, where n is the number of valid prefixes being returned for that search term. Therefore there are no scaling issues, and no awkward corner cases where the performance will unexpectedly degrade. |
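To make the scaling claim reproducible, here is a hedged micro-benchmark sketch on synthetic keys (the data and sizes are illustrative and will not reproduce the exact numbers quoted above). Each lookup performs one `bisect` per step and the search target shrinks strictly, so runtime grows with the number of matches rather than the number of stored keys:

```python
import random
import string
import timeit
from bisect import bisect_right

# Synthetic keys only -- illustrative, not the workflow data behind the
# numbers quoted in this thread.
random.seed(0)
keys = sorted(
    {
        "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 10)))
        for _ in range(100_000)
    }
)


def _common_prefix(s1, s2):
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return s1[:n]


def count_prefix_matches(query):
    """Count stored keys that are prefixes of query: one bisect per step."""
    matches = 0
    target = query  # invariant: target is always a prefix of query
    while True:
        i = bisect_right(keys, target)
        if i == 0:
            return matches
        key = keys[i - 1]
        if query.startswith(key):
            matches += 1
            if not key:
                return matches
            target = key[:-1]  # next match must be strictly shorter
        else:
            # Jump to the common prefix; nothing in between can match.
            target = _common_prefix(key, target)


queries = ["".join(random.choices(string.ascii_lowercase, k=15)) for _ in range(1_000)]
elapsed = timeit.timeit(lambda: [count_prefix_matches(q) for q in queries], number=1)
print(f"{len(queries)} lookups over {len(keys):,} keys took {elapsed * 1000:.1f} ms")
```

The naive alternative, scanning all stored keys per query, degrades linearly with the key count, which is exactly the regime where the earlier backward-scan prototype struggled.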
|
Time to merge then! We can always further optimize this at some later point. For now, this sounds good given that datrie is a bigger issue for obvious reasons. |
|
Thanks a lot @tbooth and @leoschwarz! This is really some great work! |
|
Hi, thank you for reviewing and merging @johanneskoester ! And thank you @tbooth for fixing and optimizing my original implementation in many ways, this was quite cool. As for what I've benchmarked, it is linked in #2970 with code in https://github.com/leoschwarz/benchmark_stuff/tree/main/2024/202411_snakemake_trie but comes with some caveats:
I think it would be more useful to have a suite of snakemake workflows for which the performance evolution can be monitored continually, if this is desired. But it would need a bit more thought than what I quickly put together in that benchmark. |
|
Hi @johanneskoester, I just noticed that this didn't show up in the changelog for |
…in-xrootd to 0.4.1 for Python >=3.11 - Update Snakemake to 9.1.0 for Python >=3.11 to remove the unmaintained datrie dependency, fixing installation issues with modern GCC. - Update snakemake-storage-plugin-xrootd to 0.4.1 for compatibility with Snakemake 9.x. - Keep Snakemake 7.32.4 and pulp for Python <3.11 for backward compatibility. - All tests pass locally (`186 passed, 4 xfailed`), matching CI results. References: - pytries/datrie#101 - pytries/datrie#99 - snakemake/snakemake#3176 - snakemake/snakemake#3105 - snakemake/snakemake#2970 - https://github.com/snakemake/snakemake/releases/tag/v8.29.1 Closes reanahub#494
This PR proposes to replace the datrie dependency, which is causing installation problems now (due to native compilation failing on some newer platforms). The logic which used tries before has been extracted into a class `PrefixLookup` which allows storing (string) key and (arbitrary) value pairs allowing multiple values for the same key. It provides functionality to obtain all values for which their corresponding key is a prefix of the specified query. This lookup is performed in `O(m log(n))` (thanks again @tbooth) where m is the number of matching results, and n is the length of the entries list being searched. When constructing the `PrefixLookup` index initially the cost depends on the number of unique keys, as they will be sorted. The PR also adds some unit tests for this functionality. Fixes snakemake#3105 and Fixes snakemake#2970

### QC
* [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Introduced a new `PrefixLookup` class for efficient prefix matching.
- **Bug Fixes**
  - Removed the `datrie` dependency across various configuration files, simplifying installation requirements.
- **Documentation**
  - Updated the FAQ to remove redundant phrases related to compiler errors.
- **Tests**
  - Introduced comprehensive unit tests for the `OutputIndex` class, covering various matching scenarios and edge cases.
  - Added unit tests for the new `PrefixLookup` class to validate its functionality.
  - Updated tests to ensure robustness in matching functionalities under various conditions.

Co-authored-by: Tim Booth <tim.booth@ed.ac.uk>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: Johannes Koester <johannes.koester@uni-due.de>



This PR proposes to replace the datrie dependency, which is causing installation problems now (due to native compilation failing on some newer platforms).
The logic which used tries before has been extracted into a class `PrefixLookup`, which allows storing (string) key and (arbitrary) value pairs, allowing multiple values for the same key. It provides functionality to obtain all values for which their corresponding key is a prefix of the specified query.

This lookup is performed in `O(m log(n))` (thanks again @tbooth), where m is the number of matching results and n is the length of the entries list being searched. When constructing the `PrefixLookup` index initially, the cost depends on the number of unique keys, as they will be sorted.

The PR also adds some unit tests for this functionality.
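To make the described contract concrete, here is a usage sketch against a deliberately naive linear-scan reference with the same `match()` semantics. The rule names are hypothetical, and the merged class achieves the stated complexity with a sorted list and `bisect` rather than the scan shown here:

```python
class NaivePrefixLookup:
    """Reference semantics only: O(n) linear scan per query.
    Stores (key, value) pairs, allowing multiple values per key."""

    def __init__(self, entries):
        self._entries = list(entries)

    def match(self, query):
        # All values whose key is a prefix of the query.
        return {v for k, v in self._entries if query.startswith(k)}


rules = NaivePrefixLookup(
    [
        ("results/", "rule_all"),
        ("results/plots/", "rule_plot"),
        ("results/plots/", "rule_plot_alt"),  # multiple values for one key
        ("logs/", "rule_log"),
    ]
)
print(sorted(rules.match("results/plots/fig1.svg")))
# → ['rule_all', 'rule_plot', 'rule_plot_alt']
```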
Fixes #3105 and
Fixes #2970
QC
- [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
- [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new `PrefixLookup` class for efficient prefix matching.

Bug Fixes
- Removed the `datrie` dependency across various configuration files, simplifying installation requirements.

Documentation
- Updated the FAQ to remove redundant phrases related to compiler errors.

Tests
- Introduced comprehensive unit tests for the `OutputIndex` class, covering various matching scenarios and edge cases.
- Added unit tests for the new `PrefixLookup` class to validate its functionality.
- Updated tests to ensure robustness in matching functionalities under various conditions.