-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Feat: DefaultListFilesCache prefix-aware for partition pruning optimization #19298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@BlakeOrth Could you check this, if this is the approach you were suggesting, table's base path is key and extra being the prefix ? |
BlakeOrth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! I do think this is generally what I had in mind, and I'm relieved to see the actual implementation didn't seem to have too many surprises. I reviewed this code carefully and have left a few comments.
47e1bcf to
ce365cc
Compare
BlakeOrth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this most recent set of changes. I think this is looking even better now!
ce365cc to
5614873
Compare
|
@BlakeOrth I did the changes you asked but for the last one , but if we change Again just an opinion Guess I as well need to spend more time with the codebase . |
|
Hey @Jefffrey @BlakeOrth could you review this one !! |
BlakeOrth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the short delay on the review, I think this is looking good to me now. Thank you! My approval doesn't have any power here so we'll still need some time from a maintainer.
cc @alamb
Yes, I thought this might end up happening. I don't think it's a big deal, but I do think it was worth at least exploring. Thanks for taking the time to look into it! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Yuvraj-cyborg and @BlakeOrth -- I took a look at this and it looks great to me
It would be nicer to reduce the verbosity of the testing harness, but I also think we could do it as a follow on PR
5614873 to
2aa1685
Compare
…s, remove TTL from tests
2aa1685 to
a1a8ea0
Compare
|
@alamb I've made the changes you asked for ! |
| } | ||
|
|
||
| /// Create a MockSession with a custom RuntimeEnv (for cache testing) | ||
| fn with_runtime_env(runtime_env: Arc<RuntimeEnv>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Thanks again @Yuvraj-cyborg |
Which issue does this PR close?
Closes #19273
Rationale for this change
DefaultListFilesCachecurrently uses the exact listing path as the cache key. When partition pruning narrows queries to specific partition prefixes (e.g., s3://bucket/table/year=2024/), the cache lookup fails even if a full table listing (s3://bucket/table/) was previously cached. This leads to redundant object store calls and duplicate cache entries for different partition filters on the same table.What changes are included in this PR?
ListFilesCachetrait to useExtra = Option<Path>(partition prefix) instead of ObjectMetaget_with_prefix(table_base, prefix, now) method toDefaultListFilesCachethat:list_with_cacheinurl.rsto:compute_relative_prefixhelper functionAre these changes tested?
Yes. Six dedicated unit tests validate prefix-aware cache behavior:
test_prefix_aware_cache_hit - filters cached results by prefix
test_prefix_aware_cache_no_filter_returns_all - returns all files when no prefix specified
test_prefix_aware_cache_miss_no_entry - handles cache misses correctly
test_prefix_aware_cache_no_matching_files - returns empty when no files match prefix
test_prefix_aware_nested_partitions - handles nested partition paths (e.g., year=2024/month=01/)
test_prefix_aware_different_tables - ensures different tables have isolated cache entries
Are there any user-facing changes?
No direct API changes. Users will see improved cache efficiency when querying partitioned tables - partition-pruned queries can now be served from cached full-table listings, reducing object store calls.