Skip to content

Conversation

@Yuvraj-cyborg
Copy link
Contributor

Which issue does this PR close?

Closes #19273

Rationale for this change

DefaultListFilesCache currently uses the exact listing path as the cache key. When partition pruning narrows queries to specific partition prefixes (e.g., s3://bucket/table/year=2024/), the cache lookup fails even if a full table listing (s3://bucket/table/) was previously cached. This leads to redundant object store calls and duplicate cache entries for different partition filters on the same table.

What changes are included in this PR?

  • Updated ListFilesCache trait to use Extra = Option<Path> (partition prefix) instead of ObjectMeta
  • Added get_with_prefix(table_base, prefix, now) method to DefaultListFilesCache that:
    • Uses the table base path as a stable cache key
    • Optionally filters cached results by a partition prefix
    • Handles TTL expiration checks
  • Updated list_with_cache in url.rs to:
    • Always use the table base path as the cache key
    • Compute the relative prefix between the listing URL and table base
    • Always cache full table listings to ensure complete data is available for subsequent partition queries
    • Added compute_relative_prefix helper function

Are these changes tested?

Yes. Six dedicated unit tests validate prefix-aware cache behavior:

test_prefix_aware_cache_hit - filters cached results by prefix
test_prefix_aware_cache_no_filter_returns_all - returns all files when no prefix specified
test_prefix_aware_cache_miss_no_entry - handles cache misses correctly
test_prefix_aware_cache_no_matching_files - returns empty when no files match prefix
test_prefix_aware_nested_partitions - handles nested partition paths (e.g., year=2024/month=01/)
test_prefix_aware_different_tables - ensures different tables have isolated cache entries

Are there any user-facing changes?

No direct API changes. Users will see improved cache efficiency when querying partitioned tables - partition-pruned queries can now be served from cached full-table listings, reducing object store calls.

@github-actions github-actions bot added execution Related to the execution crate datasource Changes to the datasource crate labels Dec 12, 2025
@Yuvraj-cyborg
Copy link
Contributor Author

@BlakeOrth Could you check this, if this is the approach you were suggesting, table's base path is key and extra being the prefix ?

Copy link
Contributor

@BlakeOrth BlakeOrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! I do think this is generally what I had in mind, and I'm relieved to see the actual implementation didn't seem to have too many surprises. I reviewed this code carefully and have left a few comments.

Copy link
Contributor

@BlakeOrth BlakeOrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this most recent set of changes. I think this is looking even better now!

@Yuvraj-cyborg
Copy link
Contributor Author

Yuvraj-cyborg commented Dec 13, 2025

@BlakeOrth I did the changes you asked but for the last one , but if we change Option<Path> to Option<&Path> , we need to change CacheAccessor trait's Extra type to be a reference and there can be complexity in lifetimes...

Again just an opinion
If you think that's a better architecture then would change it !

Guess I as well need to spend more time with the codebase .

@Yuvraj-cyborg
Copy link
Contributor Author

Hey @Jefffrey @BlakeOrth could you review this one !!

Copy link
Contributor

@BlakeOrth BlakeOrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the short delay on the review, I think this is looking good to me now. Thank you! My approval doesn't have any power here so we'll still need some time from a maintainer.

cc @alamb

@BlakeOrth
Copy link
Contributor

but if we change Option<Path> to Option<&Path> , we need to change CacheAccessor trait's Extra type to be a reference and there can be complexity in lifetimes...

Yes, I thought this might end up happening. I don't think it's a big deal, but I do think it was worth at least exploring. Thanks for taking the time to look into it!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Yuvraj-cyborg and @BlakeOrth -- I took a look at this and it looks great to me

It would be nicer to reduce the verbosity of the testing harness, but I also think we could do it as a follow on PR

@Yuvraj-cyborg
Copy link
Contributor Author

@alamb I've made the changes you asked for !

}

/// Create a MockSession with a custom RuntimeEnv (for cache testing)
fn with_runtime_env(runtime_env: Arc<RuntimeEnv>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb added this pull request to the merge queue Dec 17, 2025
@alamb
Copy link
Contributor

alamb commented Dec 17, 2025

Thanks again @Yuvraj-cyborg

Merged via the queue into apache:main with commit 1e4bd75 Dec 17, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate execution Related to the execution crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make DefaultListFilesCache work better with prefixed paths

3 participants