-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
The current implementation of the DefaultListFilesCache stores and retrieves entries from the cache using a provided Path as the key:
| fn get(&mut self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> { |
When using tables that have partitions, DataFusion will attempt to list files for a specific prefix if a user's query filters can be evaluated to exact, known partition values. E.g.
select * from my_table where a=1will use my_table/a=1/ as the Path if that partition exists.
In these scenarios, it's possible that the key for my_table with all of the files backing the table already exists in the cache, however the DefaultListFilesCache would not be able to fetch data for my_table/a=1/ because they keys would not match.
A cache miss in this scenario is undesirable for two reasons:
- DataFusion will execute a
Listrequest to backing storage to fetch a key that already exists in the cache - DataFusion will add
my_table/a=1/as a key to the cache, duplicating data in the cache
Describe the solution you'd like
I would like to enhance the DefaultListFilesCache to be "prefix aware" when attempting to fetch data.
The cache infrastructure currently allows a get_with_extra method:
| fn get_with_extra(&self, k: &Path, _e: &Self::Extra) -> Option<Arc<Vec<ObjectMeta>>> { |
I think it should be possible to define type Extra = Path (or perhaps Vec<PathPart>?) where the extra parameter could represent the prefix, and the standard key parameter can represent the base table_url. This should allow the DefaultListFilesCache to find and filter entries for a table to the requested path prefix.
Care would need to be taken to ensure that adding prefixed data to the cache does not return incomplete results for subsequent queries to the table.
Describe alternatives you've considered
It's possible using a different keying mechanism for the cache entirely could work as well. There's likely a potential solution that uses key: Vec<PathPart> or something similar and have the cache itself internalize the management of understanding when entries may or may not match. The difficulty here would likely be efficiently determining which parts of a path belong to a table vs a prefix.
Additional context
No response