Description
Is your feature request related to a problem or challenge?
We are adding a parquet metadata cache to ListingTable 🎉 (thanks @nuno-faria @jonathanc-n and @shehabgamin )
It turns out the cache is somewhat tricky to get right, and it is not always clear what is going on. It is especially tricky when the metadata is sometimes cached with page indexes and sometimes without them; for example, see this PR:
Describe the solution you'd like
I would like some way to see the contents of the cache with basic statistics
Describe alternatives you've considered
I suggest a twofold approach:
- Add APIs to the `DefaultFileMetadataCache` itself
- Add a function in `datafusion-cli` that uses those APIs to show the cache state
This two-pronged approach would
- Help debug the working of the cache with datafusion-cli
- Ensure the APIs on the cache can be used to build useful introspection tools
- Offer an example of how to build such a thing for others
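To make the first prong concrete, here is a minimal sketch of what an introspection API on a cache could look like. Everything in it is hypothetical: `ToyMetadataCache`, `CacheEntryInfo`, `put`, `get`, and `list_entries` are illustrative names, not the existing `DefaultFileMetadataCache` API, and the fields simply mirror the columns proposed in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical snapshot of one cached entry. The fields mirror the
/// columns proposed in this issue; this is not an existing DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntryInfo {
    pub path: String,
    pub e_tag: Option<String>,
    pub size_bytes: usize,
    pub page_index: bool,
    pub hits: u64,
}

/// Toy stand-in for a file metadata cache (illustrative only).
#[derive(Default)]
pub struct ToyMetadataCache {
    entries: HashMap<String, CacheEntryInfo>,
}

impl ToyMetadataCache {
    /// Insert or replace the entry for `path`, resetting its hit count.
    pub fn put(&mut self, path: &str, e_tag: Option<&str>, size_bytes: usize, page_index: bool) {
        let info = CacheEntryInfo {
            path: path.to_string(),
            e_tag: e_tag.map(str::to_string),
            size_bytes,
            page_index,
            hits: 0,
        };
        self.entries.insert(path.to_string(), info);
    }

    /// Look up `path`, counting a hit only when the entry is present.
    pub fn get(&mut self, path: &str) -> Option<&CacheEntryInfo> {
        let entry = self.entries.get_mut(path)?;
        entry.hits += 1;
        Some(&*entry)
    }

    /// Introspection hook: a snapshot of all entries, sorted by path so
    /// the output is stable across calls.
    pub fn list_entries(&self) -> Vec<CacheEntryInfo> {
        let mut out: Vec<CacheEntryInfo> = self.entries.values().cloned().collect();
        out.sort_by(|a, b| a.path.cmp(&b.path));
        out
    }
}
```

Returning owned snapshots from `list_entries` keeps the sketch simple and means a caller such as a CLI function would not hold a borrow on the cache while formatting output. A real implementation would likely need interior locking, which is omitted here.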
An example might look like
`select * from parquet_metadata_cache()`
And the output might look like
| path | e_tag | size_bytes | page_index | hits |
|---|---|---|---|---|
| /foo/bar | 1234 | t | 12 | |
| /foo/baz | xdef | 3781 | t | 1 |
| ... |
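As a rough illustration of how such rows could be turned into the table above, here is a small sketch. `CacheRow` and `render_cache_table` are hypothetical names invented for this example; an actual `datafusion-cli` function would more likely return Arrow record batches and let the CLI print them.

```rust
/// Hypothetical row type with the columns proposed above.
pub struct CacheRow {
    pub path: String,
    pub e_tag: String,
    pub size_bytes: usize,
    pub page_index: bool,
    pub hits: u64,
}

/// Render rows as a markdown-style table with the proposed header.
pub fn render_cache_table(rows: &[CacheRow]) -> String {
    let mut out = String::from(
        "| path | e_tag | size_bytes | page_index | hits |\n|---|---|---|---|---|\n",
    );
    for r in rows {
        // Booleans are shown as `t` / `f`, matching the example output.
        let flag = if r.page_index { "t" } else { "f" };
        out.push_str(&format!(
            "| {} | {} | {} | {} | {} |\n",
            r.path, r.e_tag, r.size_bytes, flag, r.hits
        ));
    }
    out
}
```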
I think we could model its implementation on the `parquet_metadata` function: https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata
See `pub struct ParquetMetadataFunc {}` in datafusion/datafusion-cli/src/functions.rs (line 320 in 173989c).
Additional context
No response