[Parquet Metadata Cache] Add an API to review the contents of the Cache #17091

@alamb

Is your feature request related to a problem or challenge?

We are adding a Parquet metadata cache to ListingTable 🎉 (thanks @nuno-faria, @jonathanc-n, and @shehabgamin)

The cache has turned out to be somewhat tricky to get right, and it is not always clear what it is doing. It is especially tricky when the metadata for a file is sometimes cached with page indexes and sometimes without them; for example, see this PR:

Describe the solution you'd like

I would like some way to see the contents of the cache, along with basic statistics for each entry.

Describe alternatives you've considered

I suggest a twofold approach:

  1. Add APIs to the DefaultFileMetadataCache itself
  2. Add a function in datafusion-cli that uses those APIs to show the cache state

This two-pronged approach would:

  1. Help debug the working of the cache with datafusion-cli
  2. Ensure the APIs on the cache can be used to build useful introspection tools
  3. Offer an example of how to build such a thing for others
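To make the first prong concrete, here is a minimal sketch of what an introspection API on a cache could look like. It uses only the standard library. `SketchMetadataCache`, `CacheEntryInfo`, and `list_entries` are hypothetical names chosen for illustration, not the actual `DefaultFileMetadataCache` API, and the fields simply mirror the columns proposed in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical per-entry statistics; the fields mirror the columns proposed in this issue.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntryInfo {
    pub path: String,
    pub e_tag: Option<String>,
    pub size_bytes: usize,
    pub page_index: bool,
    pub hits: usize,
}

/// Stand-in for a metadata cache (an assumption, NOT DataFusion's DefaultFileMetadataCache).
#[derive(Default)]
pub struct SketchMetadataCache {
    entries: HashMap<String, CacheEntryInfo>,
}

impl SketchMetadataCache {
    /// Record that metadata for `path` was cached, replacing any previous entry.
    pub fn put(&mut self, path: &str, e_tag: Option<&str>, size_bytes: usize, page_index: bool) {
        self.entries.insert(
            path.to_string(),
            CacheEntryInfo {
                path: path.to_string(),
                e_tag: e_tag.map(str::to_string),
                size_bytes,
                page_index,
                hits: 0,
            },
        );
    }

    /// Look up an entry, counting the lookup as a hit when the entry is present.
    pub fn get(&mut self, path: &str) -> Option<&CacheEntryInfo> {
        let entry = self.entries.get_mut(path)?;
        entry.hits += 1;
        Some(&*entry)
    }

    /// Introspection API: a snapshot of every entry, sorted by path for stable output.
    pub fn list_entries(&self) -> Vec<CacheEntryInfo> {
        let mut out: Vec<_> = self.entries.values().cloned().collect();
        out.sort_by(|a, b| a.path.cmp(&b.path));
        out
    }
}
```

Returning an owned snapshot from `list_entries` keeps the sketch simple: a caller such as a datafusion-cli function could turn the result into rows without holding a reference into the cache.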

An example query might look like:

select * from parquet_metadata_cache()

And the output might look like:

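| path     | e_tag | size_bytes | page_index | hits |
|----------|-------|------------|------------|------|
| /foo/bar |       | 1234       | t          | 12   |
| /foo/baz | xdef  | 3781       | t          | 1    |
| ...      |       |            |            |      |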
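As a rough illustration of how cache entries could map onto rows shaped like the example output above, here is a small standard-library sketch. `CacheRow` and `render_rows` are hypothetical names, and the sketch assumes that a missing `e_tag` is shown as an empty cell and that `page_index` is printed as `t` or `f`. A real implementation in datafusion-cli would presumably build Arrow arrays rather than text.

```rust
/// Hypothetical row type; the fields mirror the columns in the example output.
pub struct CacheRow {
    pub path: String,
    pub e_tag: Option<String>,
    pub size_bytes: usize,
    pub page_index: bool,
    pub hits: usize,
}

/// Render a header line followed by one pipe-delimited line per cache entry.
pub fn render_rows(rows: &[CacheRow]) -> String {
    let mut out = String::from("path | e_tag | size_bytes | page_index | hits\n");
    for r in rows {
        // Assumption: entries without an e_tag are shown as an empty cell.
        let e_tag = r.e_tag.as_deref().unwrap_or("");
        // Assumption: booleans are printed as `t` / `f`, as in the example output.
        let page_index = if r.page_index { "t" } else { "f" };
        out.push_str(&format!(
            "{} | {} | {} | {} | {}\n",
            r.path, e_tag, r.size_bytes, page_index, r.hits
        ));
    }
    out
}
```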

I think we could model its implementation on the parquet_metadata function: https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata

pub struct ParquetMetadataFunc {}

Additional context

No response
