[Parquet Metadata Cache] Use the cached metadata for ListingTable statistics #17002

@alamb

Is your feature request related to a problem or challenge?

The Parquet metadata cache doesn't seem to help certain queries that use statistics. Specifically, I expect that the second time the query is run it should do no network I/O at all, because the ParquetMetadata is already cached:

> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.

Describe the solution you'd like

I would like the queries above to run faster by using the ParquetMetaData cache.

Describe alternatives you've considered

I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974

So to fix this issue, I think we need to check the FileMetadataCache first, before actually fetching any ParquetMetadata.
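To illustrate the idea, here is a minimal, self-contained sketch of a cache-first lookup. It is not DataFusion code: `FooterMetadata`, `MetadataCache`, and `get_or_fetch` are hypothetical names made up for this example, and the real FileMetadataCache API may look quite different. The point is only that the statistics path would consult the cache and go to object storage on a miss.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
struct FooterMetadata {
    num_rows: u64,
}

/// Hypothetical cache keyed by object path (not the real FileMetadataCache).
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FooterMetadata>>>,
}

impl MetadataCache {
    /// Return the cached metadata for `path`, calling `fetch` only on a miss.
    /// The lock is held while `fetch` runs, which keeps the sketch simple
    /// but would serialize concurrent fetches in real code.
    fn get_or_fetch<F>(&self, path: &str, fetch: F) -> Arc<FooterMetadata>
    where
        F: FnOnce() -> FooterMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            // Cache hit: no network request is needed.
            return Arc::clone(hit);
        }
        // Cache miss: fetch once, then remember the result.
        let fetched = Arc::new(fetch());
        entries.insert(path.to_string(), Arc::clone(&fetched));
        fetched
    }
}
```

With something like this in the statistics path, the second `count(*)` over the same files would call `get_or_fetch` again, hit the cache, and skip the footer reads. A real implementation would presumably also need to invalidate an entry when the underlying file changes.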

Additional context

No response
