Skip to content

Add a default FileStatisticsCache implementation for the ListingTable #19217

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Right now, when a table is created via CREATE EXTERNAL TABLE, the underyling ListingTable reads and parses the statistics for each file. If the same table is queried, the cached statistics are re-read. However, if the same files are queried again, the statistics are re-read.

-- has to read remote storage and calculate statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
-- re-calculates the same statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.

You can see the statistics are recalculated with the timings 0.092s

If you use a CREATE EXTERNAL TABLE, which caches the statistics, it take sonly 0.025 seconds:

CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.029 seconds.

In addition, since a new cache is created for each table, the great function statistics_cache added by @nuno-faria in #19054 doesn't show anything:

> CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.405 seconds.

> select * from statistics_cache();
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
0 row(s) fetched.
Elapsed 0.015 seconds.

Describe the solution you'd like

I would like a session scoped FileStatisticsCache that is shared between statements / ListingTables , the same way the DefaultFilesMetadataCache is: created:

let file_metadata_cache = config
.file_metadata_cache
.as_ref()
.map(Arc::clone)
.unwrap_or_else(|| {
Arc::new(DefaultFilesMetadataCache::new(config.metadata_cache_limit))
});

The code for ListingTable somewhat unobviously sets a DefaultFileStatisticsCache here
https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259

Describe alternatives you've considered

No response

Additional context

We probably also need to add a limit, like this:

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions