Description
Is your feature request related to a problem or challenge?
Right now, when a table is created via CREATE EXTERNAL TABLE, the underlying ListingTable reads and parses the statistics for each file and caches them. If the same table is queried again, the cached statistics are reused. However, if the same files are queried directly by path, the statistics are recalculated on every query:
-- has to read remote storage and calculate statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
-- re-calculates the same statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
You can see from the timings that the statistics are recalculated: both queries take 0.092 seconds.
If you use CREATE EXTERNAL TABLE, which caches the statistics, the query takes only 0.029 seconds:
CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 0.029 seconds.
In addition, since a new cache is created for each table, the great function statistics_cache added by @nuno-faria in #19054 doesn't show anything:
> CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.405 seconds.
> select * from statistics_cache();
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
0 row(s) fetched.
Elapsed 0.015 seconds.
Describe the solution you'd like
I would like a session-scoped FileStatisticsCache that is shared between statements / ListingTables, created the same way the DefaultFilesMetadataCache is:
datafusion/datafusion/execution/src/cache/cache_manager.rs
Lines 165 to 171 in 57c0dda
    let file_metadata_cache = config
        .file_metadata_cache
        .as_ref()
        .map(Arc::clone)
        .unwrap_or_else(|| {
            Arc::new(DefaultFilesMetadataCache::new(config.metadata_cache_limit))
        });
The code for ListingTable somewhat non-obviously sets a DefaultFileStatisticsCache here:
https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259
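To illustrate the idea, here is a minimal, standard-library-only sketch of a cache that is owned by the session and handed to every table, rather than created per table. It is not DataFusion code: `FileStats`, `StatsCache`, `SessionConfig`, `Session`, and `table_cache` are hypothetical stand-ins invented for this example, and only the `as_ref().map(Arc::clone).unwrap_or_else(...)` shape mirrors the `cache_manager.rs` snippet above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for per-file statistics (not DataFusion's type).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: usize,
}

/// Hypothetical stand-in for a file statistics cache keyed by object path.
#[derive(Default)]
struct StatsCache {
    entries: Mutex<HashMap<String, FileStats>>,
}

impl StatsCache {
    /// Returns the cached statistics, running `compute` only on a cache miss.
    fn get_or_compute(&self, path: &str, compute: impl FnOnce() -> FileStats) -> FileStats {
        let mut entries = self.entries.lock().unwrap();
        let stats = entries.entry(path.to_string()).or_insert_with(compute).clone();
        stats
    }

    /// Number of files that currently have cached statistics.
    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Hypothetical session configuration: callers may supply their own cache.
#[derive(Default)]
struct SessionConfig {
    file_statistics_cache: Option<Arc<StatsCache>>,
}

/// Hypothetical session that owns one cache for all tables and statements.
struct Session {
    file_statistics_cache: Arc<StatsCache>,
}

impl Session {
    fn new(config: &SessionConfig) -> Self {
        // Use the configured cache if present, otherwise create a default one.
        let file_statistics_cache = config
            .file_statistics_cache
            .as_ref()
            .map(Arc::clone)
            .unwrap_or_else(|| Arc::new(StatsCache::default()));
        Self { file_statistics_cache }
    }

    /// Every table created in this session receives the same shared cache.
    fn table_cache(&self) -> Arc<StatsCache> {
        Arc::clone(&self.file_statistics_cache)
    }
}

fn main() {
    let session = Session::new(&SessionConfig::default());
    let mut computations = 0;

    // Two separate "statements" over the same file share one cache.
    for _ in 0..2 {
        let cache = session.table_cache();
        let stats = cache.get_or_compute("hits_1.parquet", || {
            computations += 1; // stands in for reading remote storage
            FileStats { num_rows: 1_000_000 }
        });
        assert_eq!(stats.num_rows, 1_000_000);
    }

    // The statistics were computed once, and the shared cache has one entry.
    assert_eq!(computations, 1);
    assert_eq!(session.table_cache().len(), 1);
}
```

Because the cache lives on the session in this sketch, a function that lists cache entries would see the file regardless of which table or ad hoc query populated it.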
Describe alternatives you've considered
No response
Additional context
We probably also need to add a limit, like this: