Skip to content

Conversation

@nuno-faria
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Allowing the user to be able to check what metadata is currently cached improves the transparency of the caching, making it easier to debug any issues. This can be done by accessing FileMetadataCache::list_entries or directly in datafusion-cli through the metadata_cache function. Example:

> select * from metadata_cache();
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
| path | file_modified | file_size_bytes | e_tag | version | metadata_size_bytes | hits | extra |
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
> select count(*) from 't.parquet';
...

> select * from metadata_cache();
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+
| path          | file_modified           | file_size_bytes | e_tag                   | version | metadata_size_bytes | hits | extra           |
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+
| .../t.parquet | 2025-08-11T17:53:15.992 | 251365569       | 0-63c1a991e7d35-efb88c1 | NULL    | 5788659             | 1    | page_index=true |
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+
> select count(*) from 't.parquet';
...

> select * from metadata_cache();
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+
| path          | file_modified           | file_size_bytes | e_tag                   | version | metadata_size_bytes | hits | extra           |
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+
| .../t.parquet | 2025-08-11T17:53:15.992 | 251365569       | 0-63c1a991e7d35-efb88c1 | NULL    | 5788659             | 3    | page_index=true |
+---------------+-------------------------+-----------------+-------------------------+---------+---------------------+------+-----------------+

What changes are included in this PR?

  • Updated FileMetadataCache to allow users to check the cache contents with list_entries;
  • Updated FileMetadata to add the ability for implementers to provide extra_info.
  • Updated CachedParquetMetaData to implement extra_info. In this case, it returns whether the cached metadata contains the page index.
  • Updated DefaultFilesMetadataCache to keep track of cache hits and to implement list_entries.
  • Added the metadata_cache UDF in datafusion-cli.
  • Added unit tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

New default UDF in datafusion-cli.

@github-actions github-actions bot added execution Related to the execution crate datasource Changes to the datasource crate labels Aug 11, 2025
@alamb
Copy link
Contributor

alamb commented Aug 11, 2025

Merging up to main to try and get a clean CI run

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nuno-faria -- this is pretty amazing

I didn't really find any comments worth making

I think we do need to add this function to the docs (I will push a commit to this PR to do so)

DataFusion CLI v49.0.0
> select * from metadata_cache();
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
| path | file_modified | file_size_bytes | e_tag | version | metadata_size_bytes | hits | extra |
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
0 row(s) fetched.
Elapsed 0.012 seconds.

> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.623 seconds.

> select * from metadata_cache();
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+
| path                                               | file_modified       | file_size_bytes | e_tag                                 | version | metadata_size_bytes | hits | extra            |
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+
| hits_compatible/athena_partitioned/hits_61.parquet | 2022-07-03T15:40:34 | 117270944       | "5db11cad1ca0d80d748fc92c914b010a-6"  | NULL    | 212949              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_32.parquet | 2022-07-03T15:37:17 | 94506004        | "2f7db49a9fe242179590b615b94a39d2-5"  | NULL    | 278157              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_40.parquet | 2022-07-03T15:38:07 | 142508647       | "9e5852b45a469d5a05bf270a286eab8a-8"  | NULL    | 212917              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_93.parquet | 2022-07-03T15:44:07 | 127987774       | "751100bf0dac7d489b9836abf3108b99-7"  | NULL    | 278318              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_80.parquet | 2022-07-03T15:42:40 | 135714790       | "bceeea2c3642d594bccbd9c61eae87c4-7"  | NULL    | 214379              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_11.parquet | 2022-07-03T15:35:03 | 118419888       | "6875a9903c983367f77ec161244509d0-6"  | NULL    | 279667              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_77.parquet | 2022-07-03T15:42:18 | 214853191       | "49de48cc5ada10be63c82908e484d811-12" | NULL    | 213571              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_20.parquet | 2022-07-03T15:36:02 | 85766533        | "0076ea5ecc5ddc85fdcf6893657640d0-4"  | NULL    | 213897              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_22.parquet | 2022-07-03T15:36:14 | 79775901        | "d620286f35855aec0a946cdc798c3742-4"  | NULL    | 213150              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_15.parquet | 2022-07-03T15:35:29 | 103098894       | "71fb3de3efd55e9a7b388ec6e9b796d7-5"  | NULL    | 279137              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_86.parquet | 2022-07-03T15:43:20 | 94791732        | "9cc7e64c9d72559bc566dc1a17c3e5b2-5"  | NULL    | 278313              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_19.parquet | 2022-07-03T15:35:54 | 103692598       | "4f96ce109a7cc4f33fe043983ac6778d-5"  | NULL    | 148569              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_2.parquet  | 2022-07-03T15:34:04 | 230595491       | "6fb3dd8baa68a0cacd5330fdb021bc95-12" | NULL    | 344551              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_43.parquet | 2022-07-03T15:38:25 | 299692947       | "b9566650354c86db5d5e582056f63ddd-16" | NULL    | 278440              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_79.parquet | 2022-07-03T15:42:34 | 183779939       | "d49cfec843a49ba390f61848476ea6d5-10" | NULL    | 278285              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_67.parquet | 2022-07-03T15:41:12 | 244748637       | "b8298ac683ebe7759fba857a55d0487a-13" | NULL    | 344741              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_27.parquet | 2022-07-03T15:36:45 | 166286210       | "eafc2f587eb46b53f8392c8159bda44a-9"  | NULL    | 213047              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_50.parquet | 2022-07-03T15:39:14 | 245339079       | "fe72aacbb2eaec8d0ca1f367f0f2669c-13" | NULL    | 279047              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_23.parquet | 2022-07-03T15:36:19 | 79631107        | "9f4df9af52602db8856bc3b77d42bf9e-4"  | NULL    | 148405              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_37.parquet | 2022-07-03T15:37:47 | 108247781       | "eb6dbd8380e9a35e3f1582363f380144-6"  | NULL    | 277956              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_13.parquet | 2022-07-03T15:35:16 | 146132022       | "a4b319a220d36c4a7afd5937fd57d7bd-8"  | NULL    | 213479              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_71.parquet | 2022-07-03T15:41:40 | 152098737       | "1f0a71f92e6bda321a6ce1d1a37a074f-8"  | NULL    | 278595              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_56.parquet | 2022-07-03T15:39:59 | 169373476       | "09c457b7655501621ba719289623d875-9"  | NULL    | 279031              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_82.parquet | 2022-07-03T15:42:54 | 144222581       | "4964583a34787eee74e59a0c2b4bee71-8"  | NULL    | 213079              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_65.parquet | 2022-07-03T15:40:59 | 231113014       | "0b8fbc587653908deb7ab876b3a6a011-13" | NULL    | 279175              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_95.parquet | 2022-07-03T15:44:20 | 126081196       | "fd5fa352f6c25b4bf53412a853d1c7fb-7"  | NULL    | 278851              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_76.parquet | 2022-07-03T15:42:13 | 140789919       | "830b5d76686107fe86dbc36cff7b8e8c-7"  | NULL    | 278603              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_98.parquet | 2022-07-03T15:44:39 | 151992636       | "48a33cf1420c7d00df896779e1503344-8"  | NULL    | 278533              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_70.parquet | 2022-07-03T15:41:33 | 128851017       | "98e2d2b37cfb3cf0a6c9fdaced1e393a-7"  | NULL    | 148265              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_48.parquet | 2022-07-03T15:45:03 | 33680419        | "3c367e7309bfbb725394c734f167b566-1"  | NULL    | 147517              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_30.parquet | 2022-07-03T15:37:03 | 124187913       | "75daeb174d923f79f2a2305f5e708594-7"  | NULL    | 148035              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_69.parquet | 2022-07-03T15:41:26 | 154993686       | "6b5c8c13867d63bbd7c3b8b71b960783-8"  | NULL    | 213991              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_96.parquet | 2022-07-03T15:44:26 | 184175027       | "3d23cba352ce139a53723b82681b46a1-10" | NULL    | 278688              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_81.parquet | 2022-07-03T15:42:48 | 130504343       | "9e570600d6285107050f1fb52d613383-7"  | NULL    | 213184              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_12.parquet | 2022-07-03T15:35:10 | 149514164       | "8158382c2c57876fe8149d15aa815250-8"  | NULL    | 213679              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_74.parquet | 2022-07-03T15:41:58 | 160976888       | "5938ea7a3488e33f3ca9c8f02f83fd4b-9"  | NULL    | 212911              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_60.parquet | 2022-07-03T15:40:27 | 143594342       | "bafedb3d11aff2bd7077ebde5bc27d0b-8"  | NULL    | 213597              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_84.parquet | 2022-07-03T15:43:07 | 148865810       | "f80b7b9a306c0102cb2a92f11c937a88-8"  | NULL    | 278646              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_36.parquet | 2022-07-03T15:37:42 | 92487304        | "0c4410bac653f3b37003f58de85b0fc8-5"  | NULL    | 212847              | 0    | page_index=false |
| hits_compatible/athena_partitioned/hits_64.parquet | 2022-07-03T15:40:51 | 106703134       | "609fbb627ab76206b09c3a557b711aa0-6"  | NULL    | 212854              | 0    | page_index=false |
| .                                                                                                                                                                                            |
| .                                                                                                                                                                                            |
| .                                                                                                                                                                                            |
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+

I also tested with the page index

> copy (select * from hits limit 1000000) to '/tmp/copy.parquet';
+---------+
| count   |
+---------+
| 1000000 |
+---------+
1 row(s) fetched.
Elapsed 10.595 seconds.

> select count(*) from '/tmp/copy.parquet';
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.007 seconds.

> select * from metadata_cache() where path LIKE '%tmp%';
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
| path             | file_modified           | file_size_bytes | e_tag                         | version | metadata_size_bytes | hits | extra           |
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
| tmp/copy.parquet | 2025-08-11T19:35:45.467 | 88107710        | eed7a92-63c1c07a80906-5406abe | NULL    | 960669              | 19   | page_index=true |
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

And it works great!

cc @shehabgamin and @jonathanc-n

/// Updates the cache with a new memory limit in bytes.
fn update_cache_limit(&self, limit: usize);

/// Retrieves the information about the entries currently cached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice -- this is a very nice public API

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 11, 2025
@@ -0,0 +1,142 @@
<!---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the datafusion specific CLI functions to their own page


[`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md

## `metadata_cache`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the new documentation for metadata_cache

I think it looks pretty good rendered as well

Screenshot 2025-08-11 at 4 08 45 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb, the documentation looks good. Just a small note, the file_size_bytes, metadata_size_bytes, and hits have UInt64 type.

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, good and straightforward. thanks @nuno-faria

@alamb alamb merged commit 25ad99d into apache:main Aug 12, 2025
28 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 12, 2025

Thanks again @nuno-faria

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate documentation Improvements or additions to documentation execution Related to the execution crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet Metadata Cache] Add an API to review the contents of the Cache

3 participants