-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Add the ability to review the contents of the Metadata Cache #17126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Merging up to main to try and get a clean CI run |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nuno-faria -- this is pretty amazing
I didn't really find any comments worth making
I think we do need to add this function to the docs (I will push a commit to this PR to do so)
DataFusion CLI v49.0.0
> select * from metadata_cache();
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
| path | file_modified | file_size_bytes | e_tag | version | metadata_size_bytes | hits | extra |
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
+------+---------------+-----------------+-------+---------+---------------------+------+-------+
0 row(s) fetched.
Elapsed 0.012 seconds.
> create external table hits stored as parquet location 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.623 seconds.
> select * from metadata_cache();
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+
| path | file_modified | file_size_bytes | e_tag | version | metadata_size_bytes | hits | extra |
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+
| hits_compatible/athena_partitioned/hits_61.parquet | 2022-07-03T15:40:34 | 117270944 | "5db11cad1ca0d80d748fc92c914b010a-6" | NULL | 212949 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_32.parquet | 2022-07-03T15:37:17 | 94506004 | "2f7db49a9fe242179590b615b94a39d2-5" | NULL | 278157 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_40.parquet | 2022-07-03T15:38:07 | 142508647 | "9e5852b45a469d5a05bf270a286eab8a-8" | NULL | 212917 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_93.parquet | 2022-07-03T15:44:07 | 127987774 | "751100bf0dac7d489b9836abf3108b99-7" | NULL | 278318 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_80.parquet | 2022-07-03T15:42:40 | 135714790 | "bceeea2c3642d594bccbd9c61eae87c4-7" | NULL | 214379 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_11.parquet | 2022-07-03T15:35:03 | 118419888 | "6875a9903c983367f77ec161244509d0-6" | NULL | 279667 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_77.parquet | 2022-07-03T15:42:18 | 214853191 | "49de48cc5ada10be63c82908e484d811-12" | NULL | 213571 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_20.parquet | 2022-07-03T15:36:02 | 85766533 | "0076ea5ecc5ddc85fdcf6893657640d0-4" | NULL | 213897 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_22.parquet | 2022-07-03T15:36:14 | 79775901 | "d620286f35855aec0a946cdc798c3742-4" | NULL | 213150 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_15.parquet | 2022-07-03T15:35:29 | 103098894 | "71fb3de3efd55e9a7b388ec6e9b796d7-5" | NULL | 279137 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_86.parquet | 2022-07-03T15:43:20 | 94791732 | "9cc7e64c9d72559bc566dc1a17c3e5b2-5" | NULL | 278313 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_19.parquet | 2022-07-03T15:35:54 | 103692598 | "4f96ce109a7cc4f33fe043983ac6778d-5" | NULL | 148569 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_2.parquet | 2022-07-03T15:34:04 | 230595491 | "6fb3dd8baa68a0cacd5330fdb021bc95-12" | NULL | 344551 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_43.parquet | 2022-07-03T15:38:25 | 299692947 | "b9566650354c86db5d5e582056f63ddd-16" | NULL | 278440 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_79.parquet | 2022-07-03T15:42:34 | 183779939 | "d49cfec843a49ba390f61848476ea6d5-10" | NULL | 278285 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_67.parquet | 2022-07-03T15:41:12 | 244748637 | "b8298ac683ebe7759fba857a55d0487a-13" | NULL | 344741 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_27.parquet | 2022-07-03T15:36:45 | 166286210 | "eafc2f587eb46b53f8392c8159bda44a-9" | NULL | 213047 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_50.parquet | 2022-07-03T15:39:14 | 245339079 | "fe72aacbb2eaec8d0ca1f367f0f2669c-13" | NULL | 279047 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_23.parquet | 2022-07-03T15:36:19 | 79631107 | "9f4df9af52602db8856bc3b77d42bf9e-4" | NULL | 148405 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_37.parquet | 2022-07-03T15:37:47 | 108247781 | "eb6dbd8380e9a35e3f1582363f380144-6" | NULL | 277956 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_13.parquet | 2022-07-03T15:35:16 | 146132022 | "a4b319a220d36c4a7afd5937fd57d7bd-8" | NULL | 213479 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_71.parquet | 2022-07-03T15:41:40 | 152098737 | "1f0a71f92e6bda321a6ce1d1a37a074f-8" | NULL | 278595 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_56.parquet | 2022-07-03T15:39:59 | 169373476 | "09c457b7655501621ba719289623d875-9" | NULL | 279031 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_82.parquet | 2022-07-03T15:42:54 | 144222581 | "4964583a34787eee74e59a0c2b4bee71-8" | NULL | 213079 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_65.parquet | 2022-07-03T15:40:59 | 231113014 | "0b8fbc587653908deb7ab876b3a6a011-13" | NULL | 279175 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_95.parquet | 2022-07-03T15:44:20 | 126081196 | "fd5fa352f6c25b4bf53412a853d1c7fb-7" | NULL | 278851 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_76.parquet | 2022-07-03T15:42:13 | 140789919 | "830b5d76686107fe86dbc36cff7b8e8c-7" | NULL | 278603 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_98.parquet | 2022-07-03T15:44:39 | 151992636 | "48a33cf1420c7d00df896779e1503344-8" | NULL | 278533 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_70.parquet | 2022-07-03T15:41:33 | 128851017 | "98e2d2b37cfb3cf0a6c9fdaced1e393a-7" | NULL | 148265 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_48.parquet | 2022-07-03T15:45:03 | 33680419 | "3c367e7309bfbb725394c734f167b566-1" | NULL | 147517 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_30.parquet | 2022-07-03T15:37:03 | 124187913 | "75daeb174d923f79f2a2305f5e708594-7" | NULL | 148035 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_69.parquet | 2022-07-03T15:41:26 | 154993686 | "6b5c8c13867d63bbd7c3b8b71b960783-8" | NULL | 213991 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_96.parquet | 2022-07-03T15:44:26 | 184175027 | "3d23cba352ce139a53723b82681b46a1-10" | NULL | 278688 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_81.parquet | 2022-07-03T15:42:48 | 130504343 | "9e570600d6285107050f1fb52d613383-7" | NULL | 213184 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_12.parquet | 2022-07-03T15:35:10 | 149514164 | "8158382c2c57876fe8149d15aa815250-8" | NULL | 213679 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_74.parquet | 2022-07-03T15:41:58 | 160976888 | "5938ea7a3488e33f3ca9c8f02f83fd4b-9" | NULL | 212911 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_60.parquet | 2022-07-03T15:40:27 | 143594342 | "bafedb3d11aff2bd7077ebde5bc27d0b-8" | NULL | 213597 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_84.parquet | 2022-07-03T15:43:07 | 148865810 | "f80b7b9a306c0102cb2a92f11c937a88-8" | NULL | 278646 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_36.parquet | 2022-07-03T15:37:42 | 92487304 | "0c4410bac653f3b37003f58de85b0fc8-5" | NULL | 212847 | 0 | page_index=false |
| hits_compatible/athena_partitioned/hits_64.parquet | 2022-07-03T15:40:51 | 106703134 | "609fbb627ab76206b09c3a557b711aa0-6" | NULL | 212854 | 0 | page_index=false |
| . |
| . |
| . |
+----------------------------------------------------+---------------------+-----------------+---------------------------------------+---------+---------------------+------+------------------+I also tested with the page index
> copy (select * from hits limit 1000000) to '/tmp/copy.parquet';
+---------+
| count |
+---------+
| 1000000 |
+---------+
1 row(s) fetched.
Elapsed 10.595 seconds.
> select count(*) from '/tmp/copy.parquet';
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> select * from metadata_cache() where path LIKE '%tmp%';
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
| path | file_modified | file_size_bytes | e_tag | version | metadata_size_bytes | hits | extra |
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
| tmp/copy.parquet | 2025-08-11T19:35:45.467 | 88107710 | eed7a92-63c1c07a80906-5406abe | NULL | 960669 | 19 | page_index=true |
+------------------+-------------------------+-----------------+-------------------------------+---------+---------------------+------+-----------------+
1 row(s) fetched.
Elapsed 0.010 seconds.And it works great!
cc @shehabgamin and @jonathanc-n
| /// Updates the cache with a new memory limit in bytes. | ||
| fn update_cache_limit(&self, limit: usize); | ||
|
|
||
| /// Retrieves the information about the entries currently cached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice -- this is a very nice public API
…usion into cache_metadata_info_api
| @@ -0,0 +1,142 @@ | |||
| <!--- | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the datafusion specific CLI functions to their own page
|
|
||
| [`page index`]: https://github.com/apache/parquet-format/blob/master/PageIndex.md | ||
|
|
||
| ## `metadata_cache` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb, the documentation looks good. Just a small note, the file_size_bytes, metadata_size_bytes, and hits have UInt64 type.
jonathanc-n
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, good and straightforward. thanks @nuno-faria
|
Thanks again @nuno-faria |

Which issue does this PR close?
Rationale for this change
Allowing the user to be able to check what metadata is currently cached improves the transparency of the caching, making it easier to debug any issues. This can be done by accessing
FileMetadataCache::list_entriesor directly indatafusion-clithrough themetadata_cachefunction. Example:What changes are included in this PR?
FileMetadataCacheto allow users to check the cache contents withlist_entries;FileMetadatato add the ability for implementers to provideextra_info.CachedParquetMetaDatato implementextra_info. In this case, it returns whether the cached metadata contains the page index.DefaultFilesMetadataCacheto keep track of cache hits and to implementlist_entries.metadata_cacheUDF indatafusion-cli.Are these changes tested?
Yes.
Are there any user-facing changes?
New default UDF in
datafusion-cli.