Description
Is your feature request related to a problem or challenge?
In our system, we pass a logical plan to DataFusion with statistics collection enabled. The source table is on remote storage, and it sometimes takes a few seconds to read the Parquet metadata to collect statistics.
From the log:
datafusion::datasource::listing::table: Not hit cache infer_stats ObjectMeta { location: Path { raw: "working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet" }, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s
So I checked the code and found a cache called StatisticsCache, which is constructed here:
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems that every time a plan is built, a new empty cache is inserted, so only repeated statistics inference for the same file within the same plan benefits from it.
So I want to share the statistics cache at the session level 😄 so that the unstable cost of fetching remote file statistics is paid only once per file. I think many other query engines do this too.
Describe the solution you'd like
Add a cache manager that handles all caches for the lifetime of the session.
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50
Use the SessionState to pass cached results to each plan.
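To make the idea concrete, here is a minimal sketch of what a session-level statistics cache could look like. It uses only the standard library, and every name in it (`FileKey`, `FileStats`, `SessionStatsCache`, `get_or_infer`) is hypothetical and not an existing DataFusion API. It assumes that a file's path, last-modified time, and size are enough to detect a changed file.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache key. Including the modification time and size means
/// a rewritten file misses the cache instead of returning stale statistics.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct FileKey {
    pub path: String,
    pub last_modified_nanos: i64,
    pub size: u64,
}

/// Hypothetical, simplified stand-in for per-file statistics.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStats {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// A cache meant to be created once per session and shared (via `Arc`)
/// with every plan built in that session.
#[derive(Default)]
pub struct SessionStatsCache {
    entries: Mutex<HashMap<FileKey, Arc<FileStats>>>,
}

impl SessionStatsCache {
    /// Returns cached statistics for `key`, or runs `infer` (the expensive
    /// remote metadata read) and stores its result.
    pub fn get_or_infer<F>(&self, key: &FileKey, infer: F) -> Arc<FileStats>
    where
        F: FnOnce() -> FileStats,
    {
        // The lock guard is dropped at the end of this statement, so the
        // slow `infer` call below never runs while holding the lock.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let stats = Arc::new(infer());
        let mut guard = self.entries.lock().unwrap();
        // If another thread inserted first, keep and return its entry.
        guard.entry(key.clone()).or_insert(stats).clone()
    }

    /// Number of files with cached statistics.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

With something like this held by the session and handed to each table scan, the second plan that touches the same unchanged file would skip the remote metadata read. Two threads that miss at the same time may both run `infer`, which trades a little duplicate work for not blocking on the lock.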
Describe alternatives you've considered
No response
Additional context
No response