
Share StatisticsCache at the session level #7556

@Ted-Jiang

Is your feature request related to a problem or challenge?

In our system, we pass a logical plan to DataFusion with statistics collection enabled. The source table lives in remote storage, and reading the Parquet metadata to collect statistics sometimes takes a few seconds.
From the log:

 datafusion::datasource::listing::table: Not hit cache infer_stats ObjectMeta { location: Path { raw: "working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet" }, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s 

So I checked the code and saw that a cache called StatisticsCache is constructed here:
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems that every time a plan is built, a new empty cache is inserted, so the cache only helps when the same file's statistics are inferred more than once within the same plan.

So I want to share the statistics cache at the session level 😄 so that planning does not depend on the unstable latency of fetching remote file statistics. I think many other query engines do this too.

Describe the solution you'd like

Add a cache manager that owns all caches for the lifetime of the session:
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50

Use the SessionState to pass the cached results to each plan.
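
To make the proposal concrete, here is a rough sketch of the idea using only the Rust standard library. It is not DataFusion's actual API: `FileKey`, `FileStatistics`, `CacheManager`, `plan_query`, and `simulate_two_plans` are hypothetical names, and the `StatisticsCache` below is a simplified stand-in for the real one. The point is only that a cache held behind an `Arc` by a session-lived manager lets a second plan reuse statistics inferred by the first.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache key: path plus modification time, so that a rewritten
/// file does not reuse stale statistics.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct FileKey {
    pub path: String,
    pub last_modified_nanos: i64,
}

/// Hypothetical, simplified stand-in for per-file statistics.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStatistics {
    pub num_rows: usize,
    pub total_byte_size: usize,
}

/// Simplified stand-in for the statistics cache; safe to share across plans.
#[derive(Default)]
pub struct StatisticsCache {
    entries: Mutex<HashMap<FileKey, FileStatistics>>,
}

impl StatisticsCache {
    /// Returns the cached statistics, or runs `infer` (the expensive remote
    /// metadata read) on a miss and stores the result.
    pub fn get_or_infer<F>(&self, key: &FileKey, infer: F) -> FileStatistics
    where
        F: FnOnce() -> FileStatistics,
    {
        // The lock guard is dropped at the end of this statement, so the
        // cache is not locked while `infer` runs.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let stats = infer();
        self.entries
            .lock()
            .unwrap()
            .insert(key.clone(), stats.clone());
        stats
    }
}

/// Hypothetical session-lived owner of all caches.
#[derive(Clone, Default)]
pub struct CacheManager {
    file_statistics: Arc<StatisticsCache>,
}

impl CacheManager {
    /// Hands out a shared handle to the same underlying cache.
    pub fn file_statistics(&self) -> Arc<StatisticsCache> {
        Arc::clone(&self.file_statistics)
    }
}

/// Simulates planning one query over `files` and returns how many files
/// needed a fresh (expensive) statistics inference.
pub fn plan_query(cache: &StatisticsCache, files: &[FileKey]) -> usize {
    let mut misses = 0;
    for file in files {
        cache.get_or_infer(file, || {
            misses += 1;
            // Placeholder values; a real implementation would read the
            // Parquet footer here.
            FileStatistics { num_rows: 0, total_byte_size: 0 }
        });
    }
    misses
}

/// Plans the same query twice against one session-level cache and returns
/// the number of misses seen by each plan.
pub fn simulate_two_plans() -> (usize, usize) {
    let manager = CacheManager::default();
    let files = vec![FileKey {
        path: "part-00000.parquet".to_string(),
        last_modified_nanos: 0,
    }];
    let first = plan_query(&manager.file_statistics(), &files);
    let second = plan_query(&manager.file_statistics(), &files);
    (first, second)
}
```

In this sketch the first plan pays for the inference and the second plan hits the cache. Passing a fresh `StatisticsCache::default()` to every `plan_query` call instead reproduces the current behavior described above, where each plan misses again.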

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)
