Implement statistics support for Substrait #8698

@alamb

Is your feature request related to a problem or challenge?

A report from Twitter (https://twitter.com/mim_djo/status/1740542585410814393) says:

a new release of #datafusion 34, still reading #Deltatable via arrow is suboptimal compared to reading Parquet Directly :( something to do with passing stats to get correct join orders.

I think the issue is that #7949 and #7950 rely on statistics to pick reasonable join orders for TPC-H queries.

These statistics do not seem to be available from the Delta provider.

@andygrove says:

RelCommon (common to all operators in Substrait) can contain a hint that has stats

    message Stats {
      double row_count = 1;
      double record_size = 2;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }

Describe the solution you'd like

I would like the DataFusion Substrait consumer/producer to handle translating statistics to and from the Substrait Stats hint shown above.
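
To make the request a bit more concrete, here is a rough sketch of what a two-way translation could look like. Everything in it is an assumption for illustration: `SubstraitStats` is a hand-written mirror of the `Stats` message quoted above (without `advanced_extension`), `TableStats` is a simplified stand-in for DataFusion's table-level statistics rather than the real type, and `to_substrait` / `from_substrait` are hypothetical helper names, not existing functions in the substrait crate.

```rust
/// Hypothetical mirror of the Substrait `Stats` hint quoted above.
/// `advanced_extension` is left out to keep the sketch small.
#[derive(Debug, Clone, PartialEq)]
pub struct SubstraitStats {
    pub row_count: f64,
    pub record_size: f64,
}

/// Simplified stand-in for table-level statistics on the DataFusion side.
/// This is NOT the real DataFusion type; `None` means "unknown".
#[derive(Debug, Clone, PartialEq)]
pub struct TableStats {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Producer direction: build a hint only when the row count is known.
/// The average record size is derived as total bytes / rows; it is left
/// at 0.0 (assumed here to mean "unknown") when it cannot be computed.
pub fn to_substrait(stats: &TableStats) -> Option<SubstraitStats> {
    let rows = stats.num_rows?;
    let record_size = match stats.total_byte_size {
        Some(bytes) if rows > 0 => bytes as f64 / rows as f64,
        _ => 0.0,
    };
    Some(SubstraitStats {
        row_count: rows as f64,
        record_size,
    })
}

/// Consumer direction: recover statistics from a hint.
/// Assumption: non-finite or non-positive values are treated as unknown,
/// so an empty table does not survive a round trip in this sketch.
pub fn from_substrait(hint: &SubstraitStats) -> TableStats {
    let known = |v: f64| v.is_finite() && v > 0.0;
    let num_rows = known(hint.row_count).then(|| hint.row_count.round() as usize);
    let total_byte_size = match num_rows {
        Some(rows) if known(hint.record_size) => {
            Some((hint.record_size * rows as f64).round() as usize)
        }
        _ => None,
    };
    TableStats {
        num_rows,
        total_byte_size,
    }
}
```

With something along these lines, a producer could attach the hint when serializing a plan and a consumer could feed the recovered row count back into join ordering. A real implementation would also need to decide how to carry exact versus estimated values and per-column statistics, which the `Stats` message does not cover directly.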

Describe alternatives you've considered

No response

Additional context

This was brought up by @Dandandan on the ASF slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1703885214702039

Labels: enhancement (New feature or request), substrait (Changes to the substrait crate)