Revisit gather_statistics in read_parquet #6389

@martindurant

Loading metadata from parquet files in the client can be costly; parsing the statistics and looking for ordering can also be costly.

  • We should limit when we load metadata at all: either from a global metadata file or, failing that, from the first data file.
  • If the pandas metadata says there is an index, or the user specifies one, we need to parse the statistics for that column only; we should not try to guess an index.
  • While parsing, we can tell early when ordering won't work, and bail out at that point.
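
As a rough illustration of the "bail early" point, the sketch below walks per-row-group (min, max) statistics for a single index column and stops as soon as ordering is ruled out. The function name `sorted_divisions` and the plain-tuple input are assumptions made for illustration; this is not dask's actual implementation or API.

```python
from typing import Iterable, List, Optional, Tuple

Stat = Tuple[Optional[int], Optional[int]]  # (min, max) for one row group


def sorted_divisions(stats: Iterable[Stat]) -> Optional[List[int]]:
    """Return divisions if the row groups are ordered, else None.

    Hypothetical helper: consumes statistics lazily and returns as soon
    as a missing statistic or an overlap shows that ordering cannot hold.
    """
    divisions: List[int] = []
    last_max: Optional[int] = None
    for lo, hi in stats:
        if lo is None or hi is None:
            return None  # statistics missing: give up immediately
        if last_max is not None and lo < last_max:
            return None  # overlaps the previous row group: not ordered
        divisions.append(lo)
        last_max = hi
    if last_max is None:
        return None  # no row groups at all
    divisions.append(last_max)
    return divisions


ordered = sorted_divisions([(0, 4), (5, 9)])
unordered = sorted_divisions([(0, 4), (3, 9), (10, 12)])
```

Because the input is consumed as an iterator, statistics for later row groups never need to be parsed once an overlap is found.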

As stated, this implies that any filters will be applied at read time. We would not proactively cut down on the number of partitions, so some partitions might come back empty, but that work is done on the workers rather than on the client.
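
To make the consequence concrete, here is a minimal sketch of filtering at read time, using in-memory lists of rows as stand-ins for partitions. The `read_partition` function and the row representation are hypothetical; only the `(column, op, value)` filter shape mirrors what `read_parquet` accepts.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Filter = Tuple[str, str, Any]  # (column, op, value)

_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def read_partition(rows: List[Row], filters: List[Filter]) -> List[Row]:
    """Stand-in for a worker task: 'load' one partition, then keep only
    the rows that satisfy every filter (filters are ANDed together)."""
    return [
        row
        for row in rows
        if all(_OPS[op](row[col], val) for col, op, val in filters)
    ]


# Two partitions; no statistics were consulted, so both are read.
partitions = [
    [{"x": 1}, {"x": 2}],
    [{"x": 7}, {"x": 9}],
]
results = [read_partition(p, [("x", ">", 5)]) for p in partitions]
```

In this sketch the first partition comes back empty. The client never inspected its statistics, so the cost of discovering that falls on whichever worker ran the task.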
