Revisit gather_statistics in read_parquet

Loading metadata from parquet files on the client can be costly, as can parsing the statistics and checking them for ordering.

- We should limit when we load metadata at all; when we do, read it from a global metadata file if there is one, otherwise from the first data file
- If the pandas metadata says there is an index, or the user specifies an index, we need to parse the statistics for that column only; do not try to guess an index
- While parsing, we can often tell early that the ordering won't work, in which case we should bail out immediately
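
To make the last point concrete, here is a minimal sketch of an early-bailing ordering check. It is not the existing `gather_statistics` code: the function name `index_divisions` and the input shape (one `(min, max)` pair per row group for the index column, or `None` where statistics are missing) are assumptions made for illustration, as is the choice to allow a row group's minimum to equal the previous maximum.

```python
from typing import Any, Iterable, List, Optional, Tuple


def index_divisions(
    stats: Iterable[Optional[Tuple[Any, Any]]],
) -> Optional[List[Any]]:
    """Return divisions for the index column, or None if ordering can't be used.

    `stats` yields one (min, max) pair per row group, in file order,
    or None for a row group without statistics (hypothetical input shape).
    """
    divisions: List[Any] = []
    prev_max = None
    for entry in stats:
        # Missing statistics: ordering can't be established, stop parsing now.
        if entry is None:
            return None
        lo, hi = entry
        if lo is None or hi is None:
            return None
        # This row group starts before the previous one ends: not sorted,
        # so bail without looking at the remaining row groups.
        if prev_max is not None and lo < prev_max:
            return None
        divisions.append(lo)
        prev_max = hi
    if prev_max is None:
        # No row groups at all.
        return None
    divisions.append(prev_max)
    return divisions
```

Because `stats` can be a lazy iterable, statistics for later row groups never need to be parsed once an earlier one rules out ordering.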

As stated, this implies that any filters will be applied at read time: we *don't* proactively cut down on partitions, so some partitions might come back empty, but that work is done on the worker rather than on the client.
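
A toy illustration of that trade-off, using plain Python lists in place of real parquet partitions (the `read_partition` helper and the data are hypothetical, not part of any library API):

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def read_partition(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Stand-in for a per-partition read task that filters as it reads."""
    return [row for row in rows if predicate(row)]


# Two partitions; no statistics are consulted up front, so both are kept.
partitions = [
    [{"x": 1}, {"x": 2}],
    [{"x": 10}, {"x": 11}],
]

# Each task applies the filter itself (on a worker, in the real setting).
results = [read_partition(part, lambda row: row["x"] >= 10) for part in partitions]

# The partition count is unchanged; the first partition is simply empty.
print(len(results), [len(part) for part in results])
```

The client never inspects per-partition statistics here; the cost is one task that returns no rows.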

Revisit gather_statistics in read_parquet #6389
