Dask parquet metadata w/ ~2k files very slow #5272

@talebzeghmi

Description

To reproduce:

  • Given an S3 directory of ~2,000 parquet files (~20 MB each) with no _metadata summary file; in our case the files were written by Spark.
  • Read it:

from dask import dataframe

dask_df = dataframe.read_parquet("s3://path/*parquet")

Result

The read_metadata step alone takes several minutes.

Suggested fix

The read_parquet() documentation for gather_statistics reads:

gather_statistics : bool or None (default).
    Gather the statistics for each dataset partition. By default,
    this will only be done if the _metadata file is available. Otherwise,
    statistics will only be gathered if True, because the footer of
    every file will be parsed (which is very slow on some systems).

Despite this, both the Arrow and FastParquet engines read the footer of every file when gather_statistics is None and the _metadata file does not exist.

The proposed fix is to change occurrences of gather_statistics is not False to gather_statistics is True, so the code follows the documented intent: search of occurrences
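The intended behavior can be sketched as a small decision helper. This is an illustrative sketch only: should_gather_statistics is a hypothetical name, not the actual dask source.

```python
# Hypothetical helper mirroring the documented intent of
# read_parquet's gather_statistics option (not the real dask code).
def should_gather_statistics(gather_statistics, metadata_file_exists):
    if metadata_file_exists:
        # A _metadata summary file makes statistics cheap to gather,
        # so the default (None) opts in; only an explicit False opts out.
        return gather_statistics is not False
    # Without _metadata, the current check (`is not False`) also lets
    # None fall through to parsing every file footer, which is slow.
    # The proposed check gathers statistics only when explicitly asked:
    return gather_statistics is True
```

Under this sketch, reading a Spark-written directory with no _metadata file and the default gather_statistics=None would skip footer parsing entirely, while users who want statistics can still pass gather_statistics=True.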
