Closed
Description
To reproduce:
- Given an S3 directory of ~2,000 Parquet files (~20 MB each), written by Spark in our case, without a _metadata summary file.
- Read it:

```python
dask_df = dataframe.read_parquet("s3://path/*parquet")
```

Result
read_parquet() spends several minutes in read_metadata.
Suggested fix
read_parquet() documentation for gather_statistics:
> gather_statistics : bool or None (default). Gather the statistics for each dataset partition. By default, this will only be done if the _metadata file is available. Otherwise, statistics will only be gathered if True, because the footer of every file will be parsed (which is very slow on some systems).
However, both the Arrow and FastParquet engines parse the footer of each file even when gather_statistics is None and the _metadata file doesn't exist.
The proposed fix is to change occurrences of `gather_statistics is not False` to `gather_statistics is True` so that the code follows the documented intent; a code search shows the affected occurrences.
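To illustrate the difference, here is a minimal, hypothetical sketch (not dask's actual code; the function name and parameters are invented for this example) of the decision the two predicates make when no _metadata file is present:

```python
# Hypothetical sketch of when read_parquet would parse every file footer
# to gather statistics, contrasting current vs. proposed predicates.

def should_gather_statistics(gather_statistics, has_metadata_file,
                             proposed_fix=False):
    """Decide whether to gather per-partition statistics.

    gather_statistics: True, False, or None (the default).
    has_metadata_file: whether a _metadata summary file exists.
    proposed_fix: apply the predicate suggested in this issue.
    """
    if has_metadata_file:
        # Cheap either way: the summary file already holds the statistics.
        return gather_statistics is not False
    if proposed_fix:
        # Proposed: only scan every footer when explicitly requested.
        return gather_statistics is True
    # Current behavior: the default (None) still triggers a full scan.
    return gather_statistics is not False

# Without _metadata, the default currently scans all ~2k footers...
assert should_gather_statistics(None, has_metadata_file=False) is True
# ...but under the proposed fix it would not.
assert should_gather_statistics(None, False, proposed_fix=True) is False
```

With the proposed predicate, the slow path is only taken when the caller explicitly passes `gather_statistics=True`, which matches the documentation quoted above.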