-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Description
Dask's read_parquet and to_parquet functions/methods both take an engine keyword argument specifying which parquet library to use. This currently supports:
"fastparquet""pyarrow""auto"(usesfastparquetif it is installed, otherwise falls back topyarrow)
The reason for defaulting to fastparquet over pyarrow is historical - at the time fastparquet was more mature and supported more of the functionality we needed in dask.
I no longer belive this is the case, and would like to propose switching the behavior of auto to default to pyarrow.
A few reasons:
- pandas'
read_parquetandto_parquetalso takes anenginekwarg with the same options, but the"auto"option defaults topyarrow, falling back tofastparquet(the opposite of dask). This difference is subtle and can be confusing to users. - In our recent parquet benchmarking and resilience testing we generally found the
pyarrowengine would scale to larger datasets better than thefastparquetengine, and more test cases would complete successfully when run with pyarrow than with fastparquet. - The
pyarrowlibrary has a larger development team maintaining it and seems to have more community buy-in going forward. This isn't a great reason to pick one solution over another alone, but is something we should consider.
I think we can smoothly make this change with the following plan:
- Add a warning for a few releases if
engine="auto"and bothfastparquetandpyarroware installed. This can be checked without importingpyarrowusingimportlib.util.find_spec. The warning will notify the user that the default behavior ofautowill change in a future release, and they should be explicit about what engine they wish to use. - After a few releases, change the default and remove the warning.
Reactions are currently unavailable