Change parquet with `engine="auto"` to default to `pyarrow`

Dask's `read_parquet` and `to_parquet` functions/methods both take an `engine` keyword argument specifying which parquet library to use. This currently supports:

- `"fastparquet"`
- `"pyarrow"`
- `"auto"`  (uses `fastparquet` if it is installed, otherwise falls back to `pyarrow`)

The reason for defaulting to `fastparquet` over `pyarrow` is historical - at the time `fastparquet` was more mature and supported more of the functionality we needed in dask.

I no longer belive this is the case, and would like to *propose switching the behavior of `auto` to default to `pyarrow`*.

A few reasons:

- pandas' `read_parquet` and `to_parquet` also takes an `engine` kwarg with the same options, but the `"auto"` option defaults to `pyarrow`, falling back to `fastparquet` (the opposite of dask). This difference is subtle and can be confusing to users.
- In our recent parquet benchmarking and resilience testing we generally found the `pyarrow` engine would scale to larger datasets better than the `fastparquet` engine, and more test cases would complete successfully when run with pyarrow than with fastparquet.
- The `pyarrow` library has a larger development team maintaining it and seems to have more community buy-in going forward. This isn't a great reason to pick one solution over another alone, but is something we should consider.

I think we can smoothly make this change with the following plan:

- Add a warning for a few releases if `engine="auto"` and both `fastparquet` and `pyarrow` are installed. This can be checked without importing `pyarrow` using `importlib.util.find_spec`. The warning will notify the user that the default behavior of `auto` will change in a future release, and they should be explicit about what engine they wish to use.
- After a few releases, change the default and remove the warning.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Change parquet with `engine="auto"` to default to `pyarrow` #8900

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Change parquet with engine="auto" to default to pyarrow #8900

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Change parquet with `engine="auto"` to default to `pyarrow` #8900