Skip to content

Change parquet with engine="auto" to default to pyarrow #8900

@jcrist

Description

@jcrist

Dask's read_parquet and to_parquet functions/methods both take an engine keyword argument specifying which parquet library to use. This currently supports:

  • "fastparquet"
  • "pyarrow"
  • "auto" (uses fastparquet if it is installed, otherwise falls back to pyarrow)

The reason for defaulting to fastparquet over pyarrow is historical - at the time fastparquet was more mature and supported more of the functionality we needed in dask.

I no longer belive this is the case, and would like to propose switching the behavior of auto to default to pyarrow.

A few reasons:

  • pandas' read_parquet and to_parquet also takes an engine kwarg with the same options, but the "auto" option defaults to pyarrow, falling back to fastparquet (the opposite of dask). This difference is subtle and can be confusing to users.
  • In our recent parquet benchmarking and resilience testing we generally found the pyarrow engine would scale to larger datasets better than the fastparquet engine, and more test cases would complete successfully when run with pyarrow than with fastparquet.
  • The pyarrow library has a larger development team maintaining it and seems to have more community buy-in going forward. This isn't a great reason to pick one solution over another alone, but is something we should consider.

I think we can smoothly make this change with the following plan:

  • Add a warning for a few releases if engine="auto" and both fastparquet and pyarrow are installed. This can be checked without importing pyarrow using importlib.util.find_spec. The warning will notify the user that the default behavior of auto will change in a future release, and they should be explicit about what engine they wish to use.
  • After a few releases, change the default and remove the warning.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions