Refactor parquet reader/writer #4329

@mrocklin

Description

Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear which features are handled and where they should be handled. I think that we should consider a rewrite. Looking through the code, there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.

I think that in principle a new reading backend should focus on the following two functions:

def get_metadata(
    path: str,
    filters: list,
) -> dict:
    ...

This would return both ...

  • a list of (filename, row-group-number) pairs that satisfy the filters, to pass on to the next function
  • a set of columns that could serve as a sorted index
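
For concreteness, here is a hedged sketch of what that returned dict might look like. The key names ("pieces", "index_candidates") are illustrative only, not an existing dask API:

```python
# Hypothetical shape of the dict returned by get_metadata.
# Key names are made up for illustration.
metadata = {
    # (filename, row-group-number) pairs that survived the filters
    "pieces": [
        ("data/part.0.parquet", 0),
        ("data/part.0.parquet", 1),
        ("data/part.1.parquet", 0),
    ],
    # columns whose row-group statistics show they are sorted,
    # so any of them could serve as a sorted index
    "index_candidates": {"timestamp", "id"},
}
```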

and then also the following function:

def read_parquet_pieces(
    file: FileLike,
    partitions: List[int],
    columns: List[str],
    **kwargs,  # backend-specific keyword arguments
) -> pandas.DataFrame:
    ...

This function would be given a file-like object and a list of partitions within that object and would produce a single pandas-like DataFrame.

Other considerations, like index placement and whether or not to return a series, would not be handled by the backend, but would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend and be handled by kwargs in read_parquet_pieces.

Am I missing anything here? Again, my objective in this issue is to understand whether there are reasons for the complexity of the current system other than history (which is a fine reason; I don't mean to throw stones).
