Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear what features are handled and where they should be handled. I think that we should consider a rewrite. Looking through the code there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.
I think that in principle a new reading backend should focus on the following two functions:
```python
def get_metadata(
    path: str,
    filters: list,
) -> dict:
```

This would return both:

- a list of `(filename, row-group-number)` pairs that satisfy the filters, to pass on to the next function
- a set of columns that could serve as a sorted index
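For concreteness, here is a sketch of the dict shape `get_metadata` might return. The field names and values are hypothetical, not an existing API:

```python
# Hypothetical shape of the metadata dict returned by get_metadata.
# Field names ("pieces", "index_candidates") are illustrative only.
metadata = {
    # (filename, row-group-number) pairs that survived the filters
    "pieces": [
        ("part.0.parquet", 0),
        ("part.0.parquet", 1),
        ("part.1.parquet", 0),
    ],
    # columns whose statistics suggest they are globally sorted,
    # and so could serve as a sorted index
    "index_candidates": ["timestamp"],
}
```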
and then also the following function:
```python
def read_parquet_pieces(
    file: FileLike,
    partitions: List[int],
    columns: List[str],
    **kwargs: dict,  # backend-specific keyword arguments
) -> pandas.DataFrame:
```

This function would be given a file-like object and a list of partitions within that object, and would produce a single pandas-like DataFrame.
Other considerations, like index placement and whether or not to return a series, would not be handled by the backend; they would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend, passed through the kwargs of `read_parquet_pieces`.
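To make that division of labor concrete, here is a toy sketch of how backend-agnostic code might compose the two functions. It uses an in-memory stand-in for a real parquet dataset (no actual parquet I/O), and all names are illustrative; the point is only that filter pruning lives in `get_metadata`, raw reading lives in `read_parquet_pieces`, and index placement lives in the agnostic driver:

```python
from typing import List, Optional

import pandas as pd

# Toy in-memory "dataset": filename -> list of row-group DataFrames.
# A real backend would read these from parquet files on disk.
_DATA = {
    "part.0.parquet": [
        pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}),
        pd.DataFrame({"x": [3], "y": ["c"]}),
    ],
    "part.1.parquet": [
        pd.DataFrame({"x": [4], "y": ["d"]}),
    ],
}


def get_metadata(path: str, filters: list) -> dict:
    # This sketch ignores filters; a real backend would prune row
    # groups here using parquet column statistics.
    pieces = [
        (fn, i) for fn, groups in _DATA.items() for i in range(len(groups))
    ]
    return {"pieces": pieces, "index_candidates": ["x"]}


def read_parquet_pieces(
    file: str, partitions: List[int], columns: List[str], **kwargs
) -> pd.DataFrame:
    # Read only the requested row groups and columns.
    groups = [_DATA[file][i][columns] for i in partitions]
    return pd.concat(groups, ignore_index=True)


def read_parquet(
    path: str,
    filters: list,
    columns: List[str],
    index: Optional[str] = None,
) -> pd.DataFrame:
    """Backend-agnostic driver composing the two backend functions."""
    meta = get_metadata(path, filters)
    frames = [
        read_parquet_pieces(fn, [rg], columns) for fn, rg in meta["pieces"]
    ]
    df = pd.concat(frames, ignore_index=True)
    if index is not None:
        # Index placement happens here, outside the backend.
        df = df.set_index(index)
    return df
```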
Am I missing anything here? Again, my objective in this issue is to understand if there are reasons for the complexity of the current system other than history (which is a fine reason, I don't mean to throw stones).