Our handling of parquet data is effective, but the code seems to be a bit of a scramble. It's not clear what features are handled and where they should be handled. I think that we should consider a rewrite. Looking through the code there are a few things in the design that I would like to change, but I'm not sure if they're there only due to history or if there is some reason why they were chosen.
I think that in principle a new reading backend should focus on the following two functions:
```python
def get_metadata(
    path: str,
    filters: list,
) -> dict:
```

This would return both:

- a list of `(filename, row-group-number)` pairs that satisfy the filters, to pass on to the next function
- a set of columns that could serve as a sorted index
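For concreteness, here is a sketch of the dict shape `get_metadata` might return. The field names and values are hypothetical, not an existing API:

```python
# Hypothetical shape of the metadata dict returned by get_metadata.
# Field names ("pieces", "index_candidates") are illustrative only.
metadata = {
    # (filename, row-group-number) pairs that survived the filters
    "pieces": [
        ("part.0.parquet", 0),
        ("part.0.parquet", 1),
        ("part.1.parquet", 0),
    ],
    # columns whose statistics suggest they are globally sorted,
    # and so could serve as a sorted index
    "index_candidates": ["timestamp"],
}
```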
and then also the following function:
```python
def read_parquet_pieces(
    file: FileLike,
    partitions: List[int],
    columns: List[str],
    **kwargs: dict,  # backend-specific keyword arguments
) -> pandas.DataFrame:
```

This function would be given a file-like object and a list of partitions within that object, and would produce a single pandas-like DataFrame.
Other considerations, like index placement and whether or not to return a series, would not be handled by the backend; they would instead be handled by the backend-agnostic code. Options like whether or not to return categoricals would be left up to the backend, passed through the kwargs of `read_parquet_pieces`.
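To make that division of labor concrete, here is a toy sketch of how backend-agnostic code might compose the two functions. It uses an in-memory stand-in for a real parquet dataset (no actual parquet I/O), and all names are illustrative; the point is only that filter pruning lives in `get_metadata`, raw reading lives in `read_parquet_pieces`, and index placement lives in the agnostic driver:

```python
from typing import List, Optional

import pandas as pd

# Toy in-memory "dataset": filename -> list of row-group DataFrames.
# A real backend would read these from parquet files on disk.
_DATA = {
    "part.0.parquet": [
        pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}),
        pd.DataFrame({"x": [3], "y": ["c"]}),
    ],
    "part.1.parquet": [
        pd.DataFrame({"x": [4], "y": ["d"]}),
    ],
}


def get_metadata(path: str, filters: list) -> dict:
    # This sketch ignores filters; a real backend would prune row
    # groups here using parquet column statistics.
    pieces = [
        (fn, i) for fn, groups in _DATA.items() for i in range(len(groups))
    ]
    return {"pieces": pieces, "index_candidates": ["x"]}


def read_parquet_pieces(
    file: str, partitions: List[int], columns: List[str], **kwargs
) -> pd.DataFrame:
    # Read only the requested row groups and columns.
    groups = [_DATA[file][i][columns] for i in partitions]
    return pd.concat(groups, ignore_index=True)


def read_parquet(
    path: str,
    filters: list,
    columns: List[str],
    index: Optional[str] = None,
) -> pd.DataFrame:
    """Backend-agnostic driver composing the two backend functions."""
    meta = get_metadata(path, filters)
    frames = [
        read_parquet_pieces(fn, [rg], columns) for fn, rg in meta["pieces"]
    ]
    df = pd.concat(frames, ignore_index=True)
    if index is not None:
        # Index placement happens here, outside the backend.
        df = df.set_index(index)
    return df
```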
Am I missing anything here? Again, my objective in this issue is to understand if there are reasons for the complexity of the current system other than history (which is a fine reason, I don't mean to throw stones).