
Add partition lengths to DataFrame metadata. #5633

@hadim


edited by TomAugspurger

Currently, partitions within a Dask DataFrame do not know their own length. Anything that needs the length of the DataFrame or of its partitions must compute it at runtime. The simplest case is len, which computes the graph and sums the length of each partition:

In [1]: import dask

In [8]: ts = dask.datasets.timeseries()

In [9]: len(ts)
Out[9]: 2592000

In [10]: ts.map_partitions(len).compute()
Out[10]:
0     86400
1     86400
2     86400
...
27    86400
28    86400
29    86400
dtype: int64

Certain data storage formats (e.g. parquet) know the number of rows and could provide that to Dask, if there were a way to store it. This is somewhat complicated though (#5633 (comment)).

original post below


Following this SO question: https://stackoverflow.com/questions/59023793/why-is-computing-the-shape-on-an-indexed-parquet-file-so-slow-in-dask/59034239#59034239

It would be nice to compute the length of a dataframe from its metadata (if available) instead of iterating over all partitions.
