Description
Edited by TomAugspurger:
Currently, partitions within a dask DataFrame do not know their own lengths. Anything that uses the length of the DataFrame or of its partitions must compute it at runtime. The simplest case is `len`, which computes the graph and sums the lengths of the partitions:
```python
In [1]: import dask

In [8]: ts = dask.datasets.timeseries()

In [9]: len(ts)
Out[9]: 2592000

In [10]: ts.map_partitions(len).compute()
Out[10]:
0     86400
1     86400
2     86400
      ...
27    86400
28    86400
29    86400
dtype: int64
```

Certain data storage formats (e.g. Parquet) know the number of rows and could provide that to Dask, if there were a way to store it. This is somewhat complicated, though (#5633 (comment)).
Original post below.
Following this SO question: https://stackoverflow.com/questions/59023793/why-is-computing-the-shape-on-an-indexed-parquet-file-so-slow-in-dask/59034239#59034239

It would be nice to compute the length of a DataFrame from its metadata (if available) instead of iterating over all partitions.