
Add partition lengths to DataFrame metadata. #5633

@hadim


edited by TomAugspurger

Currently, partitions within a Dask DataFrame do not know their own length. Anything that needs the length of the DataFrame or of its partitions must compute it at runtime. The simplest case is len, which computes the graph and sums the length of each partition:

In [1]: import dask

In [8]: ts = dask.datasets.timeseries()

In [9]: len(ts)
Out[9]: 2592000

In [10]: ts.map_partitions(len).compute()
Out[10]:
0     86400
1     86400
2     86400
...
27    86400
28    86400
29    86400
dtype: int64

Certain data storage formats (e.g. parquet) know the number of rows and could provide that to Dask, if there were a way to store it. This is somewhat complicated though (#5633 (comment)).

original post below


Following this SO question: https://stackoverflow.com/questions/59023793/why-is-computing-the-shape-on-an-indexed-parquet-file-so-slow-in-dask/59034239#59034239

It would be nice to compute the length of a dataframe from its metadata (if available) instead of iterating over all partitions.
