I have a few points of feedback after working across different dask collections for benchmarks and examples.
Currently, each dask collection takes a partitioning argument (which one depends on the particular method being used) and exposes an attribute for querying the number of partitions:
| Collection | Specify partitions | Query number of partitions |
|---|---|---|
| `dask.array` | `chunks` | `numblocks` |
| `dask.bag` | `npartitions` / `chunkbytes` / `partition_size` | `npartitions` |
| `dask.dataframe` | `npartitions` / `chunksize` | `npartitions` |
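As the table shows, `dask.array` is the odd one out: to get a specific number of partitions you have to translate that count into chunk sizes yourself. A minimal sketch of that inversion (a hypothetical `chunks_for_npartitions` helper, not dask API, assuming only the first axis is split):

```python
import math

def chunks_for_npartitions(shape, npartitions):
    """Invert a target partition count into a dask.array-style ``chunks``
    tuple. Hypothetical helper for illustration: splits only the first
    axis, whereas dask itself may split any axis."""
    n = min(npartitions, shape[0])
    step = math.ceil(shape[0] / n)
    first_axis = tuple(
        min(step, shape[0] - i) for i in range(0, shape[0], step)
    )
    # Remaining axes are left as single chunks.
    return (first_axis,) + tuple((s,) for s in shape[1:])

# A (100,)-shaped array in 10 partitions -> ten chunks of 10 along axis 0.
print(chunks_for_npartitions((100,), 10))
```

A consistent `npartitions=` argument would let the collection do this bookkeeping instead of the user.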
1. When moving between the various `from_*` methods in arrays and bags/dataframes, I often pause to remember which argument specifies the number of partitions or chunk sizes. Thoughts on making the number of partitions a consistent argument (ideally `partitions=X` or `npartitions=X`) across all collections? For arrays, I almost always find myself inverting chunk sizes to arrive at a particular number of partitions (especially when I'm using the distributed scheduler). It would be comforting to know that I can create a collection and always use a `partitions` or `npartitions` argument whether I'm using `dask.bag.from_filenames` or `dask.array`; any reason this might not be feasible for all collections?
2. Assuming #1 is a valid point, thoughts on making the argument that specifies partitions the same as the attribute for querying the number of partitions? For `dask.array`, I need to remember `chunks` vs. `numblocks`.
3. It might be handy to show the number of partitions in the repr for each collection, e.g.:

   ```python
   >>> x
   dask.array<arange-..., shape=(100,), dtype=int64, chunksize=(2,), partitions=10>
   >>> y
   dd.Series<from-dask-arraycdefdd873cdf06131bcc8da67a0e65fb, divisions=(0, 2, 4, ..., 98, 99), partitions=10>
   >>> b
   <dask.bag.core.Bag at 0x11312e0f0, partitions=10>
   ```
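For the array repr, the proposed `partitions` count is already implicit in the existing metadata: it is the product of the per-axis block counts (what `numblocks` reports today). A hypothetical helper, for illustration only:

```python
from functools import reduce
from operator import mul

def npartitions_from_chunks(chunks):
    """Total block count for a dask.array-style ``chunks`` tuple: the
    number of chunks along each axis, multiplied together. This is the
    value a uniform ``partitions=`` field in the repr could display."""
    return reduce(mul, (len(axis_chunks) for axis_chunks in chunks), 1)

print(npartitions_from_chunks(((10,) * 10,)))       # 1-D: 10 blocks
print(npartitions_from_chunks(((50, 50), (5, 5))))  # 2-D: 2 * 2 = 4 blocks
```

Bags and dataframes already store `npartitions` directly, so surfacing it in their reprs would be a matter of formatting rather than new bookkeeping.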