Consistency in collections API for partitions #1008

@koverholt

I have a few points of feedback after working across different dask collections for benchmarks and examples.

Currently, dask collections have a partitioning argument (depending on the particular method being used) and an attribute to show the number of partitions:

| Collection | Specify partitions | Query number of partitions |
| --- | --- | --- |
| dask.array | chunks | numblocks |
| dask.bag | npartitions/chunkbytes/partition_size | npartitions |
| dask.dataframe | npartitions/chunksize | npartitions |

  1. When moving between the different from_* methods for arrays and bags/dataframes, I often have to pause to remember which argument specifies the number of partitions or the chunk size. Thoughts on making the number of partitions a consistent argument (ideally partitions=X or npartitions=X) across all collections? For arrays, I almost always find myself working backwards from a desired number of partitions to the chunk sizes that produce it (especially when I'm using the distributed scheduler). It would be comforting to know that I can create a collection and always use a partitions or npartitions argument, whether I'm using dask.bag.from_filenames or dask.array. Is there any reason this might not be feasible for all collections?

  2. Assuming point 1 is valid, thoughts on making the argument that specifies partitions the same as the attribute that reports the number of partitions? For dask.array, I need to remember chunks vs. numblocks.

  3. It might be handy to show the number of partitions in the repr for each collection, e.g.:

>>> x
dask.array<arange-..., shape=(100,), dtype=int64, chunksize=(2,), partitions=10>

>>> y
dd.Series<from-dask-arraycdefdd873cdf06131bcc8da67a0e65fb, divisions=(0, 2, 4, ..., 98, 99), partitions=10>

>>> b
<dask.bag.core.Bag at 0x11312e0f0, partitions=10>
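
To illustrate the "inverting" step from point 1, here is a minimal sketch of the arithmetic I end up doing by hand for arrays. The helper names `chunks_for_npartitions` and `numblocks_for` are hypothetical and not part of dask; the sketch only splits along the first axis and uses plain Python, so it runs without dask installed.

```python
import math


def chunks_for_npartitions(shape, npartitions):
    """Hypothetical helper: pick a chunk shape that splits axis 0 of
    `shape` into at most `npartitions` blocks (other axes stay whole)."""
    if npartitions < 1:
        raise ValueError("npartitions must be a positive integer")
    # Round up so the blocks cover the whole axis; this can yield fewer
    # blocks than requested when the axis does not divide evenly.
    chunk = max(1, math.ceil(shape[0] / npartitions))
    return (chunk,) + tuple(shape[1:])


def numblocks_for(shape, chunks):
    """Number of blocks per axis for a given chunk shape, i.e. the
    value one would expect dask.array to report as `numblocks`."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


chunks = chunks_for_npartitions((100,), 10)
print(chunks)                          # (10,)
print(numblocks_for((100,), chunks))   # (10,)

# With dask installed, the result could be passed along as, e.g.:
#   import dask.array as da
#   x = da.arange(100, chunks=chunks)
```

An npartitions-style argument on the array constructors would fold this calculation into the library. Note that the requested count can only be an upper bound when the axis length does not divide evenly (for example, 10 rows with 6 requested partitions gives chunks of 2 and 5 blocks).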
