Harmonize split_every across modules #7283

@crusaderky

The split_every parameter is currently inconsistent across the various dask modules:

dask.array

  • domain: None or Integral ≥ 2 or mapping {axis: (Integral ≥ 2)}
  • If None, fall back to dask.config.get("split_every").
  • The key appears neither in dask/dask.yaml nor in dask/dask-schema.yaml.
  • If the key does not appear in the dask config, fall back to 4 (hardcoded).
  • False is interpreted the same as None.
  • float (e.g. 1e3) and np.float64 are rejected, which is inconsistent with the shape and chunks parameters.
  • The docstring of dask.array.reductions.reduction explains clearly what split_every does, but it appears nowhere in the rendered documentation. https://docs.dask.org/en/latest/array-api.html almost never mentions split_every, because those docs are auto-generated from numpy.
  • The same docstring says "Omit to let dask heuristically decide a good default". This is incorrect: there is no heuristic, just a hardcoded default.
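
To make the list above concrete, here is a minimal sketch of the resolution order as I understand it. It is a model of the behaviour described in this issue, not dask's actual source; `resolve_array_split_every` and the plain-dict `config` are hypothetical stand-ins, and how an integer is spread over the reduced axes is deliberately left out.

```python
from numbers import Integral

# Hardcoded fallback named in the list above.
HARDCODED_ARRAY_DEFAULT = 4


def resolve_array_split_every(split_every, config):
    """Hypothetical model of the dask.array rules listed in this issue.

    ``config`` is a plain dict standing in for the dask config.
    """
    # None and False are both treated as "not given".
    if split_every is None or split_every is False:
        split_every = config.get("split_every", HARDCODED_ARRAY_DEFAULT)

    # Mapping form: {axis: Integral >= 2}
    if isinstance(split_every, dict):
        for axis, value in split_every.items():
            if not isinstance(value, Integral) or value < 2:
                raise ValueError(f"split_every[{axis!r}] must be an integer >= 2")
        return dict(split_every)

    # Scalar form: Integral >= 2. float (e.g. 1e3) is not Integral and is rejected.
    if isinstance(split_every, Integral) and split_every >= 2:
        return int(split_every)

    raise ValueError("split_every must be an integer >= 2 or a dict")
```

With an empty config, `resolve_array_split_every(None, {})` falls through to the hardcoded 4, while `resolve_array_split_every(1e3, {})` raises.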

dask.dataframe

  • domain: None or False (which means no recursion) or int ≥ 2.
  • If None, fall back to 8 (hardcoded). The dask config is ignored.
  • float and np.float64 are rejected. npartitions accepts them, but chunksize doesn't.
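
For comparison, a sketch of the dask.dataframe rules under the same caveats: `resolve_dataframe_split_every` is a hypothetical name, and modelling "no recursion" as a fan-in equal to the number of partitions is my assumption about what a single-step reduction means here.

```python
from numbers import Integral

# Hardcoded default named in the list above; the dask config is not consulted.
DATAFRAME_DEFAULT = 8


def resolve_dataframe_split_every(split_every, npartitions):
    """Hypothetical model of the dask.dataframe rules listed in this issue."""
    if split_every is None:
        return DATAFRAME_DEFAULT
    if split_every is False:
        # Assumption: no recursion means all partitions are combined in one step.
        return npartitions
    if not isinstance(split_every, Integral) or split_every < 2:
        # float and np.float64 end up here.
        raise ValueError("split_every must be an integer >= 2")
    return int(split_every)
```

Note that the same `False` that means "use the default" in dask.array means "do not recurse" here, which is the first mismatch the proposal below addresses.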

dask.bag

Same as dask.dataframe, except that npartitions does not accept float / np.float64.

dask.graph_manipulation

(new in #7282)
Same as dask.dataframe, except that float and np.float64 are accepted by split_every and rounded down to the nearest int.

Proposed design

  • dask.array to interpret False as no recursion, consistently with the other modules
  • all modules to read the default from dask config, which will be set in dask/dask.yaml as either 4 or 8 (please discuss)
  • no hardcoded defaults (consistently with the design of dask.optimize)
  • dask/dask-schema.yaml to define the domain as False or int/float ≥ 2. Floats will be rounded down to the nearest int. All modules to accept as function parameter None, False, or Number ≥ 2. Additionally, dask.array will accept, exclusively as a function parameter, {axis: (Number ≥ 2)}.
  • review sphinx documentation
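
A rough sketch of what a shared normalizer for the proposed design could look like. Everything here is illustrative: `normalize_split_every`, `_check_number` and the `allow_mapping` flag are hypothetical names, `config` is a plain dict standing in for the dask config, and I read "Number" as a real number.

```python
import math
from numbers import Real


def _check_number(value):
    """Validate a Number >= 2 and round it down to the nearest int."""
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"split_every must be a number >= 2, got {value!r}")
    if value < 2:
        raise ValueError(f"split_every must be >= 2, got {value!r}")
    return math.floor(value)


def normalize_split_every(split_every, config, allow_mapping=False):
    """Hypothetical shared normalizer for all modules.

    Returns False (no recursion), an int >= 2, or, only when
    ``allow_mapping`` is set (dask.array), a dict {axis: int >= 2}.
    """
    if split_every is None:
        # No hardcoded fallback: the key is expected to be in the config.
        split_every = config["split_every"]
    elif allow_mapping and isinstance(split_every, dict):
        # Mapping form is accepted exclusively as a function parameter.
        return {axis: _check_number(v) for axis, v in split_every.items()}
    if split_every is False:
        return False
    return _check_number(split_every)
```

Under this sketch `normalize_split_every(1e3, {"split_every": 8})` gives the int 1000, and a mapping passed to a module that does not set `allow_mapping` raises a `TypeError`.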

Alternate design (not recommended)

  • deprecate the top-level split_every key in dask config
  • new config keys array.split_every, dataframe.split_every, bag.split_every and graph_manipulation.split_every which reflect the current mismatch in defaults and domain
  • no hardcoded defaults
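
For illustration only, the per-module keys of the alternate design could carry today's defaults as follows. The nested dict and the `lookup` helper are hypothetical stand-ins for the dask config and its dotted-key access; the values are the current defaults described earlier in this issue.

```python
# Stand-in for the config shipped with dask under the alternate design.
# Values mirror the current per-module defaults listed above.
ALTERNATE_CONFIG = {
    "array": {"split_every": 4},
    "dataframe": {"split_every": 8},
    "bag": {"split_every": 8},
    "graph_manipulation": {"split_every": 8},
}


def lookup(config, dotted_key):
    """Resolve a dotted key such as 'array.split_every' in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # KeyError if missing: no hardcoded defaults
    return node
```

This keeps the 4 vs 8 mismatch visible in the config instead of resolving it, which is why it is not the recommended option.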
