ddf.iloc part 1: add `partition_sizes` member (typically unset), testing utils by ryan-williams · Pull Request #7396 · dask/dask

ryan-williams · 2021-03-15T14:56:33Z

~~Closes #xxxx~~
Tests added / passed
Passes black dask / flake8 dask

This is the first substantive block of commits from #6661, broken out for possible incremental reviewing/merging.

This PR, like #6661, is based on a branch p0 that is a merge of other, functionally-unrelated outstanding PRs that were factored out of #6661 (currently: #7214, #7468).

It may be easiest to review the diff celsiustx/dask@p0...p1, which excludes the upstream/in-flight PRs #7214 and #7468.

This PR lays some core plumbing for what follows; subsequent "chunks" of #6661 are more about increasing coverage of partition_sizes-propagation through many specific helpers/transformations.

Overview

High-level changes in this PR (each with a discussion section below):

add partition_sizes member to _Frame constructor (and associated builders like new_dd_object)
testing utilities: more flexible assertion/verification of {divisions,partition_sizes} in assert_eq
proof-of-concept: populate partition_sizes in one concrete DataFrame-construction path, verify in tests
add TODOs where partition_sizes-propagation should be expanded

`_Frame.partition_sizes`

This PR adds a partition_sizes member to _Frame. It is None by default but can be a tuple of ints representing the sizes of each _Frame partition (if known).

_Frame.__init__ performs some validation/conversion to ensure that the size is consistent with that of any provided divisions, specifically:

partition_sizes should be one element shorter than divisions (since the latter includes start+end boundaries for each partition)
when non-None, partition_sizes is an immutable tuple

Plumbing is also added to various core methods to propagate partition_sizes through various transformations (__getstate__/__setstate__, __dask_postpersist__/_rebuild, new_dd_object).

map_partitions is also given a preserve_partition_sizes=False param which callers can set when it is known that partition_sizes should remain the same through a given transformation.

Testing: more flexible assertion/verification of {`divisions`,`partition_sizes`}

I also update testing helpers (notably assert_eq) to support verifying divisions and partition_sizes at different levels of strictness; from assert_eqs doc:

    """Verify that two _Frame-like objects are equal, subject to flexible matching constraints.
    `divisions` takes the following values:
    - <a tuple>: require that `a` and `b` both have `divisions` matching the given tuple
    - <a list of two tuples>: require that `a`'s (resp. `b`'s) `divisions` matches the first (resp. second) list
        element; `False` in either position means "skip this position".
    - "verify": check `divisions` for internal consistency; `compute` all partitions and verify that their bounds. This
        check is performed for all non-False-y values of `divisions`.
    - "dtype": verify the sides' divisions' dtypes match
    - "dtype?" (default, for backwards-compatibility): similar to "dtype", but only perform the check if `a` and `b`
        both have `divisions` attrs
    - "check" | True: verify that both sides' divisions match the other's (without checking their specific value)
    - "check?": similar to "check", but only applied if both sides have `divisions`
    - False (or anything "False-y"): perform no checks

    `partition_sizes` behaves just like `divisions`
    """

This is a lot of options, but the flexibility to specify exactly what is expected to match vs. not in a wide range of situations found in the code-base comes in handy, especially while building up to full iloc functionality from #6661 and existing in some half-implemented states. When some dust settles, I think we can reduce the number of cases, default to stricter checks, explicitly list both sides' divisions in the (less common) case where they are not expected to exactly match, etc.

Note that "dtype?" is the current/default verification level for divisions, which gives relatively weak guarantees (and allowed some bugs / wonky behavior to fall through the cracks, which I found as I started explicitly passing expected-value tuples to assert_eq in various tests).

partition_sizes, having no back-compat constraints, defaults to a stronger default of "check?", which:

verifies internal consistency: any non-None value actually matches the sizes of the underlying partitions
verifies that both sides' match each other (when they are both set)

The most common values I pass to assert_eq's divisions and partition_sizes params are:

a single tuple (or None): both sides must contain exactly this value
a list of 2 (tuple|None) values: [lhs,rhs] must exactly match the values in this list (1 side or the other will commonly be None, making explicit which transformations populate or lose divisions/partition_sizes)

`partition_sizes` always `None` (except in one proof-of-concept case + test)

As of this PR, the new partition_sizes member remains None in all cases except one: as a proof of concept, make_timeseries populates partition_sizes in its call to the DataFrame constructor. This is verified in the test_datasets_timeseries test, which exercises the more flexible assert_eq logic above to explicitly verify that:

partition_sizes is set by dask.datasets.timeseries (and propagated through a persist())
partition_sizes is lost by a round-trip through Parquet format (partition_sizes propagation through Parquet round-trips is added later in WIP: DataFrame.iloc implementation (backed by partition_sizes member) #6661)

    df = dask.datasets.timeseries(
        start="2000-01-01", end="2000-01-10", freq="1d"
    ).persist()
    df.to_parquet(tmp_path, engine=engine)

    df2 = dd.read_parquet(tmp_path, engine=engine)
    assert_eq(df, df2, partition_sizes=[(1, 1, 1, 1, 1, 1, 1, 1, 1), None])

TODOs for `partition_sizes` coverage

This PR also adds some # TODO: partition_sizes markers to calls (typically to new_dd_object) where a partition_sizes param should be explicitly computed+passed (even if it is expected to be None due to a transformation intrinsically losing partition-size info).

Most of these are resolved later in #6661.

jcrist · 2022-03-16T15:35:08Z

Hi @ryan-williams , thanks for this PR (and apologies for the super delayed response). We're not going to merge this as is, but the code here will be useful to reference in #5633, which we do plan to work on. Thanks for all your work here.

ryan-williams mentioned this pull request Mar 15, 2021

WIP: DataFrame.iloc implementation (backed by partition_sizes member) #6661

Closed

14 tasks

ryan-williams force-pushed the p1 branch 2 times, most recently from 18297f9 to 5a8e332 Compare March 16, 2021 20:30

ryan-williams force-pushed the p1 branch 2 times, most recently from 126ad27 to 8eb3c26 Compare March 26, 2021 14:18

ryan-williams force-pushed the p1 branch from 8eb3c26 to c156883 Compare April 13, 2021 23:54

github-actions bot added array dataframe io labels Apr 13, 2021

ryan-williams added 12 commits May 19, 2021 20:55

compute_and_set_divisions cleanup, preserve divisions dtype

b1930e2

drop+concat duplicate divisions in compute_and_set_sorted_divisions

c691ac1

test Array.A w/ np.matrix, spmatrix

f1e8685

np.matrix concatenate in compute()

067c656

test np.matrix blocks round trip

4f6b1ae

fix Array.A for non-ndarray blocks

6a7a9f2

let Array._meta be a coo_matrix

59c9644

more asserts around spmatrix concat behavior

d04516b

skip keepdims-squeezing for np.matrix, scipy spmatrix chunktypes

de16023

preserve spmatrix type through concatenate2

5f1091d

more spmatrix concat, stack, sum assertions

5b8b6e6

Merge branch 'divisions' into sum2

175b788

ryan-williams force-pushed the p1 branch from c156883 to be592d5 Compare May 20, 2021 01:05

ryan-williams added 8 commits May 19, 2021 21:35

partition_sizes: df ctors

821f200

partition-sizes: TODOs

6418a1d

attr: _args, __setstate__

7858345

attr: handle_out

f7bdede

new_dd_object: impl

c6e1ef3

maybe_infer_partition_sizes: impl

2126908

preserve_partition_sizes: impl

936cef0

improve divisions, partition-size checking

2229747

ryan-williams force-pushed the p1 branch from be592d5 to 2229747 Compare May 20, 2021 01:35

jsignell added the needs review Needs review from a contributor. label Feb 28, 2022

jcrist closed this Mar 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ddf.iloc part 1: add `partition_sizes` member (typically unset), testing utils#7396

ddf.iloc part 1: add `partition_sizes` member (typically unset), testing utils#7396
ryan-williams wants to merge 20 commits intodask:mainfrom
celsiustx:p1

ryan-williams commented Mar 15, 2021 •

edited

Loading

Uh oh!

jcrist commented Mar 16, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ryan-williams commented Mar 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

_Frame.partition_sizes

Testing: more flexible assertion/verification of {divisions,partition_sizes}

partition_sizes always None (except in one proof-of-concept case + test)

TODOs for partition_sizes coverage

Uh oh!

jcrist commented Mar 16, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ryan-williams commented Mar 15, 2021 •

edited

Loading

`_Frame.partition_sizes`

Testing: more flexible assertion/verification of {`divisions`,`partition_sizes`}

`partition_sizes` always `None` (except in one proof-of-concept case + test)

TODOs for `partition_sizes` coverage