Skip to content

ddf.iloc part 1: add partition_sizes member (typically unset), testing utils#7396

Closed
ryan-williams wants to merge 20 commits intodask:mainfrom
celsiustx:p1
Closed

ddf.iloc part 1: add partition_sizes member (typically unset), testing utils#7396
ryan-williams wants to merge 20 commits intodask:mainfrom
celsiustx:p1

Conversation

@ryan-williams
Copy link
Contributor

@ryan-williams ryan-williams commented Mar 15, 2021

  • Closes #xxxx
  • Tests added / passed
  • Passes black dask / flake8 dask

This is the first substantive block of commits from #6661, broken out for possible incremental reviewing/merging.

This PR, like #6661, is based on a branch p0 that is a merge of other, functionally-unrelated outstanding PRs that were factored out of #6661 (currently: #7214, #7468).

It may be easiest to review the diff celsiustx/dask@p0...p1, which excludes the upstream/in-flight PRs #7214 and #7468.

This PR lays some core plumbing for what follows; subsequent "chunks" of #6661 are more about increasing coverage of partition_sizes-propagation through many specific helpers/transformations.

Overview

High-level changes in this PR (each with a discussion section below):

  • add partition_sizes member to _Frame constructor (and associated builders like new_dd_object)
  • testing utilities: more flexible assertion/verification of {divisions,partition_sizes} in assert_eq
  • proof-of-concept: populate partition_sizes in one concrete DataFrame-construction path, verify in tests
  • add TODOs where partition_sizes-propagation should be expanded

_Frame.partition_sizes

This PR adds a partition_sizes member to _Frame. It is None by default but can be a tuple of ints representing the sizes of each _Frame partition (if known).

_Frame.__init__ performs some validation/conversion to ensure that the size is consistent with that of any provided divisions, specifically:

  • partition_sizes should be one element shorter than divisions (since the latter includes start+end boundaries for each partition)
  • when non-None, partition_sizes is an immutable tuple

Plumbing is also added to various core methods to propagate partition_sizes through various transformations (__getstate__/__setstate__, __dask_postpersist__/_rebuild, new_dd_object).

map_partitions is also given a preserve_partition_sizes=False param which callers can set when it is known that partition_sizes should remain the same through a given transformation.

Testing: more flexible assertion/verification of {divisions,partition_sizes}

I also update testing helpers (notably assert_eq) to support verifying divisions and partition_sizes at different levels of strictness; from assert_eqs doc:

    """Verify that two _Frame-like objects are equal, subject to flexible matching constraints.
    `divisions` takes the following values:
    - <a tuple>: require that `a` and `b` both have `divisions` matching the given tuple
    - <a list of two tuples>: require that `a`'s (resp. `b`'s) `divisions` matches the first (resp. second) list
        element; `False` in either position means "skip this position".
    - "verify": check `divisions` for internal consistency; `compute` all partitions and verify that their bounds. This
        check is performed for all non-False-y values of `divisions`.
    - "dtype": verify the sides' divisions' dtypes match
    - "dtype?" (default, for backwards-compatibility): similar to "dtype", but only perform the check if `a` and `b`
        both have `divisions` attrs
    - "check" | True: verify that both sides' divisions match the other's (without checking their specific value)
    - "check?": similar to "check", but only applied if both sides have `divisions`
    - False (or anything "False-y"): perform no checks

    `partition_sizes` behaves just like `divisions`
    """

This is a lot of options, but the flexibility to specify exactly what is expected to match vs. not in a wide range of situations found in the code-base comes in handy, especially while building up to full iloc functionality from #6661 and existing in some half-implemented states. When some dust settles, I think we can reduce the number of cases, default to stricter checks, explicitly list both sides' divisions in the (less common) case where they are not expected to exactly match, etc.

Note that "dtype?" is the current/default verification level for divisions, which gives relatively weak guarantees (and allowed some bugs / wonky behavior to fall through the cracks, which I found as I started explicitly passing expected-value tuples to assert_eq in various tests).

partition_sizes, having no back-compat constraints, defaults to a stronger default of "check?", which:

  • verifies internal consistency: any non-None value actually matches the sizes of the underlying partitions
  • verifies that both sides' match each other (when they are both set)

The most common values I pass to assert_eq's divisions and partition_sizes params are:

  • a single tuple (or None): both sides must contain exactly this value
  • a list of 2 (tuple|None) values: [lhs,rhs] must exactly match the values in this list (1 side or the other will commonly be None, making explicit which transformations populate or lose divisions/partition_sizes)

partition_sizes always None (except in one proof-of-concept case + test)

As of this PR, the new partition_sizes member remains None in all cases except one: as a proof of concept, make_timeseries populates partition_sizes in its call to the DataFrame constructor. This is verified in the test_datasets_timeseries test, which exercises the more flexible assert_eq logic above to explicitly verify that:

    df = dask.datasets.timeseries(
        start="2000-01-01", end="2000-01-10", freq="1d"
    ).persist()
    df.to_parquet(tmp_path, engine=engine)

    df2 = dd.read_parquet(tmp_path, engine=engine)
    assert_eq(df, df2, partition_sizes=[(1, 1, 1, 1, 1, 1, 1, 1, 1), None])

TODOs for partition_sizes coverage

This PR also adds some # TODO: partition_sizes markers to calls (typically to new_dd_object) where a partition_sizes param should be explicitly computed+passed (even if it is expected to be None due to a transformation intrinsically losing partition-size info).

Most of these are resolved later in #6661.

@jsignell jsignell added the needs review Needs review from a contributor. label Feb 28, 2022
@jcrist
Copy link
Member

jcrist commented Mar 16, 2022

Hi @ryan-williams , thanks for this PR (and apologies for the super delayed response). We're not going to merge this as is, but the code here will be useful to reference in #5633, which we do plan to work on. Thanks for all your work here.

@jcrist jcrist closed this Mar 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

array dataframe io needs review Needs review from a contributor.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants