ddf.iloc part 1: add partition_sizes member (typically unset), testing utils#7396
Closed
ryan-williams wants to merge 20 commits intodask:mainfrom
Closed
ddf.iloc part 1: add partition_sizes member (typically unset), testing utils#7396ryan-williams wants to merge 20 commits intodask:mainfrom
partition_sizes member (typically unset), testing utils#7396ryan-williams wants to merge 20 commits intodask:mainfrom
Conversation
14 tasks
18297f9 to
5a8e332
Compare
126ad27 to
8eb3c26
Compare
Member
|
Hi @ryan-williams , thanks for this PR (and apologies for the super delayed response). We're not going to merge this as is, but the code here will be useful to reference in #5633, which we do plan to work on. Thanks for all your work here. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #xxxxblack dask/flake8 daskThis is the first substantive block of commits from #6661, broken out for possible incremental reviewing/merging.
This PR, like #6661, is based on a branch
p0that is a merge of other, functionally-unrelated outstanding PRs that were factored out of #6661 (currently: #7214, #7468).It may be easiest to review the diff celsiustx/dask@p0...p1, which excludes the upstream/in-flight PRs #7214 and #7468.
This PR lays some core plumbing for what follows; subsequent "chunks" of #6661 are more about increasing coverage of
partition_sizes-propagation through many specific helpers/transformations.Overview
High-level changes in this PR (each with a discussion section below):
partition_sizesmember to_Frameconstructor (and associated builders likenew_dd_object)divisions,partition_sizes} inassert_eqpartition_sizesin one concrete DataFrame-construction path, verify in testspartition_sizes-propagation should be expanded_Frame.partition_sizesThis PR adds a
partition_sizesmember to_Frame. It isNoneby default but can be atupleofints representing the sizes of each_Framepartition (if known)._Frame.__init__performs some validation/conversion to ensure that the size is consistent with that of any provideddivisions, specifically:partition_sizesshould be one element shorter thandivisions(since the latter includes start+end boundaries for each partition)None,partition_sizesis an immutabletuplePlumbing is also added to various core methods to propagate
partition_sizesthrough various transformations (__getstate__/__setstate__,__dask_postpersist__/_rebuild,new_dd_object).map_partitionsis also given apreserve_partition_sizes=Falseparam which callers can set when it is known thatpartition_sizesshould remain the same through a given transformation.Testing: more flexible assertion/verification of {
divisions,partition_sizes}I also update testing helpers (notably
assert_eq) to support verifyingdivisionsandpartition_sizesat different levels of strictness; fromassert_eqs doc:This is a lot of options, but the flexibility to specify exactly what is expected to match vs. not in a wide range of situations found in the code-base comes in handy, especially while building up to full
ilocfunctionality from #6661 and existing in some half-implemented states. When some dust settles, I think we can reduce the number of cases, default to stricter checks, explicitly list both sides'divisionsin the (less common) case where they are not expected to exactly match, etc.Note that
"dtype?"is the current/default verification level fordivisions, which gives relatively weak guarantees (and allowed some bugs / wonky behavior to fall through the cracks, which I found as I started explicitly passing expected-value tuples toassert_eqin various tests).partition_sizes, having no back-compat constraints, defaults to a stronger default of"check?", which:Nonevalue actually matches the sizes of the underlying partitionsThe most common values I pass to
assert_eq'sdivisionsandpartition_sizesparams are:None): both sides must contain exactly this valueNone, making explicit which transformations populate or losedivisions/partition_sizes)partition_sizesalwaysNone(except in one proof-of-concept case + test)As of this PR, the new
partition_sizesmember remainsNonein all cases except one: as a proof of concept,make_timeseriespopulatespartition_sizesin its call to theDataFrameconstructor. This is verified in thetest_datasets_timeseriestest, which exercises the more flexibleassert_eqlogic above to explicitly verify that:partition_sizesis set bydask.datasets.timeseries(and propagated through apersist())partition_sizesis lost by a round-trip through Parquet format (partition_sizespropagation through Parquet round-trips is added later in WIP: DataFrame.iloc implementation (backed bypartition_sizesmember) #6661)TODOs for
partition_sizescoverageThis PR also adds some
# TODO: partition_sizesmarkers to calls (typically tonew_dd_object) where apartition_sizesparam should be explicitly computed+passed (even if it is expected to beNonedue to a transformation intrinsically losing partition-size info).Most of these are resolved later in #6661.