Ensure that divisions are alway tuples by charlesbluca · Pull Request #8393 · dask/dask

charlesbluca · 2021-11-17T19:13:36Z

This PR does the following:

Turns _Frame.divisions into a property wrapping _Frame._divisions, with a setter that casts the provided value into a tuple
Adds an assertion to assert_divisions to check that divisions is a tuple

cc @jsignell would you mind helping out with the docstring for the divisions property?

Closes Standardizing type for divisions #8388
Tests added / passed
Passes pre-commit run --all-files

This reverts commit b196e4a.

dask/dataframe/core.py

Co-authored-by: Julia Signell <jsignell@gmail.com>

dask/dataframe/core.py

gjoseph92 · 2021-11-17T21:27:09Z

dask/dataframe/core.py

+
+    @divisions.setter
+    def divisions(self, value):
+        self._divisions = tuple(value)


It could also be worth checking that this is sorted in increasing order? If you set divisions to something un-sorted, you'd probably get some weird errors down the line.

That seems like a sensible check - added this in along with a check that unknown divisions are being set properly (since we already need to check that None isn't in divisions before checking sortedness)

Following up here, we might want to make an ascending check a little more nuanced, as it looks like there are some situations where technically non-ascending divisions are assigned:

_________________________ test_set_index_categorical __________________________ [gw3] win32 -- Python 3.7.12 C:\Miniconda3\envs\test-environment\python.exe def test_set_index_categorical(): # https://github.com/dask/dask/issues/5671 order = list(reversed(string.ascii_letters)) values = list(string.ascii_letters) random.shuffle(values) dtype = pd.api.types.CategoricalDtype(order, ordered=True) df = pd.DataFrame({"A": pd.Categorical(values, dtype=dtype), "B": 1}) > result = dd.from_pandas(df, npartitions=2).set_index("A") dask\dataframe\tests\test_shuffle.py:908: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ dask\dataframe\core.py:4430: in set_index **kwargs, dask\dataframe\shuffle.py:206: in set_index df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs dask\dataframe\shuffle.py:325: in set_partition df4.divisions = methods.tolist(divisions) dask\dataframe\core.py:4194: in __setattr__ object.__setattr__(self, key, value) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = Dask DataFrame Structure: B npartitions=2 int64 ... ... Dask Name: set_index_post_scalar, 16 tasks value = ['Z', 'C', 'a'] @divisions.setter def divisions(self, value): if None in value: if any(v is not None for v in value): raise ValueError("divisions must be either all null or all non-null") elif any(a > b for a, b in zip(value, value[1:])): > raise ValueError("divisions must be sorted in ascending order") E ValueError: divisions must be sorted in ascending order dask\dataframe\core.py:390: ValueError

Alternatively this could be a place where we are setting divisions incorrectly that is now exposed because of the checks 😄

I have a feeling it's the latter.

After looking into this, think this is actually on us - this failure is coming up when we attempt to set the index to an ordered categorical column. Though the elements of divisions are strings, they represent categories with a different ordering.

One way I can imagine this could be resolved is if we checked if divisions is sorted when it's cast as a column/index with the same dtype as self.index? Or if we can access the ordering from the categorical dtype, we could make an override specifically for that index dtype.

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

dask/dataframe/core.py

gjoseph92 · 2021-11-18T17:11:14Z

dask/dataframe/core.py

+
+    @divisions.setter
+    def divisions(self, value):
+        self._divisions = tuple(value)


I have a feeling it's the latter.

gjoseph92 · 2021-11-18T17:11:50Z

dask/dataframe/core.py

+
+    @divisions.setter
+    def divisions(self, value):
+        if None in value:


Can we also check somewhere that len(value) == self.npartitions + 1?

Since npartitions is based off of the length of divisions, could we just check that len(value) == len(self._divisions)?

@jrbourbeau brought up it might be less surprising, and allow us to do a deprecation cycle, if instead of immediately coercing to tuple, we warn if divisions isn't a tuple. This will have the added benefit of snuffling out any places in the dask codebase where we are setting divisions with a list.

Yeah, specifically I'll propose we:

Emit a deprecation warning when the divisions setter encounters a non-tuple value that says something like "we will automatically convert this to a tuple in the future". This will let us inform users who are explicitly setting divisions today about this change. This may be overkill, but is in line with our desire to be better about informing users of behavior changes. This will also cause errors to be raised in our CI in all the places internally where divisions is a non-tuple today.

Change all the places internally in dask/dask where we're not using tuple divisions to be a tuple. This should result in CI passing again.

Open an issue about the deprecation and when we the deprecation cycle should end

That sounds good to me - in that case, would we also like to hold off on additional checks on the length / contexts of value until that deprecation cycle ends?

My sense is a deprecation cycle is still better than raising an error in that case. The goal would be to avoid downstream breakages without some kind of warning first.

Yeah, definitely in favor of the deprecation cycle here since there are known downstream libraries this would impact (dask-cudf and dask-sql in my case); I just prefer shifting from a DeprecationWarning to a ValueError in the future rather than implicitly converting to a tuple

Ah, I see. I misinterpreted your previous comment. Post-deprecation, raising an error (probably a TypeError in this case) seems like a reasonable thing to do -- no strong opinion from me

Following up on the original concern here (length check on the new divisions), it looks like there are several places in the codebase where a divisions with a different length will be provided as a means of repartitioning the dataframe. In cases like this, would we prefer instead using df.repartition(divisions=...)?

I can think of cases where this currently wouldn't work (for example, if we want to repartition using unknown divisions) that seem to motivate leaving out the length check for now, unless we want to address those areas in this PR

I see how that could be a chicken and the egg problem :), but I don't understand how setting divisions to a different length would work for repartitioning the dataframe. Do you have a specific example of where that happens?

dask/dataframe/core.py

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

jsignell

I was looking to see if there are any comments that should be adjusted and I noticed that there are a lot of other places where divisions are passed as kwargs. set_index and repartition come to mind. I am wondering if those should also only accept tuples. There is a check_division function that you could probably reuse at least.

jsignell · 2021-11-29T16:09:25Z

dask/dataframe/core.py

            )
        self._meta = meta
-        self.divisions = tuple(divisions)
+        self._divisions = tuple(divisions)


Do we want this coercion still?

Hmm good point removing this will probably expose several areas in the codebase where we are passing a list divisions to the constructor; happy to change this to something like

Suggested change

self._divisions = tuple(divisions)

self.divisions = divisions

IIRC that caused some issues with _divisions not being defined yet but that shouldn't be too difficult to resolve

If it's not an antipattern to use the setter, that is probably easiest yeah.

After working through some of the failures resulting from removing this coercion, having some second thoughts on removing it; this would essentially mean that all functions that implicitly set divisions (such as repartition) would need to pass tuple divisions to avoid breakage.

IMO, it would be less restrictive to users to allow iterable divisions for APIs that set them implicitly, and just silently coerce them to tuples internally; then the setter would only come into play when a user (or API) attempts to explicitly set divisions on a constructed frame.

Ok I can see how that would become a massive change. It is fine with me to start with this, and we can always change it later if we decide that it's preferable to be more explicit.

jsignell · 2021-11-29T16:11:12Z

dask/dataframe/core.py

+    def divisions(self, value):
+        if None in value:
+            if any(v is not None for v in value):
+                raise ValueError("`divisions` must be either all null or all non-null")


This should also be a warning not an Error for this first PR.

What warning should be used here? I considered PendingDeprecationWarning but to my knowledge divisions with mixed null / non-null values weren't supported before?

I don't have a strong feeling, FutureWarning or PendingDeprecationWarning is fine with me. There wasn't any check before so people could set whatever divisions they wanted right? Not to say that things would work after that, but they could always set the divisions.

This reverts commit b7d3368.

jsignell

This seems like it's in a good state, @charlesbluca is it good to merge from your perspective?

charlesbluca · 2021-12-20T20:53:52Z

@jsignell yeah I think it should be safe to merge this 🙂

- Turns `_Frame.divisions` into a property wrapping `_Frame._divisions`, with a setter that casts the provided value into a tuple - Adds an assertion to `assert_divisions` to check that `divisions` is a tuple Co-authored-by: Julia Signell <jsignell@gmail.com> Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

charlesbluca added 2 commits November 17, 2021 10:43

Use property/setter methods to ensure divisions is tuple

933d7cc

Assert divisions type in assert_divisions

e5f9a44

charlesbluca marked this pull request as draft November 17, 2021 19:13

github-actions bot added the dataframe label Nov 17, 2021

Revert "Ensure single-partition join divisions is tuple (dask#8389)"

df3c234

This reverts commit b196e4a.

jsignell reviewed Nov 17, 2021

View reviewed changes

dask/dataframe/core.py Show resolved Hide resolved

charlesbluca and others added 2 commits November 17, 2021 15:34

Add docstring for divisions property

bbe2b89

Co-authored-by: Julia Signell <jsignell@gmail.com>

Run pre-commit hooks

d698dc9

gjoseph92 reviewed Nov 17, 2021

View reviewed changes

charlesbluca and others added 4 commits November 18, 2021 10:19

Update divisions docstring

9c15082

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Minor doc correction

3ab1682

Set scalar divisions to be tuple

a468e2c

Add sorting / null checks for divisions assignment

f17e44e

douglasdavis reviewed Nov 18, 2021

View reviewed changes

dask/dataframe/core.py Show resolved Hide resolved

gjoseph92 reviewed Nov 18, 2021

View reviewed changes

dask/dataframe/core.py Outdated Show resolved Hide resolved

charlesbluca and others added 8 commits November 18, 2021 13:43

Correct divisions example

7538bfe

Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>

Add divisions length check

af2d2c5

Add deprecation warning for non-tuple divisions

71e7a2d

Negate tuple check

b32ebcf

First pass at resolving test failures

4488e3b

More test resolution

b4827db

Remove divisions len check

ef21e14

Switch CI to use tuple divisions for remaining failures

b47b382

charlesbluca marked this pull request as ready for review November 29, 2021 14:56

jsignell reviewed Nov 29, 2021

View reviewed changes

charlesbluca added 2 commits November 29, 2021 11:58

Always use divisions setter in constructors

b7d3368

Change value error to warning

7f9ac18

This was referenced Dec 1, 2021

When divisions has repeats, set_index puts all data in the last partition instead of balancing it #8437

Open

Optimize groupby when by contains ddf.index #8442

Open

charlesbluca added 2 commits December 7, 2021 12:26

Revert "Always use divisions setter in constructors"

9d56374

This reverts commit b7d3368.

Merge remote-tracking branch 'upstream/main' into ensure-tuple-divisions

663c236

jsignell approved these changes Dec 20, 2021

View reviewed changes

jsignell merged commit 5ad375a into dask:main Dec 20, 2021

jcrist mentioned this pull request Mar 14, 2022

Add sanity checks to divisions setter #8806

Merged

	self._divisions = tuple(divisions)
	self.divisions = divisions

Uh oh!

Conversation

charlesbluca commented Nov 17, 2021 • edited by jsignell Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charlesbluca Nov 18, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jsignell left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jsignell left a comment

Choose a reason for hiding this comment

Uh oh!

charlesbluca commented Dec 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

charlesbluca commented Nov 17, 2021 •

edited by jsignell

Loading

charlesbluca Nov 18, 2021 •

edited

Loading