Skip to content

Ensure that divisions are alway tuples#8393

Merged
jsignell merged 21 commits intodask:mainfrom
charlesbluca:ensure-tuple-divisions
Dec 20, 2021
Merged

Ensure that divisions are alway tuples#8393
jsignell merged 21 commits intodask:mainfrom
charlesbluca:ensure-tuple-divisions

Conversation

@charlesbluca
Copy link
Member

@charlesbluca charlesbluca commented Nov 17, 2021

This PR does the following:

  • Turns _Frame.divisions into a property wrapping _Frame._divisions, with a setter that casts the provided value into a tuple
  • Adds an assertion to assert_divisions to check that divisions is a tuple

cc @jsignell would you mind helping out with the docstring for the divisions property?

@charlesbluca charlesbluca marked this pull request as draft November 17, 2021 19:13
charlesbluca and others added 2 commits November 17, 2021 15:34
Co-authored-by: Julia Signell <jsignell@gmail.com>

@divisions.setter
def divisions(self, value):
self._divisions = tuple(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also be worth checking that this is sorted in increasing order? If you set divisions to something un-sorted, you'd probably get some weird errors down the line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a sensible check - added this in along with a check that unknown divisions are being set properly (since we already need to check that None isn't in divisions before checking sortedness)

Copy link
Member Author

@charlesbluca charlesbluca Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up here, we might want to make an ascending check a little more nuanced, as it looks like there are some situations where technically non-ascending divisions are assigned:

_________________________ test_set_index_categorical __________________________
[gw3] win32 -- Python 3.7.12 C:\Miniconda3\envs\test-environment\python.exe

    def test_set_index_categorical():
        # https://github.com/dask/dask/issues/5671
        order = list(reversed(string.ascii_letters))
        values = list(string.ascii_letters)
        random.shuffle(values)
        dtype = pd.api.types.CategoricalDtype(order, ordered=True)
        df = pd.DataFrame({"A": pd.Categorical(values, dtype=dtype), "B": 1})
    
>       result = dd.from_pandas(df, npartitions=2).set_index("A")

dask\dataframe\tests\test_shuffle.py:908: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask\dataframe\core.py:4430: in set_index
    **kwargs,
dask\dataframe\shuffle.py:206: in set_index
    df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs
dask\dataframe\shuffle.py:325: in set_partition
    df4.divisions = methods.tolist(divisions)
dask\dataframe\core.py:4194: in __setattr__
    object.__setattr__(self, key, value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = Dask DataFrame Structure:
                   B
npartitions=2       
               int64
                 ...
                 ...
Dask Name: set_index_post_scalar, 16 tasks
value = ['Z', 'C', 'a']

    @divisions.setter
    def divisions(self, value):
        if None in value:
            if any(v is not None for v in value):
                raise ValueError("divisions must be either all null or all non-null")
        elif any(a > b for a, b in zip(value, value[1:])):
>           raise ValueError("divisions must be sorted in ascending order")
E           ValueError: divisions must be sorted in ascending order

dask\dataframe\core.py:390: ValueError

Alternatively this could be a place where we are setting divisions incorrectly that is now exposed because of the checks 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling it's the latter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking into this, think this is actually on us - this failure is coming up when we attempt to set the index to an ordered categorical column. Though the elements of divisions are strings, they represent categories with a different ordering.

One way I can imagine this could be resolved is if we checked if divisions is sorted when it's cast as a column/index with the same dtype as self.index? Or if we can access the ordering from the categorical dtype, we could make an override specifically for that index dtype.


@divisions.setter
def divisions(self, value):
self._divisions = tuple(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling it's the latter.


@divisions.setter
def divisions(self, value):
if None in value:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check somewhere that len(value) == self.npartitions + 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since npartitions is based off of the length of divisions, could we just check that len(value) == len(self._divisions)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrbourbeau brought up it might be less surprising, and allow us to do a deprecation cycle, if instead of immediately coercing to tuple, we warn if divisions isn't a tuple. This will have the added benefit of snuffling out any places in the dask codebase where we are setting divisions with a list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, specifically I'll propose we:

  1. Emit a deprecation warning when the divisions setter encounters a non-tuple value that says something like "we will automatically convert this to a tuple in the future". This will let us inform users who are explicitly setting divisions today about this change. This may be overkill, but is in line with our desire to be better about informing users of behavior changes. This will also cause errors to be raised in our CI in all the places internally where divisions is a non-tuple today.
  2. Change all the places internally in dask/dask where we're not using tuple divisions to be a tuple. This should result in CI passing again.
  3. Open an issue about the deprecation and when we the deprecation cycle should end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me - in that case, would we also like to hold off on additional checks on the length / contexts of value until that deprecation cycle ends?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense is a deprecation cycle is still better than raising an error in that case. The goal would be to avoid downstream breakages without some kind of warning first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely in favor of the deprecation cycle here since there are known downstream libraries this would impact (dask-cudf and dask-sql in my case); I just prefer shifting from a DeprecationWarning to a ValueError in the future rather than implicitly converting to a tuple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I misinterpreted your previous comment. Post-deprecation, raising an error (probably a TypeError in this case) seems like a reasonable thing to do -- no strong opinion from me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the original concern here (length check on the new divisions), it looks like there are several places in the codebase where a divisions with a different length will be provided as a means of repartitioning the dataframe. In cases like this, would we prefer instead using df.repartition(divisions=...)?

I can think of cases where this currently wouldn't work (for example, if we want to repartition using unknown divisions) that seem to motivate leaving out the length check for now, unless we want to address those areas in this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how that could be a chicken and the egg problem :), but I don't understand how setting divisions to a different length would work for repartitioning the dataframe. Do you have a specific example of where that happens?

@charlesbluca charlesbluca marked this pull request as ready for review November 29, 2021 14:56
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking to see if there are any comments that should be adjusted and I noticed that there are a lot of other places where divisions are passed as kwargs. set_index and repartition come to mind. I am wondering if those should also only accept tuples. There is a check_division function that you could probably reuse at least.

)
self._meta = meta
self.divisions = tuple(divisions)
self._divisions = tuple(divisions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this coercion still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point removing this will probably expose several areas in the codebase where we are passing a list divisions to the constructor; happy to change this to something like

Suggested change
self._divisions = tuple(divisions)
self.divisions = divisions

IIRC that caused some issues with _divisions not being defined yet but that shouldn't be too difficult to resolve

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not an antipattern to use the setter, that is probably easiest yeah.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After working through some of the failures resulting from removing this coercion, having some second thoughts on removing it; this would essentially mean that all functions that implicitly set divisions (such as repartition) would need to pass tuple divisions to avoid breakage.

IMO, it would be less restrictive to users to allow iterable divisions for APIs that set them implicitly, and just silently coerce them to tuples internally; then the setter would only come into play when a user (or API) attempts to explicitly set divisions on a constructed frame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I can see how that would become a massive change. It is fine with me to start with this, and we can always change it later if we decide that it's preferable to be more explicit.

def divisions(self, value):
if None in value:
if any(v is not None for v in value):
raise ValueError("`divisions` must be either all null or all non-null")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be a warning not an Error for this first PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What warning should be used here? I considered PendingDeprecationWarning but to my knowledge divisions with mixed null / non-null values weren't supported before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong feeling, FutureWarning or PendingDeprecationWarning is fine with me. There wasn't any check before so people could set whatever divisions they wanted right? Not to say that things would work after that, but they could always set the divisions.

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it's in a good state, @charlesbluca is it good to merge from your perspective?

@charlesbluca
Copy link
Member Author

@jsignell yeah I think it should be safe to merge this 🙂

@jsignell jsignell merged commit 5ad375a into dask:main Dec 20, 2021
aeisenbarth pushed a commit to aeisenbarth/dask that referenced this pull request Jan 6, 2022
- Turns `_Frame.divisions` into a property wrapping `_Frame._divisions`, with a setter that casts the provided value into a tuple
- Adds an assertion to `assert_divisions` to check that `divisions` is a tuple

Co-authored-by: Julia Signell <jsignell@gmail.com>
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Standardizing type for divisions

5 participants