Ensure that divisions are alway tuples#8393
Conversation
This reverts commit b196e4a.
Co-authored-by: Julia Signell <jsignell@gmail.com>
dask/dataframe/core.py
Outdated
|
|
||
| @divisions.setter | ||
| def divisions(self, value): | ||
| self._divisions = tuple(value) |
There was a problem hiding this comment.
It could also be worth checking that this is sorted in increasing order? If you set divisions to something un-sorted, you'd probably get some weird errors down the line.
There was a problem hiding this comment.
That seems like a sensible check - added this in along with a check that unknown divisions are being set properly (since we already need to check that None isn't in divisions before checking sortedness)
There was a problem hiding this comment.
Following up here, we might want to make an ascending check a little more nuanced, as it looks like there are some situations where technically non-ascending divisions are assigned:
_________________________ test_set_index_categorical __________________________
[gw3] win32 -- Python 3.7.12 C:\Miniconda3\envs\test-environment\python.exe
def test_set_index_categorical():
# https://github.com/dask/dask/issues/5671
order = list(reversed(string.ascii_letters))
values = list(string.ascii_letters)
random.shuffle(values)
dtype = pd.api.types.CategoricalDtype(order, ordered=True)
df = pd.DataFrame({"A": pd.Categorical(values, dtype=dtype), "B": 1})
> result = dd.from_pandas(df, npartitions=2).set_index("A")
dask\dataframe\tests\test_shuffle.py:908:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask\dataframe\core.py:4430: in set_index
**kwargs,
dask\dataframe\shuffle.py:206: in set_index
df, index, divisions, shuffle=shuffle, drop=drop, compute=compute, **kwargs
dask\dataframe\shuffle.py:325: in set_partition
df4.divisions = methods.tolist(divisions)
dask\dataframe\core.py:4194: in __setattr__
object.__setattr__(self, key, value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Dask DataFrame Structure:
B
npartitions=2
int64
...
...
Dask Name: set_index_post_scalar, 16 tasks
value = ['Z', 'C', 'a']
@divisions.setter
def divisions(self, value):
if None in value:
if any(v is not None for v in value):
raise ValueError("divisions must be either all null or all non-null")
elif any(a > b for a, b in zip(value, value[1:])):
> raise ValueError("divisions must be sorted in ascending order")
E ValueError: divisions must be sorted in ascending order
dask\dataframe\core.py:390: ValueErrorAlternatively this could be a place where we are setting divisions incorrectly that is now exposed because of the checks 😄
There was a problem hiding this comment.
I have a feeling it's the latter.
There was a problem hiding this comment.
After looking into this, think this is actually on us - this failure is coming up when we attempt to set the index to an ordered categorical column. Though the elements of divisions are strings, they represent categories with a different ordering.
One way I can imagine this could be resolved is if we checked if divisions is sorted when it's cast as a column/index with the same dtype as self.index? Or if we can access the ordering from the categorical dtype, we could make an override specifically for that index dtype.
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
dask/dataframe/core.py
Outdated
|
|
||
| @divisions.setter | ||
| def divisions(self, value): | ||
| self._divisions = tuple(value) |
There was a problem hiding this comment.
I have a feeling it's the latter.
|
|
||
| @divisions.setter | ||
| def divisions(self, value): | ||
| if None in value: |
There was a problem hiding this comment.
Can we also check somewhere that len(value) == self.npartitions + 1?
There was a problem hiding this comment.
Since npartitions is based off of the length of divisions, could we just check that len(value) == len(self._divisions)?
There was a problem hiding this comment.
@jrbourbeau brought up it might be less surprising, and allow us to do a deprecation cycle, if instead of immediately coercing to tuple, we warn if divisions isn't a tuple. This will have the added benefit of snuffling out any places in the dask codebase where we are setting divisions with a list.
There was a problem hiding this comment.
Yeah, specifically I'll propose we:
- Emit a deprecation warning when the divisions setter encounters a non-
tuplevaluethat says something like "we will automatically convert this to a tuple in the future". This will let us inform users who are explicitly settingdivisionstoday about this change. This may be overkill, but is in line with our desire to be better about informing users of behavior changes. This will also cause errors to be raised in our CI in all the places internally wheredivisionsis a non-tupletoday. - Change all the places internally in
dask/daskwhere we're not usingtupledivisions to be atuple. This should result in CI passing again. - Open an issue about the deprecation and when we the deprecation cycle should end
There was a problem hiding this comment.
That sounds good to me - in that case, would we also like to hold off on additional checks on the length / contexts of value until that deprecation cycle ends?
There was a problem hiding this comment.
My sense is a deprecation cycle is still better than raising an error in that case. The goal would be to avoid downstream breakages without some kind of warning first.
There was a problem hiding this comment.
Yeah, definitely in favor of the deprecation cycle here since there are known downstream libraries this would impact (dask-cudf and dask-sql in my case); I just prefer shifting from a DeprecationWarning to a ValueError in the future rather than implicitly converting to a tuple
There was a problem hiding this comment.
Ah, I see. I misinterpreted your previous comment. Post-deprecation, raising an error (probably a TypeError in this case) seems like a reasonable thing to do -- no strong opinion from me
There was a problem hiding this comment.
Following up on the original concern here (length check on the new divisions), it looks like there are several places in the codebase where a divisions with a different length will be provided as a means of repartitioning the dataframe. In cases like this, would we prefer instead using df.repartition(divisions=...)?
I can think of cases where this currently wouldn't work (for example, if we want to repartition using unknown divisions) that seem to motivate leaving out the length check for now, unless we want to address those areas in this PR
There was a problem hiding this comment.
I see how that could be a chicken and the egg problem :), but I don't understand how setting divisions to a different length would work for repartitioning the dataframe. Do you have a specific example of where that happens?
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
jsignell
left a comment
There was a problem hiding this comment.
I was looking to see if there are any comments that should be adjusted and I noticed that there are a lot of other places where divisions are passed as kwargs. set_index and repartition come to mind. I am wondering if those should also only accept tuples. There is a check_division function that you could probably reuse at least.
| ) | ||
| self._meta = meta | ||
| self.divisions = tuple(divisions) | ||
| self._divisions = tuple(divisions) |
There was a problem hiding this comment.
Do we want this coercion still?
There was a problem hiding this comment.
Hmm good point removing this will probably expose several areas in the codebase where we are passing a list divisions to the constructor; happy to change this to something like
| self._divisions = tuple(divisions) | |
| self.divisions = divisions |
IIRC that caused some issues with _divisions not being defined yet but that shouldn't be too difficult to resolve
There was a problem hiding this comment.
If it's not an antipattern to use the setter, that is probably easiest yeah.
There was a problem hiding this comment.
After working through some of the failures resulting from removing this coercion, having some second thoughts on removing it; this would essentially mean that all functions that implicitly set divisions (such as repartition) would need to pass tuple divisions to avoid breakage.
IMO, it would be less restrictive to users to allow iterable divisions for APIs that set them implicitly, and just silently coerce them to tuples internally; then the setter would only come into play when a user (or API) attempts to explicitly set divisions on a constructed frame.
There was a problem hiding this comment.
Ok I can see how that would become a massive change. It is fine with me to start with this, and we can always change it later if we decide that it's preferable to be more explicit.
dask/dataframe/core.py
Outdated
| def divisions(self, value): | ||
| if None in value: | ||
| if any(v is not None for v in value): | ||
| raise ValueError("`divisions` must be either all null or all non-null") |
There was a problem hiding this comment.
This should also be a warning not an Error for this first PR.
There was a problem hiding this comment.
What warning should be used here? I considered PendingDeprecationWarning but to my knowledge divisions with mixed null / non-null values weren't supported before?
There was a problem hiding this comment.
I don't have a strong feeling, FutureWarning or PendingDeprecationWarning is fine with me. There wasn't any check before so people could set whatever divisions they wanted right? Not to say that things would work after that, but they could always set the divisions.
jsignell
left a comment
There was a problem hiding this comment.
This seems like it's in a good state, @charlesbluca is it good to merge from your perspective?
|
@jsignell yeah I think it should be safe to merge this 🙂 |
- Turns `_Frame.divisions` into a property wrapping `_Frame._divisions`, with a setter that casts the provided value into a tuple - Adds an assertion to `assert_divisions` to check that `divisions` is a tuple Co-authored-by: Julia Signell <jsignell@gmail.com> Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
This PR does the following:
_Frame.divisionsinto a property wrapping_Frame._divisions, with a setter that casts the provided value into a tupleassert_divisionsto check thatdivisionsis a tuplecc @jsignell would you mind helping out with the docstring for the divisions property?
divisions#8388pre-commit run --all-files