Ensure single-partition join divisions is tuple#8389
Conversation
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks for surfacing this @charlesbluca -- and hooray for upstream testing 🎉
This seems like a reasonable quick fix (it's similar to what we do elsewhere), though I agree with you that converging on a standard manner for handling divisions is in order. Is there a regression test we can add here to ensure we don't revert this fix in the future?
cc @gjoseph92 for visibility
|
A pretty simple test could be to simply perform a join using |
|
Thanks @charlesbluca this is a nice little fix. |
|
|
||
| assert_eq(result, expected) | ||
| assert result.divisions == expected.divisions | ||
| assert len(result.__dask_graph__()) < 30 |
There was a problem hiding this comment.
I'm a little confused by this test. Why do we need to check the graph length, and compare the result of merge vs single_partition_join (which should be the same code path)? If we just want a regression test that single_partition_join's divisions are always a tuple, maybe we could just add one line checking that to test_multi.py's series of test_cheap_single_partition_merge_* functions?
Personally I'd much prefer to not see a regression test (or fix) for this specific issue, but rather to make the setter suggested in #8388. That would solve the problem across the board. And it could easily be tested across the board with a couple lines changed in assert_eq and assert_divisions.
There was a problem hiding this comment.
Yeah TBH I didn't put much thought into this test, just copied some of the other tests from the file but made sure to do the comparison between calling single_partition_join directly versus using merge. I verified that these tests would fail without my small change, I think because merge might be relying on merge_indexed_dataframes instead of single_partition_join.
In general, I agree that we should not have tests like this, and was planning to remove this once whatever solution is decided in #8388 is implemented.
| **kwargs, | ||
| ) | ||
| joined.divisions = divisions | ||
| joined.divisions = tuple(divisions) |
There was a problem hiding this comment.
Interesting. The divisions objects here weren't actually changed by #8341 at all. The issue is just that in the code before, we called new_dd_object, which called into the _Frame constructor, which does coerce divisions passed in to a tuple:
Line 325 in 6c5b482
Another vote in favor of a setter to continue this pattern!
This reverts commit b196e4a.
We ran into some downstream breakage in dask-sql (see dask-contrib/dask-sql#320) as a result of #8341's changing
single_partition_join's outputdivisionsfrom tuple to list - this PR casts them to tuple before returning.Ideally depending on discussion in #8388, we might enforce all
divisionsto be one type, which should cut down on breakages like this in the future.pre-commit run --all-files