Skip to content

Ensure single-partition join divisions is tuple#8389

Merged
jsignell merged 2 commits intodask:mainfrom
charlesbluca:single-partition-join-divisions
Nov 17, 2021
Merged

Ensure single-partition join divisions is tuple#8389
jsignell merged 2 commits intodask:mainfrom
charlesbluca:single-partition-join-divisions

Conversation

@charlesbluca
Copy link
Member

We ran into some downstream breakage in dask-sql (see dask-contrib/dask-sql#320) as a result of #8341's changing single_partition_join's output divisions from tuple to list - this PR casts them to tuple before returning.

Ideally depending on discussion in #8388, we might enforce all divisions to be one type, which should cut down on breakages like this in the future.

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for surfacing this @charlesbluca -- and hooray for upstream testing 🎉

This seems like a reasonable quick fix (it's similar to what we do elsewhere), though I agree with you that converging on a standard manner for handling divisions is in order. Is there a regression test we can add here to ensure we don't revert this fix in the future?

cc @gjoseph92 for visibility

@charlesbluca
Copy link
Member Author

A pretty simple test could be to simply perform a join using single_partition_join and then do an assertion on divisions type, not sure if that seems too trivial.

@jsignell
Copy link
Member

Thanks @charlesbluca this is a nice little fix.

@jsignell jsignell merged commit b196e4a into dask:main Nov 17, 2021

assert_eq(result, expected)
assert result.divisions == expected.divisions
assert len(result.__dask_graph__()) < 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this test. Why do we need to check the graph length, and compare the result of merge vs single_partition_join (which should be the same code path)? If we just want a regression test that single_partition_join's divisions are always a tuple, maybe we could just add one line checking that to test_multi.py's series of test_cheap_single_partition_merge_* functions?

Personally I'd much prefer to not see a regression test (or fix) for this specific issue, but rather to make the setter suggested in #8388. That would solve the problem across the board. And it could easily be tested across the board with a couple lines changed in assert_eq and assert_divisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah TBH I didn't put much thought into this test, just copied some of the other tests from the file but made sure to do the comparison between calling single_partition_join directly versus using merge. I verified that these tests would fail without my small change, I think because merge might be relying on merge_indexed_dataframes instead of single_partition_join.

In general, I agree that we should not have tests like this, and was planning to remove this once whatever solution is decided in #8388 is implemented.

**kwargs,
)
joined.divisions = divisions
joined.divisions = tuple(divisions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. The divisions objects here weren't actually changed by #8341 at all. The issue is just that in the code before, we called new_dd_object, which called into the _Frame constructor, which does coerce divisions passed in to a tuple:

self.divisions = tuple(divisions)

Another vote in favor of a setter to continue this pattern!

charlesbluca added a commit to charlesbluca/dask that referenced this pull request Nov 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants