
Raise exception for not implemented type of merge in dataframe #8138

Closed
ncclementi wants to merge 220 commits into dask:main from ncclementi:cross_exception

Conversation

@ncclementi
Member

  • Closes #xxxx
  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

This PR raises an exception with a better message when attempting to perform a merge with a how= value that is not implemented or doesn't exist. For example, the code from issue #8119

dd.merge(left, right, how="cross") currently fails with:

MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True

due to how="cross" not being implemented.

With this PR, the traceback looks like:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-1a5a6740d123> in <module>
----> 1 dd.merge(left, right, how="cross")

~/Documents/git/my_forks/dask/dask/dataframe/multi.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, suffixes, indicator, npartitions, shuffle, max_branch, broadcast)
    495 
    496     if how not in ("left", "right", "outer", "inner"):
--> 497         raise Exception(f"Type of merge how = '{how}' is not implemented or does not exist")
    498 
    499     if isinstance(left, (pd.Series, pd.DataFrame)) and isinstance(

Exception: Type of merge how = 'cross' is not implemented or does not exist
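A minimal sketch of the check described above (the function name `validate_how` is illustrative, not dask's API; a reviewer later suggests raising `ValueError` instead of a bare `Exception`):

```python
def validate_how(how):
    # Fail early with a clear message instead of pandas' opaque MergeError
    if how not in ("left", "right", "outer", "inner"):
        raise Exception(
            f"Type of merge how = '{how}' is not implemented or does not exist"
        )
```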

Member

@jsignell jsignell left a comment


I have a minor suggestion and it would be great to include a test that raises this exception.

ncclementi and others added 2 commits September 14, 2021 10:56
Co-authored-by: Julia Signell <jsignell@gmail.com>
}
)

with pytest.raises(Exception):
Member


nitpick, but you can use a specific class of Exception and match= to make sure that you are getting the error you expect.

Member


+1. Here's an example of another place in the codebase where we specify the specific type of error and message we expect to be raised:

with pytest.raises(ValueError, match="7 samples"):
    dask.array.stats.skewtest(a)
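As a sketch of what the suggested test could look like (`merge_stub` is a self-contained stand-in, not dask's merge): note that `match=` is treated as a regular expression searched against the exception message, so literal characters like quotes should be escaped.

```python
import re
import pytest

def merge_stub(how):
    # Hypothetical stand-in for the validation added in this PR
    supported_how = ("left", "right", "outer", "inner")
    if how not in supported_how:
        raise ValueError(f"merge does not support how='{how}'")

def test_unsupported_how():
    # match= is a regex searched in str(exc); escape literal quotes
    with pytest.raises(ValueError, match=re.escape("how='cross'")):
        merge_stub("cross")
```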

Member Author


Thanks for the feedback and example. If the match string is too long I can cut it to just match="how='cross". I wasn't sure what we prefer in this case.


supported_how = ("left", "right", "outer", "inner")
if how not in supported_how:
    raise Exception(
Member


Let's raise a more specific error here instead of the Exception base class. A ValueError seems appropriate in this particular case. (Sorry I should have brought this up in my earlier comment)

Member Author


no problem, : ) just updated

Member

@jrbourbeau jrbourbeau left a comment


It looks like this is causing legitimate GPU failures. For example

15:58:52 ____________ test_merge_tasks_semi_anti_cudf[cudf-leftsemi-parts1] _____________
15:58:52 [gw0] linux -- Python 3.8.10 /opt/conda/envs/dask/bin/python
15:58:52 
15:58:52 engine = 'cudf', how = 'leftsemi', parts = (3, 1)
15:58:52 
15:58:52     @pytest.mark.gpu
15:58:52     @pytest.mark.parametrize("parts", [(3, 3), (3, 1), (1, 3)])
15:58:52     @pytest.mark.parametrize("how", ["leftsemi", "leftanti"])
15:58:52     @pytest.mark.parametrize(
15:58:52         "engine",
15:58:52         [
15:58:52             "cudf",
15:58:52             pytest.param(
15:58:52                 "pandas",
15:58:52                 marks=pytest.mark.xfail(
15:58:52                     reason="Pandas does not support leftsemi or leftanti"
15:58:52                 ),
15:58:52             ),
15:58:52         ],
15:58:52     )
15:58:52     def test_merge_tasks_semi_anti_cudf(engine, how, parts):
15:58:52         if engine == "cudf":
15:58:52             # NOTE: engine == "cudf" requires cudf/dask_cudf,
15:58:52             # will be skipped by non-GPU CI.
15:58:52     
15:58:52             cudf = pytest.importorskip("cudf")
15:58:52             dask_cudf = pytest.importorskip("dask_cudf")
15:58:52     
15:58:52         emp = pd.DataFrame(
15:58:52             {
15:58:52                 "emp_id": np.arange(101, stop=106),
15:58:52                 "name": ["John", "Tom", "Harry", "Rahul", "Sakil"],
15:58:52                 "city": ["Cal", "Mum", "Del", "Ban", "Del"],
15:58:52                 "salary": [50000, 40000, 80000, 60000, 90000],
15:58:52             }
15:58:52         )
15:58:52         skills = pd.DataFrame(
15:58:52             {
15:58:52                 "skill_id": [404, 405, 406, 407, 408],
15:58:52                 "emp_id": [103, 101, 105, 102, 101],
15:58:52                 "skill_name": ["Dask", "Spark", "C", "Python", "R"],
15:58:52             }
15:58:52         )
15:58:52     
15:58:52         if engine == "cudf":
15:58:52             emp = cudf.from_pandas(emp)
15:58:52             skills = cudf.from_pandas(skills)
15:58:52             dd_emp = dask_cudf.from_cudf(emp, npartitions=parts[0])
15:58:52             dd_skills = dask_cudf.from_cudf(skills, npartitions=parts[1])
15:58:52         else:
15:58:52             dd_emp = dd.from_pandas(emp, npartitions=parts[0])
15:58:52             dd_skills = dd.from_pandas(skills, npartitions=parts[1])
15:58:52     
15:58:52         expect = emp.merge(skills, on="emp_id", how=how).sort_values(["emp_id"])
15:58:52 >       result = dd_emp.merge(dd_skills, on="emp_id", how=how).sort_values(["emp_id"])
15:58:52 
15:58:52 dask/dataframe/tests/test_multi.py:919: 
15:58:52 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
15:58:52 /opt/conda/envs/dask/lib/python3.8/site-packages/dask_cudf/core.py:139: in merge
15:58:52     return super().merge(other, on=on, shuffle="tasks", **kwargs)
15:58:52 dask/dataframe/core.py:4620: in merge
15:58:52     return merge(
15:58:52 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
15:58:52 
15:58:52 left = <dask_cudf.DataFrame | 2 tasks | 2 npartitions>
15:58:52 right = <dask_cudf.DataFrame | 1 tasks | 1 npartitions>, how = 'leftsemi'
15:58:52 on = None, left_on = 'emp_id', right_on = 'emp_id', left_index = False
15:58:52 right_index = False, suffixes = ('_x', '_y'), indicator = False
15:58:52 npartitions = None, shuffle = 'tasks', max_branch = None, broadcast = None
15:58:52 
15:58:52     @wraps(pd.merge)
15:58:52     def merge(
15:58:52         left,
15:58:52         right,
15:58:52         how="inner",
15:58:52         on=None,
15:58:52         left_on=None,
15:58:52         right_on=None,
15:58:52         left_index=False,
15:58:52         right_index=False,
15:58:52         suffixes=("_x", "_y"),
15:58:52         indicator=False,
15:58:52         npartitions=None,
15:58:52         shuffle=None,
15:58:52         max_branch=None,
15:58:52         broadcast=None,
15:58:52     ):
15:58:52         for o in [on, left_on, right_on]:
15:58:52             if isinstance(o, _Frame):
15:58:52                 raise NotImplementedError(
15:58:52                     "Dask collections not currently allowed in merge columns"
15:58:52                 )
15:58:52         if not on and not left_on and not right_on and not left_index and not right_index:
15:58:52             on = [c for c in left.columns if c in right.columns]
15:58:52             if not on:
15:58:52                 left_index = right_index = True
15:58:52     
15:58:52         if on and not left_on and not right_on:
15:58:52             left_on = right_on = on
15:58:52             on = None
15:58:52     
15:58:52         supported_how = ("left", "right", "outer", "inner")
15:58:52         if how not in supported_how:
15:58:52 >           raise ValueError(
15:58:52                 f"dask.dataframe.merge does not support how='{how}'. Options are: {supported_how}"
15:58:52             )
15:58:52 E           ValueError: dask.dataframe.merge does not support how='leftsemi'. Options are: ('left', 'right', 'outer', 'inner')
15:58:52 
15:58:52 dask/dataframe/multi.py:498: ValueError

It looks like cuDF's merge(...) supports additional options for how= (e.g. "leftsemi"), though the corresponding API docs say {‘left’, ‘outer’, ‘inner’} are the supported options. @rjzamora @jakirkham can you comment on what values for how= cuDF supports?

@jakirkham
Member

cc @galipremsagar (in case you have thoughts here 🙂)

@ncclementi
Member Author

Fixed the merge conflicts but we are still having issues with the gpuCI, see #8138 (review) @rjzamora would you be able to comment on this, and suggest a possible approach?

* Fix: `DataFrame.head` shouldn't warn when there's one partition

* Fixups

- Add test
- Simplify logic

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
@jrbourbeau
Member

It looks like cudf also supports a how='leftanti' option

@jakirkham
Member

JFYI Rick's OOTO, so it might be a bit before we hear from him. Will check in with him about this when he gets back.

Comment on lines +515 to +520
supported_how = ("left", "right", "outer", "inner")
if how not in supported_how:
    raise ValueError(
        f"dask.dataframe.merge does not support how='{how}'. Options are: {supported_how}"
    )

Member


My only concern is that cudf also supports "leftsemi" and "leftanti", and dask-cudf is currently using this code. Not sure of the best way to deal with this variation between pandas and cudf.

@galipremsagar
Contributor

galipremsagar commented Oct 14, 2021

can you comment on what values for how= cuDF supports?

cudf merge supports "left", "inner", "outer", "leftanti", "leftsemi". Instead of this approach, could we have a dispatch method for validating how? I know a dispatch just for a parameter sounds like overkill, but this is what is currently coming to my mind.
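For illustration only, a minimal sketch of what such a dispatch could look like. The `Dispatch` class here is a simplified stand-in written from scratch (similar in spirit to `dask.utils.Dispatch`), and the cudf registration is shown only as a comment since cudf may not be installed:

```python
class Dispatch:
    """Minimal type-based dispatch: looks up a handler by the argument's MRO."""

    def __init__(self):
        self._lookup = {}

    def register(self, typ, func=None):
        def wrapper(func):
            self._lookup[typ] = func
            return func
        return wrapper(func) if func is not None else wrapper

    def __call__(self, arg, *args, **kwargs):
        # Walk the type's MRO so subclasses fall back to parent handlers
        for typ in type(arg).__mro__:
            if typ in self._lookup:
                return self._lookup[typ](arg, *args, **kwargs)
        raise TypeError(f"No dispatch registered for {type(arg)}")

validate_merge_how = Dispatch()

@validate_merge_how.register(object)  # default: pandas-backed frames
def _(frame, how):
    supported = ("left", "right", "outer", "inner")
    if how not in supported:
        raise ValueError(f"merge does not support how='{how}'")

# A cudf backend could then register a wider set, e.g.:
# @validate_merge_how.register(cudf.DataFrame)
# def _(frame, how): ...  # also allow "leftsemi" / "leftanti"
```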

@rjzamora
Member

cudf merge supports "left", "inner", "outer", "leftanti", "leftsemi". Instead of this approach, could we have a dispatch method for validating how? I know a dispatch just for a parameter sounds like overkill, but this is what is currently coming to my mind.

The "easiest" approach is probably to include "leftanti" and "leftsemi" in the list of supported options (as long as pandas raises a reasonable error).

@galipremsagar
Contributor

cudf merge supports "left", "inner", "outer", "leftanti", "leftsemi". Instead of this approach, could we have a dispatch method for validating how? I know a dispatch just for a parameter sounds like overkill, but this is what is currently coming to my mind.

The "easiest" approach is probably to include "leftanti" and "leftsemi" in the list of supported options (as long as pandas raises a reasonable error).

Yeah this would be possible:

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1, 2, 3]})
>>> df
   a
0  1
1  2
2  3
>>> df.merge(df, how='ll')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/frame.py", line 9191, in merge
    return merge(
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 120, in merge
    return op.get_result()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 714, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 965, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 939, in _get_join_indexers
    return get_join_indexers(
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1495, in get_join_indexers
    join_func = {
KeyError: 'll'
>>> df.merge(df, how='leftanti')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/frame.py", line 9191, in merge
    return merge(
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 120, in merge
    return op.get_result()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 714, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 965, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 939, in _get_join_indexers
    return get_join_indexers(
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1495, in get_join_indexers
    join_func = {
KeyError: 'leftanti'
>>> df.merge(df, how='cross')
   a_x  a_y
0    1    1
1    1    2
2    1    3
3    2    1
4    2    2
5    2    3
6    3    1
7    3    2
8    3    3

@ncclementi
Member Author

It seems the easiest fix is to add "leftanti" and "leftsemi" to the supported_how options; although these options are not supported by dask itself, they are in cuDF. I'm a little hesitant to include this since it can be misleading, unless we add a comment explaining this, something like:

"'leftanti' and 'leftsemi' are not actually supported, but they were added to this list since cuDF supports them and dask_cudf relies on this code."

@jrbourbeau do you think this ^ is enough, or should we find a workaround?

jrbourbeau and others added 5 commits October 18, 2021 10:59
Implements the suggestion proposed by @choldgraf here dask#8227 (comment) to try to cut down our documentation build time
The expected behavior for `dd.info(verbose=True)` should be to also return the total memory being used; this PR brings dask in line with pandas and will prevent confusion like issue dask#8115
to_zarr already handles it so this allows from_zarr
to be on par with it.
Dranaxel and others added 13 commits February 15, 2022 10:57
Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
* Update tokenize to treat dict and kwargs differently

* Apply suggestion from Jim's review
This PR moves the handling of custom sorting functions to `shuffle.sort_values`, so that usages of the internal `sort_values` function will not have to manually specify a default `sort_function` and `sort_function_kwargs`.

This originated as a concern in the downstream implementation of this in rapidsai/cudf#9789
@jsignell
Member

I think you can just go ahead and add them to the list. This is an improvement over the current state.
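Assuming the approach settled on above, the check might end up looking like this. This is a sketch based on the discussion, not the merged diff:

```python
# "leftsemi" and "leftanti" are not implemented by dask itself, but
# dask_cudf reuses this code path and cuDF supports them; pandas-backed
# frames will still raise their own error for these values.
supported_how = ("left", "right", "outer", "inner", "leftsemi", "leftanti")

def validate_how(how):
    # Raise a clear ValueError for anything outside the supported set
    if how not in supported_how:
        raise ValueError(
            f"dask.dataframe.merge does not support how='{how}'. "
            f"Options are: {supported_how}"
        )
```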

@github-actions github-actions bot added the array, dispatch, documentation, and io labels Mar 16, 2022
@ncclementi
Member Author

Thanks for the ping @jsignell, I pushed a change to include them.


@jsignell
Member

Hmmm something seems to be a little wrong with the diff on this.

@ncclementi
Member Author

Oh shoot, I just noticed that. I might have merged main incorrectly on my local version. I can open a new PR that says it supersedes this one and get that solved, unless there is a better way of doing this.


@jcrist
Member

jcrist commented Mar 17, 2022

Superseded by #8818, closing.


Labels

array · dataframe · dispatch · documentation · io
