Fix meta creation in case of object (#7586)
Conversation
@galipremsagar apologies for letting this slip. @pentschev would you have time to review this?
pentschev
left a comment
Overall looks good, I've added a minor suggestion. However, DataFrame meta is a bit different from Array meta, which I'm not totally familiar with, so maybe would be good to get a review from someone more knowledgeable as well, perhaps @rjzamora ?
dask/dataframe/utils.py
Outdated

```python
def make_meta_util(x, index=None, parent_meta=None):
    import dask.dataframe as dd

    if isinstance(x, (dd.core.Series, dd.core.DataFrame)):
```
I think `dd.core.Series` and `dd.core.DataFrame` already have `_meta`, no? In that case this may be redundant and could be removed in favor of the condition immediately below: `if hasattr(x, "_meta")`.
Right, make_meta_object does something like the following to capture these cases:

```python
if hasattr(x, "_meta"):
    return x._meta
elif is_arraylike(x) and x.shape:
    return x[:0]
```

However, if that logic is already captured in make_meta_object (which is now registered with make_meta_obj), it doesn't seem like this check is necessary at all. Is this correct?
Good catch. I think your assessment is right, Rick. Unless I'm missing something, we could indeed remove both checks here.
Yes, this is a redundant check; I got rid of it in make_meta_object. The reason: since the make_meta dispatch was registered against the generic Python `object` type, all objects would simply pass through make_meta. With this change that won't happen, so it becomes the task of make_meta_util to handle this. The is_arraylike check, however, shouldn't be done in make_meta_util, as objects like pd.Series/cudf.Series need to go through the dispatch mechanism below.
rjzamora
left a comment
Thank you for working on this @galipremsagar !
I am sorry for being late to the party here, but I'd like to clarify the solution a bit before signing off...
My understanding is that the general @make_meta.register(object) definition in dask-cudf (which is registered upon import) is incorrectly taking over when cudf is not even in use. If the make_meta dispatching was being performed in the way that concat is used throughout Dask-Dataframe, then there would be a "middle-man" function where we could add a kwarg like parent_meta=. However, Dask-Dataframe is using the dispatch name directly (make_meta), and so you are effectively adding this "middle-man" function under the name make_meta_util (and using it throughout the code base). Is this correct?
If so, then I think this solution makes sense -- At least, I cannot think of a better way to do it :)
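The "middle-man" pattern described above can be sketched with a minimal, hypothetical example. The names `make_meta_dispatch` and the backend-selection logic here are illustrative only, not dask's actual internals:

```python
# Hypothetical sketch of the "middle-man" pattern: a plain wrapper function
# accepts extra keyword arguments (parent_meta) that a bare dispatch call
# could not, and decides how to dispatch. Names are illustrative.

def make_meta_dispatch(x, backend="pandas"):
    # Stand-in for the type-registered dispatch; a real Dispatch would look
    # up an implementation based on a registered type.
    return f"{backend} meta for {type(x).__name__}"

def make_meta_util(x, index=None, parent_meta=None):
    # Objects that already carry metadata need no dispatch at all.
    if hasattr(x, "_meta"):
        return x._meta
    # Use the parent's backend, not the (possibly generic) value itself,
    # to pick the backend-specific implementation.
    if parent_meta is not None:
        return make_meta_dispatch(x, backend=type(parent_meta).__module__.split(".")[0])
    return make_meta_dispatch(x)

print(make_meta_util(42))  # prints "pandas meta for int" (the default backend)
```

Because the wrapper is a plain function rather than a registered dispatch entry, it can be used throughout the code base without two libraries fighting over the same registration slot.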
dask/dataframe/utils.py
Outdated

```python
return x._meta

try:
    return make_meta(x, index=index)
```
Does this mean we still need to remove the make_meta.register(object) definition in dask-cudf before the original issue is resolved? Not a problem if so, I am just trying to understand everything.
Correct. We'll need dask-cudf-side changes to accommodate this new dispatch: `make_meta_object`.
dask/dataframe/utils.py
Outdated

```python
def make_meta_util(x, index=None, parent_meta=None):
    import dask.dataframe as dd

    if isinstance(x, (dd.core.Series, dd.core.DataFrame)):
```
Right, make_meta_object does something like the following to capture these cases:

```python
if hasattr(x, "_meta"):
    return x._meta
elif is_arraylike(x) and x.shape:
    return x[:0]
```

However, if that logic is already captured in make_meta_object (which is now registered with make_meta_obj), it doesn't seem like this check is necessary at all. Is this correct?
@rjzamora yes, you got it right. However, I'm adding a
rjzamora
left a comment
Thanks Prem! This is looking good.
I am wondering if we can make the changes a bit lighter if we: (1) avoid changing the new_dd_object function signature, (2) avoid making scipy a requirement, (3) avoid passing parent_meta to map_partitions, and (4) avoid passing parent_meta in places where an already-provided meta object is sufficient.
There is a perfectly good chance that I am misunderstanding the changes. So, feel free to push back on any of my comments/suggestions :)
```diff
-def new_dd_object(dsk, name, meta, divisions):
+def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
```
Suggested change:

```diff
-def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
+def new_dd_object(dsk, name, meta, divisions):
```
Doesn't look like this change is necessary (I don't think parent_meta is used in new_dd_object)
Needed for this: https://github.com/dask/dask/pull/7586/files#r637579242
```python
    $META
    """
    name = kwargs.pop("token", None)
    parent_meta = kwargs.pop("parent_meta", None)
```
Are there expected cases where a user would need the option of passing this in? Is there an expected reason that the _meta of the first _Frame in args is not a good assumption?
This would be similar to the above one: the _Frame list can be all empty, or the args may have no _Frame at all, and the meta at hand is too generic.
```diff
     divisions = [None] * (split_out + 1)

-    return new_dd_object(graph, b, meta, divisions)
+    return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
```
Suggested change:

```diff
-    return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
+    return new_dd_object(graph, b, meta, divisions)
```
I don't think we need this, but I may be missing something.
Similar issue here, meta can be too generic:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(5535)apply_concat_apply()
(Pdb) dfs[0]._meta
Series([], dtype: int64)
(Pdb) meta
1
```

```python
    )
    warnings.warn(meta_warning(meta))

kwds.update({"parent_meta": self._meta})
```
It would be nice if we didn't need to pass parent metadata into map_partitions since we could easily use the _meta on the first _Frame object within that function.
The reason map_partitions would need a parent_meta is that the dfs list of frame objects can be empty too:
Lines 5591 to 5594 in 640df6b:

```python
    token=keyname,
    enforce_metadata=False,
    meta=(q, "f8"),
    parent_meta=self._meta,
```
Another case where I'd like to avoid passing parent_meta if we can avoid it.
This case is an example where meta is (q, 'f8') and we cannot really know what the parent's meta is.
```diff
 )

-other_meta = make_meta(other)
+other_meta = make_meta_util(other, parent_meta=self._parent_meta)
```
Can we not pass in parent_meta=self._meta here? Is the Scalar meta still too "general"?
Yes, for some scalars the meta seems to be too generic, like int, hence the need to pass self._parent_meta here. For example:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(269)_scalar_binary()
(Pdb) self._meta
1
(Pdb) self._parent_meta
Series([], dtype: float64)
```
```diff
     meta = parent_meta
 else:
-    meta = make_meta(meta)
+    meta = make_meta_util(meta, parent_meta=parent_meta)
```
Is there a good reason to pass parent_meta if a meta object is already provided? That is, shouldn't meta already be an appropriate DataFrame type? My intuition tells me that the above changes should just be swapping make_meta with make_meta_util.
> That is, shouldn't meta already be an appropriate DataFrame type?

Sadly, nope. For example:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/io/io.py(603)from_delayed()
(Pdb) meta
[('a', 'f8'), ('b', 'f8'), ('c', 'f8'), ('d', 'f8')]
(Pdb) parent_meta
Empty DataFrame
Columns: [a, b, c, d]
Index: []
```
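The point above can be illustrated with a short standalone sketch (assuming pandas is available): a meta "spec" such as a list of (column, dtype) tuples carries no backend information at all, whereas the parent_meta object's concrete type does.

```python
# A meta "spec" like a list of (column, dtype) tuples is backend-agnostic:
# the identical spec could describe a pandas or a cudf DataFrame, so the
# spec alone cannot select a backend. The parent_meta object's concrete
# type, by contrast, pins the backend down.

import pandas as pd

meta_spec = [("a", "f8"), ("b", "f8"), ("c", "f8"), ("d", "f8")]
parent_meta = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in meta_spec}
)

print(type(meta_spec))    # <class 'list'> -- no backend information
print(type(parent_meta))  # <class 'pandas.core.frame.DataFrame'> -- backend is clear
```

This is why the dispatch has to consult parent_meta even when a meta argument was already provided.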
rjzamora
left a comment
Thank you for answering my questions @galipremsagar - the changes here seem reasonable to me. Thanks again for attacking this!
Co-authored-by: jakirkham <jakirkham@gmail.com>
cc @jrbourbeau @jschendel (for thoughts as well)
I brought up this PR at the dask maintainer meeting and want to give other folks an opportunity to comment. If we don't hear back by EOD I'll merge it in if there are no additional comments.
It would be good to hear back from people. However, RAPIDS CI is broken on multiple projects atm without this change, PR ( rapidsai/cudf#8342 ), and PR ( rapidsai/dask-cuda#623 ). These are all needed as a consequence of PR ( #7503 ) and PR ( #7505 ) having been merged yesterday. The longer the wait, the more people will be blocked. I would propose we go ahead and merge this and follow up on any concerns raised in a subsequent PR.
Agreed, there is some urgency here. What do you think about waiting another 30 minutes (until 2PM EST)? I think this is a big enough change that we need to give a little bit of buffer for other folks in case they have concerns.
Will defer to you. Though it's worth noting this PR was originally submitted ~1 month ago, so it has already been around for a while.
Sorry for the overhead and thank you for the patience. Perhaps I was overly concerned.
Thanks Ben 🙂 If people do find issues here, please let us know and we can follow up.
Thanks for raising @martinfleis! This is how we are handling it ( rapidsai/cudf#8368 ). Edit: more details here ( geopandas/dask-geopandas#48 (comment) ).
This is how we would recommend handling it for backward compatibility with older versions of dask. But, in addition
PR ( geopandas/dask-geopandas#47 ) updates dask-geopandas w.r.t. this change.
There is an issue, rapidsai/cudf#7946, where metadata creation ends up depending purely on the order of importing a backend instead of the correct backend itself.

The root cause of the above issue is that we have the `make_meta` dispatch registered both in dask and dask-cudf against the same type, `object`, and the Dispatch class will end up storing only the function of the last/most-recently registered dispatch (since it's a simple dict). In this PR I have made a new utility function by the name `make_meta_util`, and a new dispatch that is responsible for object meta creation but is registered against the `object` type of a specific backend. This way we can guarantee the correct metadata is generated and, in turn, the right backend APIs are invoked.

One additional thing we have to do to facilitate this change is pass in `parent_meta` instead of `meta`, because we want to know the real API backend to invoke. Some pandas APIs return numpy objects, which are then stored as `meta` and will not really help us determine which backend to hit; but when we store/pass `parent_meta` to this utility, it helps us correctly determine the backend that is needed and dispatch accordingly.

cc: @jakirkham @quasiben @rjzamora @beckernick @kkraus14
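The registration conflict described above can be reproduced with a minimal stand-in for a dict-backed dispatch. This is a sketch, not dask's actual Dispatch implementation, but it shows why "last registration wins" ties the chosen backend to import order:

```python
# Minimal sketch of a dict-backed dispatch: the lookup is a plain dict,
# so when two libraries register against the same generic ``object`` type,
# the most recent registration silently overwrites the earlier one.

class Dispatch:
    def __init__(self):
        self._lookup = {}

    def register(self, typ):
        def wrapper(func):
            self._lookup[typ] = func  # overwrites any earlier entry for typ
            return func
        return wrapper

    def __call__(self, arg):
        # Walk the MRO so subclasses find registrations on base classes.
        for cls in type(arg).__mro__:
            if cls in self._lookup:
                return self._lookup[cls](arg)
        raise TypeError(f"no dispatch registered for {type(arg)}")

make_meta = Dispatch()

@make_meta.register(object)
def make_meta_pandas(x):  # registered when dask.dataframe is imported
    return "pandas-backed meta"

@make_meta.register(object)
def make_meta_cudf(x):  # registered later when dask-cudf is imported
    return "cudf-backed meta"

# Import order, not the input's type, now decides the backend:
print(make_meta(42))  # prints "cudf-backed meta"
```

Registering the object-handling path against each backend's own types, and routing everything through a plain wrapper function, avoids this single shared `object` slot.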