Fix meta creation in case of object #7586

Merged

quasiben merged 21 commits into dask:main from galipremsagar:7946 on May 25, 2021

Conversation

@galipremsagar (Contributor) commented Apr 21, 2021

There is an issue (rapidsai/cudf#7946) where metadata creation ends up depending purely on the order in which backends are imported, instead of on the correct backend itself.

The root cause of the above issue is that a make_meta dispatch is registered against the same type, object, in both dask and dask-cudf. Because the Dispatch class is backed by a simple dict, it stores only the function of the last/most recently registered dispatch for a given type. In this PR I have added a new utility function named make_meta_util, plus a new dispatch responsible for object meta creation that is registered against the object type of a specific backend. This way we can guarantee that the correct metadata is generated and, in turn, that the right backend APIs are invoked.
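To illustrate the clobbering described above, here is a minimal, hypothetical sketch (a deliberately simplified stand-in for dask's real Dispatch class, with made-up handler names and return strings) of why two registrations against the same type on a dict-backed dispatcher keep only the last one:

```python
class Dispatch:
    """Simplified stand-in for a dict-backed dispatcher: a plain dict maps
    type -> handler, so registering the same type twice silently keeps
    only the most recently registered handler."""

    def __init__(self):
        self._lookup = {}

    def register(self, typ):
        def decorator(func):
            self._lookup[typ] = func  # same key: earlier entry is overwritten
            return func
        return decorator

    def __call__(self, arg):
        # Walk the MRO so subclasses fall back to a base-type handler.
        for typ in type(arg).__mro__:
            if typ in self._lookup:
                return self._lookup[typ](arg)
        raise TypeError(f"no handler for {type(arg)}")


make_meta = Dispatch()

@make_meta.register(object)  # stand-in for the registration done inside dask
def make_meta_pandas(x):
    return "pandas-backed meta"

@make_meta.register(object)  # stand-in for a later registration on dask-cudf import
def make_meta_cudf(x):
    return "cudf-backed meta"

# Whichever backend happened to register last now wins for every object:
print(make_meta(42))  # -> cudf-backed meta
```

Whoever imports last wins, regardless of what kind of data is actually being processed, which is exactly the import-order sensitivity the PR removes.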

One additional change needed to facilitate this is passing in parent_meta instead of meta, so that we know the real backend API to invoke. Some pandas APIs return numpy objects, which are then stored as meta; those are not helpful for determining which backend to hit. Storing and passing parent_meta to this utility lets us correctly determine the backend that is needed and dispatch accordingly.
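The parent_meta flow described above can be sketched as follows. This is a hedged illustration only: the function body and the string results are hypothetical stand-ins for the real registered handlers, not the PR's actual implementation.

```python
import numpy as np
import pandas as pd

def make_meta_util(x, index=None, parent_meta=None):
    # Objects that already carry meta can return it directly.
    if hasattr(x, "_meta"):
        return x._meta
    # A numpy scalar says nothing about which DataFrame backend produced
    # it, so dispatch on parent_meta's type when it is available.
    dispatch_on = parent_meta if parent_meta is not None else x
    if isinstance(dispatch_on, (pd.Series, pd.DataFrame, pd.Index)):
        return "pandas handler"  # stand-in for the pandas-registered function
    return "object handler"     # stand-in for the generic fallback

# meta is a bare numpy scalar; parent_meta reveals the pandas backend:
print(make_meta_util(np.int64(1), parent_meta=pd.Series(dtype="int64")))
# without parent_meta we can only fall back to the generic handler:
print(make_meta_util(np.int64(1)))
```

The same pattern would let a cudf parent_meta route the call to cudf-registered handlers instead.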
cc: @jakirkham @quasiben @rjzamora @beckernick @kkraus14

  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

@quasiben (Member) commented May 18, 2021

@galipremsagar apologies for letting this slip. @pentschev would you have time to review this?

@pentschev (Member) left a comment

Overall looks good; I've added a minor suggestion. However, DataFrame meta is a bit different from Array meta, which I'm not totally familiar with, so it may be good to get a review from someone more knowledgeable as well, perhaps @rjzamora?

    def make_meta_util(x, index=None, parent_meta=None):
        import dask.dataframe as dd

        if isinstance(x, (dd.core.Series, dd.core.DataFrame)):
Member

I think dd.core.Series and dd.core.DataFrame already have _meta, no? In that case this may be redundant and could be removed in favor of the condition immediately below: if hasattr(x, "_meta").

Member

Right, make_meta_object does something like the following to capture these cases:

    if hasattr(x, "_meta"):
        return x._meta
    elif is_arraylike(x) and x.shape:
        return x[:0]

However, if that logic is already captured in make_meta_object (which is now registered with make_meta_obj), it doesn't seem like this check is necessary at all. Is this correct?

Member

Good catch. I think your assessment is right, Rick. Unless I'm missing something, we could indeed remove both checks here.

Contributor Author

Yes, this is a redundant check; I got rid of it in make_meta_object. Previously, since the make_meta dispatch was registered against the generic Python object type, all objects would simply pass through make_meta. With this change that no longer happens, so handling these cases becomes the task of make_meta_util. The is_arraylike check, however, shouldn't be done in make_meta_util, because objects like pd.Series/cudf.Series need to go through the dispatch mechanism below.

@rjzamora (Member) left a comment

Thank you for working on this @galipremsagar !

I am sorry for being late to the party here, but I'd like to clarify the solution a bit before signing off...

My understanding is that the general @make_meta.register(object) definition in dask-cudf (which is registered upon import) incorrectly takes over even when cudf is not in use. If make_meta dispatching were performed the way concat is used throughout Dask-Dataframe, there would be a "middle-man" function where we could add a kwarg like parent_meta=. However, Dask-Dataframe uses the dispatch name directly (make_meta), so you are effectively adding this "middle-man" function under the name make_meta_util (and using it throughout the code base). Is this correct?

If so, then I think this solution makes sense -- At least, I cannot think of a better way to do it :)

        return x._meta

    try:
        return make_meta(x, index=index)
Member

Does this mean we still need to remove the make_meta.register(object) definition in dask-cudf before the original issue is resolved? Not a problem if so, I am just trying to understand everything.

Contributor Author

Correct. We'll need dask-cudf-side changes to accommodate this new dispatch, make_meta_object.


@galipremsagar (Contributor Author)

> Thank you for working on this @galipremsagar !
>
> I am sorry for being late to the party here, but I'd like to clarify the solution a bit before signing off...
>
> My understanding is that the general @make_meta.register(object) definition in dask-cudf (which is registered upon import) is incorrectly taking over when cudf is not even in use. If the make_meta dispatching was being performed in the way that concat is used throughout Dask-Dataframe, then there would be a "middle-man" function where we could add a kwarg like parent_meta=. However, Dask-Dataframe is using the dispatch name directly (make_meta), and so you are effectively adding this "middle-man" function under the name make_meta_util (and using it throughout the code base). Is this correct?
>
> If so, then I think this solution makes sense -- At least, I cannot think of a better way to do it :)

@rjzamora yes, you got it right.

However, I'm adding a scipy import in utils.py, which is causing the GitHub test_imports check to fail. Any idea where else we need to add this package? @rjzamora @jakirkham

@rjzamora (Member) left a comment

Thanks Prem! This is looking good.

I am wondering if we can make the changes a bit lighter if we: (1) avoid changing the new_dd_object function signature, (2) avoid making scipy a requirement, (3) avoid passing parent_meta to map_partitions, and (4) avoid passing parent_meta in places where an already-provided meta object is sufficient.

There is a perfectly good chance that I am misunderstanding the changes. So, feel free to push back on any of my comments/suggestions :)



-   def new_dd_object(dsk, name, meta, divisions):
+   def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
Member

Suggested change
-   def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
+   def new_dd_object(dsk, name, meta, divisions):

Doesn't look like this change is necessary (I don't think parent_meta is used in new_dd_object)

Contributor Author

    $META
    """
    name = kwargs.pop("token", None)
    parent_meta = kwargs.pop("parent_meta", None)
Member

Are there expected cases where a user would need the option of passing this in? Is there an expected reason that the _meta of the first _Frame in args is not a good assumption?

Contributor Author

This would be similar to the above one: the _Frame list can be entirely empty, or args may contain no _Frame at all while the meta at hand is far too generic.

    divisions = [None] * (split_out + 1)

-   return new_dd_object(graph, b, meta, divisions)
+   return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
Member

Suggested change
-   return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
+   return new_dd_object(graph, b, meta, divisions)

I don't think we need this, but I may be missing something.

Contributor Author

Similar issue here; meta can be too generic:

> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(5535)apply_concat_apply()
(Pdb) dfs[0]._meta
Series([], dtype: int64)
(Pdb) meta
1

    )
    warnings.warn(meta_warning(meta))

    kwds.update({"parent_meta": self._meta})
Member

It would be nice if we didn't need to pass parent metadata into map_partitions since we could easily use the _meta on the first _Frame object within that function.

Contributor Author

The reason map_partitions would need a parent_meta is that the dfs list of _Frame objects can be empty too:

dask/dask/dataframe/core.py

Lines 5591 to 5594 in 640df6b

    args = _maybe_from_pandas(args)
    args = _maybe_align_partitions(args)
    dfs = [df for df in args if isinstance(df, _Frame)]
    meta_index = getattr(make_meta(dfs[0]), "index", None) if dfs else None

    token=keyname,
    enforce_metadata=False,
    meta=(q, "f8"),
    parent_meta=self._meta,
Member

Another case where I'd like to avoid passing parent_meta if we can avoid it.

Contributor Author

This case is an example where meta is (q, 'f8') and we cannot really know what the parent's meta is.

    )

-   other_meta = make_meta(other)
+   other_meta = make_meta_util(other, parent_meta=self._parent_meta)
@rjzamora (Member) commented May 20, 2021

Can we not pass in parent_meta=self._meta here? Is the Scalar meta still too "general"?

@galipremsagar (Contributor Author) commented May 23, 2021

Yes, for some scalars the meta seems to be too generic (like int), hence the need to pass self._parent_meta here:

for example

> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(269)_scalar_binary()
(Pdb) self._meta
1
(Pdb) self._parent_meta
Series([], dtype: float64)

        meta = parent_meta
    else:
-       meta = make_meta(meta)
+       meta = make_meta_util(meta, parent_meta=parent_meta)
Member

Is there a good reason to pass parent_meta if a meta object is already provided? That is, shouldn't meta already be an appropriate DataFrame type? My intuition tells me that the above changes should just be swapping make_meta with make_meta_util.

Contributor Author

> That is, shouldn't meta already be an appropriate DataFrame type?

Sadly, nope. For example:

> /nvme/0/pgali/cudf/dask/dask/dataframe/io/io.py(603)from_delayed()
(Pdb) meta
[('a', 'f8'), ('b', 'f8'), ('c', 'f8'), ('d', 'f8')]
(Pdb) parent_meta
Empty DataFrame
Columns: [a, b, c, d]
Index: []
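To illustrate why a backend-neutral meta spec like the one above needs parent_meta, here is a hedged sketch. The helper meta_from_spec and its behavior are hypothetical, not dask's actual implementation: it materializes a [(name, dtype), ...] spec against the frame type of parent_meta (pandas here; in dask-cudf it could be a cudf.DataFrame).

```python
import pandas as pd

def meta_from_spec(spec, parent_meta=None):
    # Hypothetical helper: a [(name, dtype), ...] spec carries no backend
    # information, so use the type of parent_meta to build the empty meta
    # frame; default to pandas when no parent is available.
    frame_type = type(parent_meta) if parent_meta is not None else pd.DataFrame
    return frame_type({name: pd.Series(dtype=dtype) for name, dtype in spec})

spec = [("a", "f8"), ("b", "f8"), ("c", "f8"), ("d", "f8")]
meta = meta_from_spec(spec, parent_meta=pd.DataFrame())
print(list(meta.columns), len(meta))  # columns a-d, zero rows
```

The spec alone cannot tell pandas from cudf; only the parent's type can, which is the point being made in the comment above.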

@rjzamora (Member) left a comment

Thank you for answering my questions @galipremsagar - the changes here seem reasonable to me. Thanks again for attacking this!

@jakirkham (Member)

cc @jrbourbeau @jschendel (for thoughts as well)

galipremsagar and others added 2 commits May 25, 2021 11:40
Co-authored-by: jakirkham <jakirkham@gmail.com>
@quasiben (Member)

I brought up this PR at the dask maintainer meeting and want to give other folks an opportunity to comment. If we don't hear back by EOD, I'll merge it in if there are no additional comments.

@jakirkham (Member)

It would be good to hear back from people.

However, RAPIDS CI is currently broken on multiple projects without this change, PR ( rapidsai/cudf#8342 ), and PR ( rapidsai/dask-cuda#623 ). These are all needed as a consequence of PR ( #7503 ) and PR ( #7505 ) having been merged yesterday. The longer we wait, the more people will be blocked.

I would propose we go ahead and merge this and follow up on any concerns raised in a subsequent PR.

@quasiben (Member)

Agreed, there is some urgency here. What do you think about waiting another 30 minutes (until 2 PM EST)? I think this is a big enough change that we need to give a little bit of buffer for other folks in case they have concerns.

@jakirkham (Member)

Will defer to you.

Though it's worth noting this PR was originally submitted ~1 month ago, so it has already been around for a while.

@quasiben (Member)

Sorry for the overhead, and thank you for the patience. Perhaps I was overly concerned.

@quasiben quasiben merged commit d2020bf into dask:main May 25, 2021
@jakirkham (Member)

Thanks Ben 🙂

If people do find issues here, please let us know and we can follow up.

@martinfleis (Contributor)

Hi, I just found that this change breaks the way we register dtypes in dask-geopandas. What is the recommended way to rewrite it in a backwards-compatible manner? See the failure and the implementation here for a reference.

I think that this may happen in some other downstream projects as well.

@jakirkham (Member) commented May 26, 2021

Thanks for raising @martinfleis! This is how we are handling it ( rapidsai/cudf#8368 )

Edit: More details here ( geopandas/dask-geopandas#48 (comment) )

@galipremsagar (Contributor Author) commented May 26, 2021

> This is how we are handling it ( rapidsai/cudf#8368 )

This is how we would recommend handling it for backward compatibility with older versions of dask.

But in addition, dask-geopandas (or any downstream project) will have to register an implementation with the make_meta_obj dispatch, avoid invoking make_meta directly, and instead rely on make_meta_util.

@jakirkham (Member)

PR ( geopandas/dask-geopandas#47 ) updates dask-geopandas w.r.t. this change
