Fix meta creation in case of object (#7586)
Conversation
@galipremsagar apologies for letting this slip. @pentschev would you have time to review this?
pentschev
left a comment
Overall looks good, I've added a minor suggestion. However, DataFrame meta is a bit different from Array meta, which I'm not totally familiar with, so maybe would be good to get a review from someone more knowledgeable as well, perhaps @rjzamora ?
dask/dataframe/utils.py
Outdated

```python
def make_meta_util(x, index=None, parent_meta=None):
    import dask.dataframe as dd

    if isinstance(x, (dd.core.Series, dd.core.DataFrame)):
```
I think `dd.core.Series` and `dd.core.DataFrame` already have `_meta`, no? In that case this may be redundant and could be removed in favor of the condition immediately below: `if hasattr(x, "_meta")`.
Right, make_meta_object does something like the following to capture these cases:

```python
if hasattr(x, "_meta"):
    return x._meta
elif is_arraylike(x) and x.shape:
    return x[:0]
```

However, if that logic is already captured in make_meta_object (which is now registered with make_meta_obj), it doesn't seem like this check is necessary at all. Is this correct?
Good catch. I think your assessment is right, Rick. Unless I'm missing something, we could indeed remove both checks here.
Yes, this is a redundant check; I got rid of it in make_meta_object. The reason: since the make_meta dispatch was registered against the generic Python `object` type, all objects would simply pass through make_meta. With this change that won't happen, so it becomes the task of make_meta_util to handle this. The is_arraylike check, however, shouldn't be done in make_meta_util, as objects like pd.Series/cudf.Series need to go through the dispatch mechanism below.
rjzamora
left a comment
Thank you for working on this @galipremsagar !
I am sorry for being late to the party here, but I'd like to clarify the solution a bit before signing off...
My understanding is that the general @make_meta.register(object) definition in dask-cudf (which is registered upon import) is incorrectly taking over when cudf is not even in use. If the make_meta dispatching was being performed in the way that concat is used throughout Dask-Dataframe, then there would be a "middle-man" function where we could add a kwarg like parent_meta=. However, Dask-Dataframe is using the dispatch name directly (make_meta), and so you are effectively adding this "middle-man" function under the name make_meta_util (and using it throughout the code base). Is this correct?
If so, then I think this solution makes sense -- At least, I cannot think of a better way to do it :)
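The "middle-man" pattern described above can be sketched with a minimal, hypothetical example. The names `make_meta_dispatch` and the backend-selection logic here are illustrative only, not dask's actual internals:

```python
# Hypothetical sketch of the "middle-man" pattern: a plain wrapper function
# accepts extra keyword arguments (parent_meta) that a bare dispatch call
# could not, and decides how to dispatch. Names are illustrative.

def make_meta_dispatch(x, backend="pandas"):
    # Stand-in for the type-registered dispatch; a real Dispatch would look
    # up an implementation based on a registered type.
    return f"{backend} meta for {type(x).__name__}"

def make_meta_util(x, index=None, parent_meta=None):
    # Objects that already carry metadata need no dispatch at all.
    if hasattr(x, "_meta"):
        return x._meta
    # Use the parent's backend, not the (possibly generic) value itself,
    # to pick the backend-specific implementation.
    if parent_meta is not None:
        return make_meta_dispatch(x, backend=type(parent_meta).__module__.split(".")[0])
    return make_meta_dispatch(x)

print(make_meta_util(42))  # prints "pandas meta for int" (the default backend)
```

Because the wrapper is a plain function rather than a registered dispatch entry, it can be used throughout the code base without two libraries fighting over the same registration slot.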
dask/dataframe/utils.py
Outdated

```python
return x._meta

try:
    return make_meta(x, index=index)
```
Does this mean we still need to remove the make_meta.register(object) definition in dask-cudf before the original issue is resolved? Not a problem if so, I am just trying to understand everything.
Correct. We'll need dask-cudf-side changes to accommodate this new dispatch: `make_meta_object`.
dask/dataframe/utils.py
Outdated

```python
def make_meta_util(x, index=None, parent_meta=None):
    import dask.dataframe as dd

    if isinstance(x, (dd.core.Series, dd.core.DataFrame)):
```
Right, make_meta_object does something like the following to capture these cases:

```python
if hasattr(x, "_meta"):
    return x._meta
elif is_arraylike(x) and x.shape:
    return x[:0]
```

However, if that logic is already captured in make_meta_object (which is now registered with make_meta_obj), it doesn't seem like this check is necessary at all. Is this correct?
@rjzamora yes, you got it right. However, I'm adding a
rjzamora
left a comment
Thanks Prem! This is looking good.
I am wondering if we can make the changes a bit lighter if we: (1) avoid changing the new_dd_object function signature, (2) avoid making scipy a requirement, (3) avoid passing parent_meta to map_partitions, and (4) avoid passing parent_meta in places where an already-provided meta object is sufficient.
There is a perfectly good chance that I am misunderstanding the changes. So, feel free to push back on any of my comments/suggestions :)
```diff
-def new_dd_object(dsk, name, meta, divisions):
+def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
```
Suggested change:

```diff
-def new_dd_object(dsk, name, meta, divisions, parent_meta=None):
+def new_dd_object(dsk, name, meta, divisions):
```
Doesn't look like this change is necessary (I don't think parent_meta is used in new_dd_object)
Needed for this: https://github.com/dask/dask/pull/7586/files#r637579242
```python
    $META
    """
    name = kwargs.pop("token", None)
    parent_meta = kwargs.pop("parent_meta", None)
```
Are there expected cases where a user would need the option of passing this in? Is there an expected reason that the _meta of the first _Frame in args is not a good assumption?
This would be similar to the above one: the _Frame list can be all empty, or the args may have no _Frame at all, and the meta at hand is too generic.
```diff
     divisions = [None] * (split_out + 1)

-    return new_dd_object(graph, b, meta, divisions)
+    return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
```
Suggested change:

```diff
-    return new_dd_object(graph, b, meta, divisions, parent_meta=dfs[0]._meta)
+    return new_dd_object(graph, b, meta, divisions)
```
I don't think we need this, but I may be missing something.
Similar issue here, meta can be too generic:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(5535)apply_concat_apply()
(Pdb) dfs[0]._meta
Series([], dtype: int64)
(Pdb) meta
1
```

```python
    )
    warnings.warn(meta_warning(meta))

kwds.update({"parent_meta": self._meta})
```
It would be nice if we didn't need to pass parent metadata into map_partitions since we could easily use the _meta on the first _Frame object within that function.
The reason map_partitions would need a parent_meta is that the dfs list of frame objects can be empty too:
Lines 5591 to 5594 in 640df6b:

```python
    token=keyname,
    enforce_metadata=False,
    meta=(q, "f8"),
    parent_meta=self._meta,
```
Another case where I'd like to avoid passing parent_meta if we can avoid it.
This case is an example where meta is (q, 'f8') and we cannot really know what the parent's meta is.
```diff
 )

-other_meta = make_meta(other)
+other_meta = make_meta_util(other, parent_meta=self._parent_meta)
```
Can we not pass in parent_meta=self._meta here? Is the Scalar meta still too "general"?
Yes, for some scalars the meta seems to be too generic, like int, hence the need to pass self._parent_meta here. For example:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/core.py(269)_scalar_binary()
(Pdb) self._meta
1
(Pdb) self._parent_meta
Series([], dtype: float64)
```
```diff
     meta = parent_meta
 else:
-    meta = make_meta(meta)
+    meta = make_meta_util(meta, parent_meta=parent_meta)
```
Is there a good reason to pass parent_meta if a meta object is already provided? That is, shouldn't meta already be an appropriate DataFrame type? My intuition tells me that the above changes should just be swapping make_meta with make_meta_util.
> That is, shouldn't meta already be an appropriate DataFrame type?

Sadly, nope. For example:

```
> /nvme/0/pgali/cudf/dask/dask/dataframe/io/io.py(603)from_delayed()
(Pdb) meta
[('a', 'f8'), ('b', 'f8'), ('c', 'f8'), ('d', 'f8')]
(Pdb) parent_meta
Empty DataFrame
Columns: [a, b, c, d]
Index: []
```
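The point above can be illustrated with a short standalone sketch (assuming pandas is available): a meta "spec" such as a list of (column, dtype) tuples carries no backend information at all, whereas the parent_meta object's concrete type does.

```python
# A meta "spec" like a list of (column, dtype) tuples is backend-agnostic:
# the identical spec could describe a pandas or a cudf DataFrame, so the
# spec alone cannot select a backend. The parent_meta object's concrete
# type, by contrast, pins the backend down.

import pandas as pd

meta_spec = [("a", "f8"), ("b", "f8"), ("c", "f8"), ("d", "f8")]
parent_meta = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in meta_spec}
)

print(type(meta_spec))    # <class 'list'> -- no backend information
print(type(parent_meta))  # <class 'pandas.core.frame.DataFrame'> -- backend is clear
```

This is why the dispatch has to consult parent_meta even when a meta argument was already provided.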
rjzamora
left a comment
Thank you for answering my questions @galipremsagar - the changes here seem reasonable to me. Thanks again for attacking this!
Co-authored-by: jakirkham <jakirkham@gmail.com>
cc @jrbourbeau @jschendel (for thoughts as well)
I brought up this PR at the dask maintainer meeting and want to give other folks an opportunity to comment. If we don't hear back by EOD I'll merge it in if there are no additional comments.
It would be good to hear back from people. However, RAPIDS CI is broken on multiple projects atm without this change, PR ( rapidsai/cudf#8342 ), and PR ( rapidsai/dask-cuda#623 ). These are all needed as a consequence of PR ( #7503 ) and PR ( #7505 ) having been merged yesterday. The longer the wait, the more people will be blocked. I would propose we go ahead and merge this and follow up on any concerns raised in a subsequent PR.
Agreed, there is some urgency here. What do you think about waiting another 30 minutes (until 2PM EST)? I think this is a big enough change that we need to give a little bit of buffer for other folks in case they have concerns.
Will defer to you. Though it's worth noting this PR was originally submitted ~1 month ago, so it has already been around for a while.
Sorry for the overhead and thank you for the patience. Perhaps I was overly concerned.
Thanks Ben 🙂 If people do find issues here, please let us know and we can follow up.
Thanks for raising @martinfleis! This is how we are handling it ( rapidsai/cudf#8368 ). Edit: more details here ( geopandas/dask-geopandas#48 (comment) ).
This is how we would recommend handling it for backward compatibility with older versions of dask. But, in addition
PR ( geopandas/dask-geopandas#47 ) updates dask-geopandas w.r.t. this change.
There is an issue, rapidsai/cudf#7946, where metadata creation ends up depending purely on the order of importing a backend instead of the correct backend itself.

The root cause of the above issue is that we have the `make_meta` dispatch registered both in dask and dask-cudf against the same type, `object`, and the Dispatch class will end up storing only the function of the last/most-recently registered dispatch (since it's a simple dict). In this PR I have made a new utility function by the name `make_meta_util`, and a new dispatch that is responsible for object meta creation but is registered against the `object` type of a specific backend. This way we can guarantee the correct metadata is generated and, in turn, the right backend APIs are invoked.

One additional thing we have to do to facilitate this change is pass in `parent_meta` instead of `meta`, because we want to know the real API backend to invoke. Some pandas APIs return numpy objects, which are then stored as `meta` and will not really help us determine which backend to hit; but when we store/pass `parent_meta` to this utility, it helps us correctly determine the backend that is needed and dispatch accordingly.

cc: @jakirkham @quasiben @rjzamora @beckernick @kkraus14
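The registration conflict described above can be reproduced with a minimal stand-in for a dict-backed dispatch. This is a sketch, not dask's actual Dispatch implementation, but it shows why "last registration wins" ties the chosen backend to import order:

```python
# Minimal sketch of a dict-backed dispatch: the lookup is a plain dict,
# so when two libraries register against the same generic ``object`` type,
# the most recent registration silently overwrites the earlier one.

class Dispatch:
    def __init__(self):
        self._lookup = {}

    def register(self, typ):
        def wrapper(func):
            self._lookup[typ] = func  # overwrites any earlier entry for typ
            return func
        return wrapper

    def __call__(self, arg):
        # Walk the MRO so subclasses find registrations on base classes.
        for cls in type(arg).__mro__:
            if cls in self._lookup:
                return self._lookup[cls](arg)
        raise TypeError(f"no dispatch registered for {type(arg)}")

make_meta = Dispatch()

@make_meta.register(object)
def make_meta_pandas(x):  # registered when dask.dataframe is imported
    return "pandas-backed meta"

@make_meta.register(object)
def make_meta_cudf(x):  # registered later when dask-cudf is imported
    return "cudf-backed meta"

# Import order, not the input's type, now decides the backend:
print(make_meta(42))  # prints "cudf-backed meta"
```

Registering the object-handling path against each backend's own types, and routing everything through a plain wrapper function, avoids this single shared `object` slot.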