Fix for normalize_function dataclass methods#8527
Conversation
scharlottej13
commented
Jan 3, 2022
- see what fails w/ this change (see pure delayed gives wrong result for dataclass methods #8376)
|
The failed checks seem limited to python 3.7, so this change seems 'safe'? Would still love to know why |
ian-r-rose
left a comment
There was a problem hiding this comment.
This seems fairly safe to me, but we should add a test for the original issue to prevent regressions.
dask/base.py
Outdated
| import cloudpickle | ||
|
|
||
| return cloudpickle.dumps(func, protocol=0) | ||
| return cloudpickle.dumps(func, protocol=None) |
There was a problem hiding this comment.
We want to make sure the version of pickle used for tokenizing is identical across all versions of python that we care about. In this case specifying protocol=None means to use the default version of pickle for the provided python, which can change in a new release (creating an incompatibility). I'd rather we hardcode protocol 4 (supported for python >= 3.4) here so that if a new python release makes protocol 5 the default things don't implicitly change.
There was a problem hiding this comment.
That makes sense, thanks for the explanation Jim! I'll change this and also add in a test per @ian-r-rose's suggestion (sorry for the delay on that!)
There was a problem hiding this comment.
From reading this thread on #8557 it seems this should be changed to protocol=4 in both cloudpickle.dumps calls
There was a problem hiding this comment.
actually, I think this test Gabe added addresses the original issue
There was a problem hiding this comment.
@scharlottej13 does it? I had thought it wasn't quite the same thing #8557 (comment)
There was a problem hiding this comment.
@gjoseph92 the test I was going to add was to create a new DataClass and then call delayed on a function in that class, but really that'd be testing if it's tokenizing correctly, right? I guess the part that might still be missing is testing with pure=True, ie this snippet with pure=False works with the current protocol=0, but doesn't with pure=True:
from dask import delayed
from dataclasses import dataclass, field
@dataclass(frozen=True)
class A:
param: float = field(repr=False)
def get_param(self):
return self.param
def get_delayed_param(self, *args, **kwargs):
return delayed(self.get_param)(*args, **kwargs)
(A(1).get_delayed_param() - A(0).get_delayed_param()).compute()There was a problem hiding this comment.
summary from chat w/ @gjoseph92-- it does seem that the test in #8557 will also test the underlying reason for the original issue (tokenizing DataClasses), but I'll add in a simple test that more explicitly addresses the original issue since there is the additional layer of having a bound method.
|
Seems like you've got some genuine failures here. |
Thanks for flagging @jsignell, I'll have a look! |
|
See some discussion in #8557 (comment). We'd have to change some tests for this to pass. I mostly agree with @ian-r-rose that these aren't very good tests. But they are explicitly testing that tokenization is stable across versions, which isn't a guarantee I knew we made. Since we'd be explicitly breaking that, we should be thoughtful about why there's a test for it first. |
|
Note @jcrist's comment on changing the pickle protocol: #8557 (comment). So we should be good to change the protocol and make those tests more sensible. |
Thanks @gjoseph92! Does this mean you're planning to change the tests in #8557, and we should wait for this PR until those are in? |
|
I was thinking we'd change the protocol and the tests here? |
Ah ok, sounds good, sorry I misunderstood that! |
|
@gjoseph92 the tests I'm seeing that need to be changed (from looking at CI failures) are: Is this right? From @jcrist's comment it seems like there are more tests we'd want to change as well, but I'm not exactly sure which. |
|
@scharlottej13 those are the only ones I know of (and the only ones that seem to be failing right now). I'd say, if it's green, we're good. |
|
It looks like there are maybe some test left to change on this and then it'll be good to go? |
yes! I can work on those today |
normalize_function dataclass methods
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @scharlottej13. One small style nit, but otherwise this looks good to go
dask/tests/test_base.py
Outdated
|
|
||
|
|
||
| def test_normalize_function_dataclass_field_no_repr(): | ||
| from dataclasses import field, make_dataclass |
There was a problem hiding this comment.
Nitpick: Could you move these imports to the top of this module along with the other dataclasses imports we already have
Line 8 in 3d0cd48
| b"\x80\x04\x95\x1f\x00\x00\x00\x00\x00\x00\x00\x8c\x14dask.tests.test_base\x94\x8c\x02f3\x94\x93\x94.", | ||
| ( | ||
| b"\x80\x04\x95\x1f\x00\x00\x00\x00\x00\x00\x00\x8c\x14dask.tests.test_base\x94\x8c\x02f2\x94\x93\x94.", | ||
| ), | ||
| ( | ||
| ( | ||
| "c", | ||
| b"\x80\x04\x95\x1f\x00\x00\x00\x00\x00\x00\x00\x8c\x14dask.tests.test_base\x94\x8c\x02f1\x94\x93\x94.", | ||
| ), | ||
| ), |
There was a problem hiding this comment.
I agree it would be nice to make this (and other similar tests) less brittle, but it seems there aren't any suggestions at the moment on how to do that. In my mind what you have here is brittle in the same way as what's already on main. Given this, I'm inclined to merge this in as is and improve tests in a follow up PR if folks think of a nice way to do so.
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @scharlottej13 -- this is in