Conversation
It's also probably worth checking this against #8376

Thanks @ian-r-rose, I have a feeling this PR will end up fixing that.

I spoke too soon, this doesn't fix #8376, because the thing being tokenized in that case is not a dataclass, but the bound method of a dataclass (totally different code path). I don't even know how much that relates to dataclasses, versus just some issue in cloudpickle that could affect other things too. Seems like @scharlottej13 is already on the right track with that one.

Looks like there was a hiccup in CI -- just pushed a commit which merges
dask/base.py (Outdated)

```python
# If dataclasses should not be compared, they should not be deterministically
# tokenized either, since the purpose of tokenization is comparison.
try:
    comparable = obj.__dataclass_params__.eq
```
I could be missing something, but I didn't see any documentation around `__dataclass_params__` in the `dataclasses` docs. Is there a more public-facing functionality we could use here instead? I suppose with suitable test coverage `__dataclass_params__` is fine, but wanted to at least point this out.
I didn't love this either, since I do think it's an internal API. There doesn't appear to be a public API for getting this sort of information.
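For reference, here's a quick sketch of what that internal attribute exposes. Note that `__dataclass_params__` is undocumented CPython internals, so this behavior is an observation about current versions, not a guarantee:

```python
import dataclasses

@dataclasses.dataclass
class WithEq:
    x: int

@dataclasses.dataclass(eq=False)
class WithoutEq:
    x: int

# The undocumented _DataclassParams object records the arguments that
# were passed to the @dataclass decorator.
print(WithEq.__dataclass_params__.eq)     # True
print(WithoutEq.__dataclass_params__.eq)  # False
```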
The other more-public option I was considering was `type(obj).__dict__.get("__eq__")`. This is testing whether the class itself (not its parents) implements an `__eq__` method, because:

> `eq`: If true (the default), an `__eq__()` method will be generated

https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
However, I didn't like this, because it'll be a false-negative on cases like:
```python
@dataclass(eq=True)
class A:
    foo: str

class B(A):
    def some_extra_method(self): ...
```

In this case, `dataclasses.is_dataclass(B)` is True, since it inherits from one. But it doesn't have an `__eq__` method in its own `__dict__`, since it just inherits it from `A`, versus being generated by `make_dataclass`. But since it inherits this `__eq__` method, we'd want `tokenize` on `B` to be deterministic, just as it is on `A`.
I suppose we could test for `type(obj).__eq__ is not object.__eq__`, indicating that something in the MRO defines its own `__eq__` method.
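To make the false-negative concrete, here's a sketch of how the two checks behave on that subclass (class names are just illustrative, mirroring the example above):

```python
import dataclasses
from dataclasses import dataclass

@dataclass(eq=True)
class A:
    foo: str

class B(A):
    def some_extra_method(self): ...

# B counts as a dataclass because it inherits from one
print(dataclasses.is_dataclass(B))    # True

# The generated __eq__ lives on A, not B, so the per-class check misses it
print("__eq__" in B.__dict__)         # False

# The MRO-based check still detects the inherited __eq__
print(B.__eq__ is not object.__eq__)  # True
```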
But frankly, this feels to me like it's relying about as much on implementation details as obj.__dataclass_params__.eq, and it's quite a bit harder to read and understand.
The last option is to just not care about __eq__, and deterministically tokenize dataclasses even if they're marked as eq=False. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.
> The last option is to just not care about `eq`, and deterministically tokenize dataclasses even if they're marked as `eq=False`. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.
Yeah, I wouldn't worry about this at all. tokenize isn't really about equality (and we don't mandate tokenize-able types implement equality anywhere else), I'd just skip this check.
Sure. I just felt like if someone went to the trouble of saying eq=False on their dataclass, there is probably a good reason for it. That reason is probably because the contents don't/shouldn't support equality, which means tokenize on those elements will probably be nondeterministic, so it'll probably be fine if we ignore eq=False and traverse anyway. I'd just prefer not making assumptions like this and respecting what users set explicitly.
I agree we don't care about it anywhere else, but tokenize also doesn't automatically support any other user-defined types. Up until now, any user-defined classes would just get nondeterministic tokens unless you explicitly added __dask_tokenize__, so this feels like a special case. And incorrectly giving something a deterministic token that shouldn't have one can cause errors that are very hard for users to debug. As far as I'm aware, tokenize doesn't deterministically support any types that don't support __eq__ right now.
> `tokenize` isn't really about equality
To me it is though—it's about being able to tell if two keys represent equivalent results. Not even within one graph, but across all graphs in a session (caching uses it this way, at least).
> As far as I'm aware, tokenize doesn't deterministically support any types that don't support `eq` right now.
Sure it does. Numpy arrays, for example, can be equivalent, but they override `__eq__` to do their own thing, which means `a == b` isn't a valid equivalency check. Sometimes you may want to set `eq=False` to fall back to identity checks for equality, since the builtin dataclass comparison function would error:
```
In [1]: import dataclasses

In [2]: import numpy as np

In [3]: @dataclasses.dataclass
   ...: class Add:
   ...:     x: np.ndarray
   ...:     y: np.ndarray
   ...:

In [4]: op1 = Add(np.array([1, 2]), np.array([3, 4]))

In [5]: op2 = Add(np.array([1, 2]), np.array([3, 4]))

In [6]: op1 == op2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b6827372ab5d> in <module>
----> 1 op1 == op2

<string> in __eq__(self, other)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```

`foo(x1) == foo(x2)` may be related to `bool(x1 == x2)`, but one doesn't imply the other (in either direction). Some objects don't even compare equal to themselves!

```
In [13]: np.nan == np.nan
Out[13]: False
```

Forgetting the philosophical arguments above - we already traverse builtin collections & all dataclasses to find dask objects within them (viewing them more like declarative typed dicts than true user-defined classes). It's simpler to say "dask natively traverses and tokenizes builtin types & dataclasses" than to add a qualifier for tokenization for dataclasses that don't support `__eq__`. But I also don't expect this to ever really come up in a real user use case, so if you feel strongly about it, it's up to you.
Ok, I'm somewhat convinced by setting eq=False just for convenience, even when logically it should support equality, just because the contents don't support boolean equality and you don't feel like writing a custom __eq__ method.
What do you think about fields specified as compare=False though?
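For context, a sketch of what `compare=False` on a field does (class and field names are illustrative): the field is still stored and tokenizable, but the generated `__eq__` skips it:

```python
import dataclasses

@dataclasses.dataclass
class Result:
    value: int
    # Excluded from the generated __eq__ (and ordering methods)
    _cache: dict = dataclasses.field(default_factory=dict, compare=False)

a = Result(1)
b = Result(1)
b._cache["hit"] = True

print(a == b)  # True -- _cache is ignored by the generated __eq__
```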
A field specified as compare=False may be there to hold something that doesn't affect equality (like a cache or something), so I could see an argument for skipping those when tokenizing things, but honestly I'd just do the dumb simple thing and traverse all fields for tokenization. If a user needs special behavior (some fields ignored, etc...), they can implement __dask_tokenize__ themselves. It's easy to explain, easy to document, and matches the traversal behavior we do elsewhere already.
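A sketch of that escape hatch, using the `__dask_tokenize__` protocol mentioned above (the class and field names here are hypothetical): the method returns only the state that should feed the token, so the cache field is excluded:

```python
import dataclasses

@dataclasses.dataclass
class Record:
    key: str
    value: int
    # Holds derived state; should not affect the token
    _cache: dict = dataclasses.field(default_factory=dict, compare=False)

    def __dask_tokenize__(self):
        # dask hashes whatever this returns; expose only the meaningful fields
        return (type(self).__name__, self.key, self.value)

r = Record("a", 1)
print(r.__dask_tokenize__())  # ('Record', 'a', 1)
```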
Ok, it's no longer caring about comparability at all.
These are failing because cloudpickle protocol=0 can't pickle dataclasses. So we fall back to using `str(func)`, which is exactly what we wanted to avoid.
I added
Looks like there are at least 2 tests that were depending on protocol 0: dask/dask/tests/test_delayed.py lines 542 to 550 in bdde5e1, and lines 224 to 232 in bdde5e1. I can update them, but I do worry somewhat about who's depending on tokenization stability.
I've gotta say, neither of those look like particularly good tests to me -- I'd consider long-term stability of tokenization to be not guaranteed, and anyone relying on it probably shouldn't. To me, a better test would be to ensure that the tokenization is repeatable, as the first one does, and that a slight variation in the args results in something different (which would have caught #8450).
Yeah, we make no guarantees that updates to
Think this is good to merge

Thanks for sticking with this @gjoseph92!
Split out from #8392
`pre-commit run --all-files`