Tokenize dataclasses #8557

Merged
jsignell merged 12 commits into dask:main from gjoseph92:tokenize-dataclasses
Feb 24, 2022

Conversation

@gjoseph92
Collaborator

Split out from #8392

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92 gjoseph92 requested a review from jrbourbeau January 11, 2022 21:10
@gjoseph92 gjoseph92 self-assigned this Jan 11, 2022
@ian-r-rose
Collaborator

It's also probably worth checking this against #8376

@gjoseph92
Collaborator Author

Thanks @ian-r-rose, I have a feeling this PR will end up fixing that.

@gjoseph92
Collaborator Author

I spoke too soon, this doesn't fix #8376, because the thing being tokenized in that case is not a dataclass, but the bound method of a dataclass (totally different code path). I don't even know how much that even relates to dataclasses, versus just some issue in cloudpickle that could affect other things too. Seems like @scharlottej13 is already on the right track with that one.
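The distinction being drawn here can be seen in plain Python (a standalone sketch, not dask code): a bound method pulled off a dataclass instance is an ordinary `types.MethodType` object, so `is_dataclass` checks never match it, and tokenization takes a pickle-based path instead of field traversal.

```python
import dataclasses
import types

@dataclasses.dataclass
class Point:
    x: int

p = Point(1)
assert dataclasses.is_dataclass(p)          # the instance is a dataclass...

bound = p.__eq__                            # ...but its bound method is not
assert isinstance(bound, types.MethodType)
assert not dataclasses.is_dataclass(bound)  # so it hits a different code path
```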

@gjoseph92 gjoseph92 mentioned this pull request Jan 12, 2022
@jrbourbeau
Member

Looks like there was a hiccup in CI -- just pushed a commit which merges main to re-trigger CI (hope that's okay)

dask/base.py Outdated
# If dataclasses should not be compared, they should not be deterministically
# tokenized either, since the purpose of tokenization is comparison.
try:
comparable = obj.__dataclass_params__.eq
Member

I could be missing something, but I didn't see any documentation around __dataclass_params__ in the dataclasses docs. Is there some more public-facing functionality we could use here instead? I suppose with suitable test coverage __dataclass_params__ is fine, but I wanted to at least point this out.

Collaborator Author

I didn't love this either, since I do think it's an internal API. There doesn't appear to be a public API for getting this sort of information.

The other more-public option I was considering was type(obj).__dict__.get("__eq__"). This is testing whether the class itself (not parents) implements an __eq__ method, because:

eq: If true (the default), an __eq__() method will be generated
https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass

However, I didn't like this, because it'll be a false-negative on cases like:

@dataclass(eq=True)
class A:
    foo: str

class B(A):
    def some_extra_method(self): ...

In this case, dataclasses.is_dataclass(B) is True, since it inherits from one. But B doesn't have an __eq__ method in its own __dict__: it inherits it from A, rather than having one generated by the @dataclass decorator. Since it inherits that __eq__ method, we'd want tokenize on B to be deterministic, just as it is on A.

I suppose we could test for type(obj).__eq__ is not object.__eq__, indicating that something in the MRO defines its own __eq__ method.

But frankly, this feels to me like it's relying about as much on implementation details as obj.__dataclass_params__.eq, and it's quite a bit harder to read and understand.
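For concreteness, here is how the three candidate checks discussed above behave on the inherited-dataclass case (a standalone sketch):

```python
import dataclasses

@dataclasses.dataclass  # eq=True is the default
class A:
    foo: str

class B(A):  # still a dataclass via inheritance; __eq__ comes from A
    def some_extra_method(self): ...

b = B("foo")
assert dataclasses.is_dataclass(B)

# Internal API, but handles inheritance correctly:
assert b.__dataclass_params__.eq is True

# Checks only B's own __dict__, so it's a false negative here:
assert type(b).__dict__.get("__eq__") is None

# Walks the MRO, so it also handles inheritance, at the cost of readability:
assert type(b).__eq__ is not object.__eq__
```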


The last option is to just not care about __eq__, and deterministically tokenize dataclasses even if they're marked as eq=False. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.

Member

The last option is to just not care about eq, and deterministically tokenize dataclasses even if they're marked as eq=False. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.

Yeah, I wouldn't worry about this at all. tokenize isn't really about equality (and we don't mandate tokenize-able types implement equality anywhere else), I'd just skip this check.

Collaborator Author

Sure. I just felt like if someone went to the trouble of saying eq=False on their dataclass, there is probably a good reason for it. That reason is probably because the contents don't/shouldn't support equality, which means tokenize on those elements will probably be nondeterministic, so it'll probably be fine if we ignore eq=False and traverse anyway. I'd just prefer not making assumptions like this and respecting what users set explicitly.

I agree we don't care about it anywhere else, but tokenize also doesn't automatically support any other user-defined types. Up until now, any user-defined classes would just get nondeterministic tokens unless you explicitly added __dask_tokenize__, so this feels like a special case. And incorrectly giving something a deterministic token that shouldn't have one can cause errors that are very hard for users to debug. As far as I'm aware, tokenize doesn't deterministically support any types that don't support __eq__ right now.

tokenize isn't really about equality

To me it is though—it's about being able to tell if two keys represent equivalent results. Not even within one graph, but across all graphs in a session (caching uses it this way, at least).

Member

As far as I'm aware, tokenize doesn't deterministically support any types that don't support eq right now.

Sure it does. Numpy arrays, for example, can be equivalent, but override __eq__ to do their own thing, which means a == b isn't a valid equivalency check. Sometimes you may want to set eq=False to fall back to identity checks for equality, since the builtin dataclass comparison function would error:

In [1]: import dataclasses

In [2]: import numpy as np

In [3]: @dataclasses.dataclass
   ...: class Add:
   ...:     x: np.ndarray
   ...:     y: np.ndarray
   ...: 

In [4]: op1 = Add(np.array([1, 2]), np.array([3, 4]))

In [5]: op2 = Add(np.array([1, 2]), np.array([3, 4]))

In [6]: op1 == op2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b6827372ab5d> in <module>
----> 1 op1 == op2

<string> in __eq__(self, other)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

foo(x1) == foo(x2) may be related to bool(x1 == x2), but one doesn't imply the other (in either direction). Some objects don't even compare equal to themselves!

In [13]: np.nan  == np.nan
Out[13]: False

Forgetting the philosophical arguments above: we already traverse builtin collections & all dataclasses to find dask objects within them (viewing them more like declarative typed dicts than true user-defined classes). It's simpler to say "dask natively traverses and tokenizes builtin types & dataclasses" than to add a qualifier about tokenization for dataclasses that don't support __eq__. But I also don't expect this to ever really come up in a real user use case, so if you feel strongly about it, it's up to you.

Collaborator Author

Ok, I'm somewhat convinced: people may set eq=False just for convenience, even when the class logically should support equality, simply because the contents don't support boolean equality and they don't feel like writing a custom __eq__ method.

What do you think about fields specified as compare=False though?

Member

A field specified as compare=False may be there to hold something that doesn't affect equality (like a cache or something), so I could see an argument for skipping those when tokenizing things, but honestly I'd just do the dumb simple thing and traverse all fields for tokenization. If a user needs special behavior (some fields ignored, etc...), they can implement __dask_tokenize__ themselves. It's easy to explain, easy to document, and matches the traversal behavior we do elsewhere already.
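That escape hatch could look something like this (the `Record` class and its fields are hypothetical; `__dask_tokenize__` is dask's hook for customizing tokenization):

```python
import dataclasses
from typing import Any

@dataclasses.dataclass
class Record:
    key: str
    value: int
    # Excluded from __eq__, e.g. a memoization cache:
    _cache: Any = dataclasses.field(default=None, compare=False, repr=False)

    def __dask_tokenize__(self):
        # Return something deterministic that ignores the cache field,
        # so logically equal records get the same token.
        return (type(self).__name__, self.key, self.value)
```

Two records that differ only in `_cache` then compare equal and produce the same token.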

Collaborator Author

Ok, it's no longer caring about comparability at all.
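The resulting behavior amounts to treating a dataclass instance like a typed dict: traverse every field unconditionally. A minimal sketch of that idea (a hypothetical helper, not dask's actual normalization code):

```python
import dataclasses

def normalize_dataclass(obj):
    """Reduce a dataclass instance to a deterministic, hashable structure."""
    assert dataclasses.is_dataclass(obj) and not isinstance(obj, type)
    # Traverse all fields, regardless of eq= or compare= settings:
    return (
        type(obj).__qualname__,
        tuple((f.name, getattr(obj, f.name)) for f in dataclasses.fields(obj)),
    )
```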

These are failing because cloudpickle protocol=0 can't pickle dataclasses. So we fall back to using `str(func)`, which is exactly what we wanted to avoid.
@gjoseph92
Collaborator Author

I added protocol=4 to this PR, but @scharlottej13 why don't we plan on merging #8527 first, just to have the change (and tests) separate?

@gjoseph92
Collaborator Author

Looks like there are at least 2 tests that were depending on protocol 0:

def test_name_consistent_across_instances():
    func = delayed(identity, pure=True)
    data = {"x": 1, "y": 25, "z": [1, 2, 3]}
    assert func(data)._key == "identity-02129ed1acaffa7039deee80c5da547c"
    data = {"x": 1, 1: "x"}
    assert func(data)._key == func(data)._key
    assert func(1)._key == "identity-ca2fae46a3b938016331acac1908ae45"


def test_tokenize_partial_func_args_kwargs_consistent():
    f = partial(f3, f2, c=f1)
    res = normalize_token(f)
    sol = (
        b"cdask.tests.test_base\nf3\np0\n.",
        (b"cdask.tests.test_base\nf2\np0\n.",),
        (("c", b"cdask.tests.test_base\nf1\np0\n."),),
    )
    assert res == sol

I can update them, but I do worry somewhat about who's depending on tokenization stability.

@ian-r-rose
Collaborator

I can update them, but I do worry somewhat about who's depending on tokenization stability.

I've gotta say, neither of those look like particularly good tests to me -- I'd consider long-term stability of tokenization not guaranteed, and anyone relying on it probably shouldn't. To me, a better test would be to ensure that the tokenization is repeatable, as the first one does, and that a slight variation in the args results in something different (like one that would have caught #8450).

@jcrist
Member

jcrist commented Feb 1, 2022

Yeah, we make no guarantees that updates to dask won't change the results from tokenize(x). The main things we worry about (in order of importance) are:

  • Runs of tokenize(x) in the same process always return the same result. We mainly care about this one.
  • Runs of tokenize(x) in different processes always return the same result (if possible). This means we avoid using things like hash(x) which has a randomized seed per process.
  • Runs of tokenize(x) in different python versions return the same result (if possible).
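The second point can be illustrated without dask: `hash()` of a str is salted per interpreter process (PYTHONHASHSEED), while a content digest like the md5-style keys in the tests above is stable across processes and Python versions. A minimal sketch of the idea (not dask's actual implementation):

```python
import hashlib

def token(payload: bytes) -> str:
    # Content-based digest: same input -> same token, in every process.
    return hashlib.md5(payload).hexdigest()

assert token(b"identity") == token(b"identity")

# By contrast, hash("identity") varies between interpreter runs unless
# PYTHONHASHSEED is pinned, so it can't serve as a cross-process token.
```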

@gjoseph92
Collaborator Author

Think this is good to merge

@jsignell
Member

Thanks for sticking with this @gjoseph92!

@jsignell jsignell merged commit 8971c37 into dask:main Feb 24, 2022
phobson pushed a commit to phobson/dask that referenced this pull request Feb 28, 2022