Tokenize dataclasses #8557

Merged
jsignell merged 12 commits into dask:main from gjoseph92:tokenize-dataclasses
Feb 24, 2022

Conversation

@gjoseph92
Collaborator

Split out from #8392

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92 gjoseph92 requested a review from jrbourbeau January 11, 2022 21:10
@gjoseph92 gjoseph92 self-assigned this Jan 11, 2022
@ian-r-rose
Collaborator

It's also probably worth checking this against #8376

@gjoseph92
Collaborator Author

Thanks @ian-r-rose, I have a feeling this PR will end up fixing that.

@gjoseph92
Collaborator Author

I spoke too soon, this doesn't fix #8376, because the thing being tokenized in that case is not a dataclass, but the bound method of a dataclass (totally different code path). I don't even know how much that even relates to dataclasses, versus just some issue in cloudpickle that could affect other things too. Seems like @scharlottej13 is already on the right track with that one.
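The distinction being drawn here can be seen in plain Python (a standalone sketch, not dask code): a bound method pulled off a dataclass instance is an ordinary `types.MethodType` object, so `is_dataclass` checks never match it, and tokenization takes a pickle-based path instead of field traversal.

```python
import dataclasses
import types

@dataclasses.dataclass
class Point:
    x: int

p = Point(1)
assert dataclasses.is_dataclass(p)          # the instance is a dataclass...

bound = p.__eq__                            # ...but its bound method is not
assert isinstance(bound, types.MethodType)
assert not dataclasses.is_dataclass(bound)  # so it hits a different code path
```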

@gjoseph92 gjoseph92 mentioned this pull request Jan 12, 2022
@jrbourbeau
Member

Looks like there was a hiccup in CI -- just pushed a commit which merges main to re-trigger CI (hope that's okay)

dask/base.py Outdated
# If dataclasses should not be compared, they should not be deterministically
# tokenized either, since the purpose of tokenization is comparison.
try:
comparable = obj.__dataclass_params__.eq
Member

I could be missing something, but I didn't see any documentation around __dataclass_params__ in the dataclasses docs. Is there some more public-facing functionality we could use here instead? I suppose with suitable test coverage __dataclass_params__ is fine, but I wanted to at least point this out.

Collaborator Author

I didn't love this either, since I do think it's an internal API. There doesn't appear to be a public API for getting this sort of information.

The other more-public option I was considering was type(obj).__dict__.get("__eq__"). This is testing whether the class itself (not parents) implements an __eq__ method, because:

eq: If true (the default), an __eq__() method will be generated
https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass

However, I didn't like this, because it'll be a false-negative on cases like:

@dataclass(eq=True)
class A:
    foo: str

class B(A):
    def some_extra_method(self): ...

In this case, dataclasses.is_dataclass(B) is True, since it inherits from one. But B doesn't have an __eq__ method in its own __dict__: it inherits it from A, rather than having one generated by the @dataclass decorator. Since it inherits that __eq__ method, we'd want tokenize on B to be deterministic, just as it is on A.

I suppose we could test for type(obj).__eq__ is not object.__eq__, indicating that something in the MRO defines its own __eq__ method.

But frankly, this feels to me like it's relying about as much on implementation details as obj.__dataclass_params__.eq, and it's quite a bit harder to read and understand.
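For concreteness, here is how the three candidate checks discussed above behave on the inherited-dataclass case (a standalone sketch):

```python
import dataclasses

@dataclasses.dataclass  # eq=True is the default
class A:
    foo: str

class B(A):  # still a dataclass via inheritance; __eq__ comes from A
    def some_extra_method(self): ...

b = B("foo")
assert dataclasses.is_dataclass(B)

# Internal API, but handles inheritance correctly:
assert b.__dataclass_params__.eq is True

# Checks only B's own __dict__, so it's a false negative here:
assert type(b).__dict__.get("__eq__") is None

# Walks the MRO, so it also handles inheritance, at the cost of readability:
assert type(b).__eq__ is not object.__eq__
```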


The last option is to just not care about __eq__, and deterministically tokenize dataclasses even if they're marked as eq=False. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.

Member

The last option is to just not care about eq, and deterministically tokenize dataclasses even if they're marked as eq=False. I think this would still be pretty reasonable. I'm kind of on the fence about whether it even matters.

Yeah, I wouldn't worry about this at all. tokenize isn't really about equality (and we don't mandate tokenize-able types implement equality anywhere else), I'd just skip this check.

Collaborator Author

Sure. I just felt like if someone went to the trouble of saying eq=False on their dataclass, there is probably a good reason for it. That reason is probably because the contents don't/shouldn't support equality, which means tokenize on those elements will probably be nondeterministic, so it'll probably be fine if we ignore eq=False and traverse anyway. I'd just prefer not making assumptions like this and respecting what users set explicitly.

I agree we don't care about it anywhere else, but tokenize also doesn't automatically support any other user-defined types. Up until now, any user-defined classes would just get nondeterministic tokens unless you explicitly added __dask_tokenize__, so this feels like a special case. And incorrectly giving something a deterministic token that shouldn't have one can cause errors that are very hard for users to debug. As far as I'm aware, tokenize doesn't deterministically support any types that don't support __eq__ right now.

tokenize isn't really about equality

To me it is though—it's about being able to tell if two keys represent equivalent results. Not even within one graph, but across all graphs in a session (caching uses it this way, at least).

Member

As far as I'm aware, tokenize doesn't deterministically support any types that don't support eq right now.

Sure it does. Numpy arrays, for example, can be equivalent, but override __eq__ to do their own thing, which means a == b isn't a valid equivalency check. Sometimes you may want to set eq=False to fall back to identity checks for equality, since the builtin dataclass comparison function would error:

In [1]: import dataclasses

In [2]: import numpy as np

In [3]: @dataclasses.dataclass
   ...: class Add:
   ...:     x: np.ndarray
   ...:     y: np.ndarray
   ...: 

In [4]: op1 = Add(np.array([1, 2]), np.array([3, 4]))

In [5]: op2 = Add(np.array([1, 2]), np.array([3, 4]))

In [6]: op1 == op2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b6827372ab5d> in <module>
----> 1 op1 == op2

<string> in __eq__(self, other)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

foo(x1) == foo(x2) may be related to bool(x1 == x2), but one doesn't imply the other (in either direction). Some objects don't even compare equal to themselves!

In [13]: np.nan  == np.nan
Out[13]: False

Forgetting the philosophical arguments above: we already traverse builtin collections & all dataclasses to find dask objects within them (viewing them more like declarative typed dicts than true user-defined classes). It's simpler to say "dask natively traverses and tokenizes builtin types & dataclasses" than to add a qualifier about tokenization for dataclasses that don't support __eq__. But I also don't expect this to ever really come up in a real user use case, so if you feel strongly about it, it's up to you.

Collaborator Author

Ok, I'm somewhat convinced: people may set eq=False just for convenience, even when the class logically should support equality, simply because the contents don't support boolean equality and they don't feel like writing a custom __eq__ method.

What do you think about fields specified as compare=False though?

Member

A field specified as compare=False may be there to hold something that doesn't affect equality (like a cache or something), so I could see an argument for skipping those when tokenizing things, but honestly I'd just do the dumb simple thing and traverse all fields for tokenization. If a user needs special behavior (some fields ignored, etc...), they can implement __dask_tokenize__ themselves. It's easy to explain, easy to document, and matches the traversal behavior we do elsewhere already.
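That escape hatch could look something like this (the `Record` class and its fields are hypothetical; `__dask_tokenize__` is dask's hook for customizing tokenization):

```python
import dataclasses
from typing import Any

@dataclasses.dataclass
class Record:
    key: str
    value: int
    # Excluded from __eq__, e.g. a memoization cache:
    _cache: Any = dataclasses.field(default=None, compare=False, repr=False)

    def __dask_tokenize__(self):
        # Return something deterministic that ignores the cache field,
        # so logically equal records get the same token.
        return (type(self).__name__, self.key, self.value)
```

Two records that differ only in `_cache` then compare equal and produce the same token.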

Collaborator Author

Ok, it's no longer caring about comparability at all.
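The resulting behavior amounts to treating a dataclass instance like a typed dict: traverse every field unconditionally. A minimal sketch of that idea (a hypothetical helper, not dask's actual normalization code):

```python
import dataclasses

def normalize_dataclass(obj):
    """Reduce a dataclass instance to a deterministic, hashable structure."""
    assert dataclasses.is_dataclass(obj) and not isinstance(obj, type)
    # Traverse all fields, regardless of eq= or compare= settings:
    return (
        type(obj).__qualname__,
        tuple((f.name, getattr(obj, f.name)) for f in dataclasses.fields(obj)),
    )
```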

These are failing because cloudpickle protocol=0 can't pickle dataclasses. So we fall back to using `str(func)`, which is exactly what we wanted to avoid.
@gjoseph92
Collaborator Author

I added protocol=4 to this PR, but @scharlottej13 why don't we plan on merging #8527 first, just to have the change (and tests) separate?

@gjoseph92
Collaborator Author

Looks like there are at least 2 tests that were depending on protocol 0:

def test_name_consistent_across_instances():
    func = delayed(identity, pure=True)
    data = {"x": 1, "y": 25, "z": [1, 2, 3]}
    assert func(data)._key == "identity-02129ed1acaffa7039deee80c5da547c"
    data = {"x": 1, 1: "x"}
    assert func(data)._key == func(data)._key
    assert func(1)._key == "identity-ca2fae46a3b938016331acac1908ae45"


def test_tokenize_partial_func_args_kwargs_consistent():
    f = partial(f3, f2, c=f1)
    res = normalize_token(f)
    sol = (
        b"cdask.tests.test_base\nf3\np0\n.",
        (b"cdask.tests.test_base\nf2\np0\n.",),
        (("c", b"cdask.tests.test_base\nf1\np0\n."),),
    )
    assert res == sol

I can update them, but I do worry somewhat about who's depending on tokenization stability.

@ian-r-rose
Collaborator

I can update them, but I do worry somewhat about who's depending on tokenization stability.

I've gotta say, neither of those look like particularly good tests to me -- I'd consider long-term stability of tokenization not guaranteed, and anyone relying on it probably shouldn't. To me, a better test would be to ensure that the tokenization is repeatable, as the first one does, and that a slight variation in the args results in something different (like one that would have caught #8450).

@jcrist
Member

jcrist commented Feb 1, 2022

Yeah, we make no guarantees that updates to dask won't change the results from tokenize(x). The main things we worry about (in order of importance) are:

  • Runs of tokenize(x) in the same process always return the same result. We mainly care about this one.
  • Runs of tokenize(x) in different processes always return the same result (if possible). This means we avoid using things like hash(x) which has a randomized seed per process.
  • Runs of tokenize(x) in different python versions return the same result (if possible).
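The second point can be illustrated without dask: `hash()` of a str is salted per interpreter process (PYTHONHASHSEED), while a content digest like the md5-style keys in the tests above is stable across processes and Python versions. A minimal sketch of the idea (not dask's actual implementation):

```python
import hashlib

def token(payload: bytes) -> str:
    # Content-based digest: same input -> same token, in every process.
    return hashlib.md5(payload).hexdigest()

assert token(b"identity") == token(b"identity")

# By contrast, hash("identity") varies between interpreter runs unless
# PYTHONHASHSEED is pinned, so it can't serve as a cross-process token.
```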

@gjoseph92
Collaborator Author

Think this is good to merge

@jsignell
Member

Thanks for sticking with this @gjoseph92!

@jsignell jsignell merged commit 8971c37 into dask:main Feb 24, 2022
phobson pushed a commit to phobson/dask that referenced this pull request Feb 28, 2022