Update comparison logic for worker state #3321
Conversation
Thanks. Do you think this will actually fix #3256? This being the cause was a shot in the dark. And do you know if / how ...
This may just address a symptom and not the root problem, but it seems like a good starting point considering the log messages @bnaul was seeing.
Ok. Do you know how ...
I think the issue is that WorkerState falls back to default (identity) equality, so two instances never compare equal:

In [15]: from distributed.scheduler import WorkerState
In [16]: a = WorkerState()
In [17]: b = WorkerState()
In [18]: a == b
Out[18]: False

This may not be a problem if WorkerState instances are expected to be singletons per worker, which I suspect they are. But it does mean that this likely isn't fixing the underlying issue.
There is another occurrence of the same identity comparison here: distributed/scheduler.py, line 2560 in d747f63. While I have no proof, I could totally imagine that race conditions between ...
@TomAugspurger it looks like ...
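For context, the distinction at stake between an identity check and an equality check — a generic sketch with a hypothetical stand-in class, not the actual scheduler code at that line:

class State:
    # Hypothetical stand-in for WorkerState with attribute-based equality.
    def __init__(self, name, host):
        self.name = name
        self.host = host

    def __eq__(self, other):
        return type(self) == type(other) and (self.name, self.host) == (other.name, other.host)

    def __hash__(self):
        return hash((self.name, self.host))

# Two records describing the same worker, e.g. before and after a reconnect:
old = State("worker-1", "10.0.0.1")
new = State("worker-1", "10.0.0.1")
print(old is new)  # False: distinct objects, so an identity comparison fails
print(old == new)  # True: attribute-based equality recognizes the same worker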
Rather than conditionals I usually do something like ... Also, if we're going to implement ...
Probably not at all important for this case, but doesn't this pattern kind of break subclassing? How about ...
Do we currently rely on ...
If they're of different classes then can they be equal? (but yeah, in truth it doesn't matter) Fair enough on hashability (assuming that it doesn't come up, which it may).
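The patterns discussed above aren't fully visible in the thread, but a common subclass-friendly shape for __eq__ (a sketch under that assumption, not the exact code that was proposed) looks like this:

class State:
    # Hypothetical example class; not the real WorkerState.
    def __init__(self, name, host):
        self.name = name
        self.host = host

    def __eq__(self, other):
        # Returning NotImplemented (rather than False) lets Python try the
        # reflected comparison other.__eq__(self), keeping == symmetric
        # when subclasses are involved.
        if not isinstance(other, State):
            return NotImplemented
        return (self.name, self.host) == (other.name, other.host)

    def __hash__(self):
        return hash((self.name, self.host))

An isinstance check lets a subclass instance compare equal to a base instance, while type(self) == type(other) rules that out; which of the two is wanted is exactly the subclassing question raised above.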
Define __eq__ and __hash__ for WorkerState
@mrocklin you were right that hashability is needed, so we added that as well and changed ...

@TomAugspurger anecdotally we haven't seen #3256 since we started running with this change, but I wouldn't say we've conclusively proven it's resolved, since unfortunately it's a random, unpredictable failure. Regardless, this seems like the "right" check to be performing and should help with possible race conditions as pointed out by @StephanErb, so my vote would be to merge this once everyone's happy with it and close #3256 until someone observes that same behavior again.
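On the hashability point: in Python 3, a class that defines __eq__ without __hash__ gets __hash__ set to None, so its instances can no longer be placed in sets or used as dict keys. A minimal demonstration:

class OnlyEq:
    # Defines __eq__ but not __hash__, which implicitly sets __hash__ = None.
    def __eq__(self, other):
        return isinstance(other, OnlyEq)

try:
    {OnlyEq()}  # sets require hashable elements
except TypeError as exc:
    print(exc)  # unhashable type: 'OnlyEq'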
Has anyone seen this CI failure before? I've restarted that build.
Not recently, but yes, it used to be an intermittent failure.
On Tue, Feb 11, 2020 at 5:23 AM Tom Augspurger wrote:
Has anyone seen this CI failure before?
=================================== FAILURES ===================================
____________________________ test_pickle_functions _____________________________
def test_pickle_functions():
    def make_closure():
        value = 1

        def f(x):  # closure
            return x + value

        return f

    def funcs():
        yield make_closure()
        yield (lambda x: x + 1)
        yield partial(add, 1)

    for func in funcs():
        wr = weakref.ref(func)
        func2 = loads(dumps(func))
        wr2 = weakref.ref(func2)
        assert func2(1) == func(1)
        del func, func2
        gc.collect()
>       assert wr() is None
E       AssertionError: assert <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> is None
E        +  where <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> = <weakref at 0x7f0a8cc05350; to 'function' at 0x7f0ab6f535f0 (<lambda>)>()

distributed/protocol/tests/test_pickle.py:47: AssertionError
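For context on what the flaky assertion checks: once the last strong reference to the function is deleted and garbage is collected, a weak reference to it should be dead. A minimal sketch of the same pattern:

import gc
import weakref

func = lambda x: x + 1
wr = weakref.ref(func)
assert wr() is func  # the weakref resolves while a strong reference exists
del func
gc.collect()         # collect any reference cycles still holding the function
assert wr() is None  # no strong references remain, so the weakref is dead

The intermittent failure means something still held a strong reference to one of the test's functions at collection time.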
OK, thanks. We'll merge this and keep an eye out for that failure again. Thanks @rockwellw!
        return hash((self.name, self.host))

    def __eq__(self, other):
        return type(self) == type(other) and hash(self) == hash(other)
Sorry for being late to the party. There is a bug in here: it is not guaranteed that two WorkerState objects are actually the same even though their hash values are the same. Hashing is not an injective function.
I think we need to go the extra route here and check (self.name, self.host) == (other.name, other.host).
@TomAugspurger @bnaul please see above.
While I totally agree that hash collisions are very, very unlikely, Dask has a massive install base with sometimes hundreds to thousands of workers per cluster. So as far as we know, the once-in-a-million event could happen next Tuesday. 🤷‍♂️
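A sketch of the fix being suggested — comparing the identifying attributes directly instead of their hashes (an illustrative fragment, not necessarily the patch as merged):

class WorkerState:
    # Stand-in showing only the methods under discussion.
    def __init__(self, name, host):
        self.name = name
        self.host = host

    def __hash__(self):
        return hash((self.name, self.host))

    def __eq__(self, other):
        # Equal hashes do not imply equal objects (hashing is not injective),
        # so compare the identifying attributes themselves.
        return (
            type(self) == type(other)
            and (self.name, self.host) == (other.name, other.host)
        )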
Addresses #3256, based on @TomAugspurger's comment.