Regression in work stealing for blacklisted fast tasks #3591
mrocklin merged 2 commits into dask:master
Conversation
(force-pushed d6bfd16 to b6995a3)
I believe the build failures are unrelated. There is one failure connected to work stealing on Windows Python 3.7, but so far I couldn't reproduce it. Maybe flaky? Are there tests known to be flaky in the …
Thank you for taking the time to track this down @fjetter!
I think we need a similar change here
distributed/stealing.py, line 87 (at 2acffc3)
Re: test failures, test_dont_steal_unknown_functions is a known flaky test (xref #3574)
EDIT: There are also some known TLS test failures right now (xref #3588)
(force-pushed b6995a3 to a6d924e)
I added the prefix.name patch to the line you mentioned, but I couldn't add a test for it. In fact, I would argue it is unreachable code, since the only place where we populate this dictionary is distributed/stealing.py, line 140 (at 0d64f3a), but that line assumes that ts.processing_on is None, which is an invalid state: all stealing dicts are only populated for tasks in state processing, and as soon as a task is transitioned away from processing we remove it from stealable.

Is there something I'm missing? Can this be removed? (Other PR, of course)
distributed/tests/test_steal.py (outdated)

    await wait(futures)
    ...
    mmock.assert_not_called()
This test is highly dependent on the internal API. Historically we've tried to avoid tests like this, because they make changing internal code really hard (future developers have to understand and change every test that depends on the internal API). Instead, I wonder if you could use something like c.map(..., workers=[...], allow_other_workers=True) and then check that none of the other workers got any data. That only requires that we keep the worker.data API, which I think is probably more stable than the move_task_request API.

Also, do we need the timeout=1000 here, or was that just for debugging?
Thanks, I already suspected this would become an issue during review :)
I didn't know about the allow_other_workers flag; I'll rewrite the test, ofc.
- No mocks

> Also, do we need the timeout=1000 here, or was this just for debugging?

Yes, that was for debugging. No timeout should be necessary here.
- No timeouts
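A rewrite along those lines might look roughly like this — a sketch, not the test that actually landed. It assumes distributed's test utilities `gen_cluster` and `slowinc`, and relies on `shuffle-split` being a prefix in the scheduler's `fast_tasks` blacklist:

```python
from distributed import wait
from distributed.utils_test import gen_cluster, slowinc


@gen_cluster(client=True)
async def test_blacklisted_prefix_not_stolen(c, s, a, b):
    # Pin every task to worker `a`; with allow_other_workers=True the
    # only way results end up on `b` is through work stealing.
    futures = c.map(
        slowinc,
        range(10),
        # "shuffle-split" is a blacklisted fast-task prefix, so these
        # tasks should never be stolen.
        key=[f"shuffle-split-{i}" for i in range(10)],
        workers=[a.address],
        allow_other_workers=True,
    )
    await wait(futures)
    # Assert through the public worker.data mapping instead of
    # internal stealing APIs: the other worker received nothing.
    assert len(b.data) == 0
```

This avoids both the mock and the internal `move_task_request` API: the only surface it depends on is `Client.map` and `worker.data`.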
If this is true then yes, it would be good to remove this code. It might be worth checking out …
Thank you for identifying and resolving this issue @fjetter. I imagine it was not trivial to track down.
I'll try to investigate the history and come back with what I find.
Indeed, this one was uncomfortable. We do not really instrument the work stealing, which made me fall back to watching dashboards, grepping logs, and pulling scheduler/worker state with …
(force-pushed a6d924e to 5240344)
(force-pushed 5240344 to 58e0e8a)
Thanks @fjetter. This is in.
This is a regression which dates back to 2.9.1.

With the introduction of the TaskPrefix code, work stealing for shuffle operations behaves drastically differently (not for the better). This can be traced back to the `split in fast_tasks` check, which fails because `split` is now a `TaskPrefix` object instead of the string `shuffle-split` stored in the `fast_tasks` dict.
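The root cause can be sketched in a few lines. The `TaskPrefix` class here is a simplified stand-in for distributed's real one, just to show the membership check:

```python
class TaskPrefix:
    """Simplified stand-in for distributed's TaskPrefix class."""

    def __init__(self, name):
        self.name = name


# The fast-task blacklist is keyed by *string* prefix names.
fast_tasks = {"shuffle-split"}

prefix = TaskPrefix("shuffle-split")

# Regression: a TaskPrefix instance never equals a string, so after the
# refactor the blacklist check silently stopped matching and fast
# shuffle-split tasks became eligible for stealing again.
assert (prefix in fast_tasks) is False

# Fix: compare by the prefix's name attribute.
assert prefix.name in fast_tasks
```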