Regression in work stealing for blacklisted fast tasks #3591
mrocklin merged 2 commits into dask:master
Conversation
(force-pushed d6bfd16 to b6995a3)
I believe the build failures are unrelated. There is one failure connected to work stealing on Windows Python 3.7, but so far I couldn't reproduce it. Maybe flaky? Are there tests known to be flaky in the …
Thank you for taking the time to track this down @fjetter!
I think we need a similar change here
distributed/stealing.py, line 87 (at 2acffc3)
Re: test failures, test_dont_steal_unknown_functions is a known flaky test (xref #3574)
EDIT: There are also some known TLS test failures right now (xref #3588)
(force-pushed b6995a3 to a6d924e)
I added the prefix.name patch to the line you mentioned, but I couldn't add a test for it. In fact, I would argue it is unreachable code, since the only place where we populate this dictionary is distributed/stealing.py, line 140 (at 0d64f3a), but that line assumes that ts.processing_on is None, which is an invalid state: all stealing dicts are only populated for tasks in state processing, and as soon as a task is transitioned away from processing we remove it from stealable.

Is there something I'm missing? Can this be removed? (Other PR, of course)
distributed/tests/test_steal.py (outdated)

    await wait(futures)
    ...
    mmock.assert_not_called()
This test is highly dependent on the internal API. Historically we've tried to avoid tests like this, because they make changing internal code really hard (future developers have to understand and change every test that depends on the internal API). Instead, I wonder if you could use something like c.map(..., workers=[...], allow_other_workers=True) and then check that none of the other workers got any data. That only requires that we keep the worker.data API, which I think is probably more stable than the move_task_request API.

Also, do we need the timeout=1000 here, or was that just for debugging?
Thanks, I already suspected this would become an issue during review :)
I didn't know about the allow_other_workers flag; I'll rewrite the test, ofc.
- No mocks

> Also, do we need the timeout=1000 here, or was this just for debugging?

Yes, that was for debugging. No timeout should be necessary here.
- No timeouts
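A rewrite along those lines might look roughly like this — a sketch, not the test that actually landed. It assumes distributed's test utilities `gen_cluster` and `slowinc`, and relies on `shuffle-split` being a prefix in the scheduler's `fast_tasks` blacklist:

```python
from distributed import wait
from distributed.utils_test import gen_cluster, slowinc


@gen_cluster(client=True)
async def test_blacklisted_prefix_not_stolen(c, s, a, b):
    # Pin every task to worker `a`; with allow_other_workers=True the
    # only way results end up on `b` is through work stealing.
    futures = c.map(
        slowinc,
        range(10),
        # "shuffle-split" is a blacklisted fast-task prefix, so these
        # tasks should never be stolen.
        key=[f"shuffle-split-{i}" for i in range(10)],
        workers=[a.address],
        allow_other_workers=True,
    )
    await wait(futures)
    # Assert through the public worker.data mapping instead of
    # internal stealing APIs: the other worker received nothing.
    assert len(b.data) == 0
```

This avoids both the mock and the internal `move_task_request` API: the only surface it depends on is `Client.map` and `worker.data`.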
If this is true then yes, it would be good to remove this code. It might be worth checking out …
Thank you for identifying and resolving this issue @fjetter. I imagine it was not trivial to track down.
I'll try to investigate the history and come back with what I find.
Indeed, this one was uncomfortable. We do not really instrument the work stealing, which made me fall back to watching dashboards, grepping logs, and pulling scheduler/worker state with …
(force-pushed a6d924e to 5240344)
(force-pushed 5240344 to 58e0e8a)
Thanks @fjetter. This is in.
This is a regression which dates back to 2.9.1.

With the introduction of the TaskPrefix code, work stealing for shuffle operations behaves drastically differently (not for the better). This can be traced back to the `split in fast_tasks` check, which fails because `split` is now a `TaskPrefix` object instead of the string `shuffle-split` stored in the `fast_tasks` dict.
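The root cause can be sketched in a few lines. The `TaskPrefix` class here is a simplified stand-in for distributed's real one, just to show the membership check:

```python
class TaskPrefix:
    """Simplified stand-in for distributed's TaskPrefix class."""

    def __init__(self, name):
        self.name = name


# The fast-task blacklist is keyed by *string* prefix names.
fast_tasks = {"shuffle-split"}

prefix = TaskPrefix("shuffle-split")

# Regression: a TaskPrefix instance never equals a string, so after the
# refactor the blacklist check silently stopped matching and fast
# shuffle-split tasks became eligible for stealing again.
assert (prefix in fast_tasks) is False

# Fix: compare by the prefix's name attribute.
assert prefix.name in fast_tasks
```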