Skip to content

Regression in work stealing for blacklisted fast tasks#3591

Merged
mrocklin merged 2 commits intodask:masterfrom
fjetter:bugifx/workstealing_regression
Mar 23, 2020
Merged

Regression in work stealing for blacklisted fast tasks#3591
mrocklin merged 2 commits intodask:masterfrom
fjetter:bugifx/workstealing_regression

Conversation

@fjetter
Copy link
Copy Markdown
Member

@fjetter fjetter commented Mar 19, 2020

This is a regression which dates back to 2.9.1

Through the introduction of the TaskPrefix code, the work stealing for shuffle operations behaves drastically different (not for the better) which can be tracked back to the split in fast_tasks check which fails since the split was a TaskPrefix object instead of the string shuffle-split in the fast_tasks dict

@fjetter fjetter force-pushed the bugifx/workstealing_regression branch 5 times, most recently from d6bfd16 to b6995a3 Compare March 19, 2020 09:47
@fjetter
Copy link
Copy Markdown
Member Author

fjetter commented Mar 19, 2020

I believe build failures are unrelated. There is one failure connected to work stealing on Windows Python3.7 but so far I couldn't reproduce this. Maybe flaky? Are there tests known to be flaky in the test_stealing.py module?

@fjetter fjetter changed the title Fix regression where blacklisted fast tasks where still allowed to be stolen Regression in work stealing for blacklisted fast tasks Mar 19, 2020
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to track this down @fjetter!

I think we need a similar change here

for tts in self.stealable_unknown_durations.pop(ts.prefix, ()):

Re: test failures, test_dont_steal_unknown_functions is a known flaky test (xref #3574)

EDIT: There are also some known TLS test failures right now (xref #3588)

@fjetter fjetter force-pushed the bugifx/workstealing_regression branch from b6995a3 to a6d924e Compare March 20, 2020 08:11
@fjetter
Copy link
Copy Markdown
Member Author

fjetter commented Mar 20, 2020

I added the prefix.name patch to the line you mentioned but I couldn't add a test for it. In fact, I would argue it is unreachable code since the only place where we populate this dictionary is

self.stealable_unknown_durations[split].add(ts)

but this line assumes that ts.processing_on is None which is an invalid state since all stealing dicts are only populated for tasks in state processing and as soon as a task is transitioned away from processing we'll remove it from stealable.
Is there something I'm missing? Can this be removed? (Other PR, of course)


await wait(futures)

mmock.assert_not_called()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is highly dependent on the internal API. Historically we've tried to avoid tests like this, because it makes changing internal code really hard (future developers have to understand and change all tests that depend on the internal API). Instead, I wonder if you could use something like c.map(..., workers=[...], allow_other_workers=True) and then test that none of the other workers got any data. This only requires that we keep the worker.data API, which I think is probably more stable than the move_task_request API.

Also, do we need the timeout=1000 here, or was this just for debugging?

Copy link
Copy Markdown
Member Author

@fjetter fjetter Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I already suspected this to become an issue during review :)

I didn't know about the allow_other_workers flag, will rewrite the test, ofc.

  • No mocks

Also, do we need the timeout=1000 here, or was this just for debugging?

Yes, that was debugging. No timeout should be necessary here.

  • No timeouts

@mrocklin
Copy link
Copy Markdown
Member

but this line assumes that ts.processing_on is None which is an invalid state since all stealing dicts are only populated for tasks in state processing and as soon as a task is transitioned away from processing we'll remove it from stealable.
Is there something I'm missing? Can this be removed? (Other PR, of course)

If this is true then yes, it would be good to remove this code. It might be worth checking out git blame to see why this was added in the first place, and spot checking the various ways in which tasks can be removed from the processing state to make sure that we're not missing any stealable cleanups.

@mrocklin
Copy link
Copy Markdown
Member

Thank you for identifying and resolving this issue @fjetter . I imagine that it was not trivial to track down.

@fjetter
Copy link
Copy Markdown
Member Author

fjetter commented Mar 23, 2020

If this is true then yes, it would be good to remove this code. It might be worth checking out git blame to see why this was added in the first place

I'll try to investigate the history and come back with what I find.

I imagine that it was not trivial to track down.

Indeed, this one was uncomfortable. We do not really instrument the work stealing which made me fall back to "watching dashboards", greping logs and pulling scheduler/workerr state with client.run*. I this particular case a mere "unique prefixes which were subject to stealing" would've shown the issue already but I suspect this is worthless in most other scenarios. I'm wondering if it's worth collecting things like average steal_time_ratio per task/taskprefix, total count of stolen tasks, most likely victims / thieves, etc. Has this issue ever come up somewhere else?

@fjetter fjetter force-pushed the bugifx/workstealing_regression branch from a6d924e to 5240344 Compare March 23, 2020 09:26
@fjetter fjetter force-pushed the bugifx/workstealing_regression branch from 5240344 to 58e0e8a Compare March 23, 2020 09:45
@mrocklin mrocklin merged commit 4d4d935 into dask:master Mar 23, 2020
@mrocklin
Copy link
Copy Markdown
Member

Thanks @fjetter . This is in.

@fjetter fjetter deleted the bugifx/workstealing_regression branch March 24, 2020 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants