Conversation
fd3cbb0 to
9a90025
Compare
9a90025 to
3da4d9e
Compare
Member
Author
|
With #5046 this problem still persists. I updated the patch and wrote a new test for the case where all keys are filtered out in which case gather_dep should be a no-op |
3da4d9e to
1e570a0
Compare
3 tasks
Member
Author
|
I would like to merge the PRs around work stealing in the following order
Since this is the first in line and relatively small, I'll go ahead and merge since other PRs partially fail due to this race condition. If there are any comments, I'll happily work them in afterwards |
3 tasks
crusaderky
reviewed
Oct 11, 2021
| # Keep namespace clean since this func is long and has many | ||
| # dep*, *ts* variables | ||
|
|
||
| assert cause is not None |
Collaborator
There was a problem hiding this comment.
This is required by mypy, as later in the code you're accessing a property of cause.
Reinstated in #2393.
This was referenced Oct 11, 2021
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There are two race conditions in here. I could only produce one of them in a test
If a key is transitioned to "flight", a
gather_deptask for this key is scheduled. This race condition requires the task to be scheduled in batch with another set of tasks. If one task of this batch is released before thegather_depcoroutine is first executed but after the transition inensure_communicate, thegather_depcoroutine will filter the released task and will only try to fetch results of the remote for the tasks still in state "flight".Once the response is available, all keys are tried to be parsed in the response, not only the actually fetched ones. I marked this situation with a FIXME a while ago but didn't anticipate the impact nor did I have an actual test backing this up.
Two things can now happen:
get_data_from_workerthe forgotten key is rescheduled, this results in us taking the "missing-data" path since the data for the key is not in the response, the task is not in state memory, the remote was not busy and the task has still dependents since it was rescheduled. In this case the worker issues a missing data signal to the scheduler which deletes the possibly only replica of that task on the remote which in turn might escalate pretty badly.I tried very hard in reproducing 2.) using multiple mocks and locks 🤯 but it is extremely difficult to produce due to the rather monolithic design of that particular section of code. I settled for for a test which covers only 1.) since 2.) is a special case of 1.) I intend to eliminate such race conditions by implementing a cancelled state as described in #4413 (comment) but this is way out of scope for this PR. The test will still be useful, though.