Remove EnsureCommunicatingAfterTransitions#6462
Conversation
|
[EDIT] this comment refers to a previous version of the PR. The condition described below is still there, but only triggered by multiple events in short sequence. This PR introduces a O(n^2*logn) condition where
This should be negligible most of the time. I'll write another PR later on to make running _ensure_communicating twice in a row truly negligible (blocked by #6388). To clarify: this condition is already there when you have many compute-task and acquire-replicas requests, which cause the data_needed queue to expand rapidly. This PR specifically extends the condition to tasks fetched within the same event. |
Unit Test Results 15 files + 12 15 suites +12 6h 40m 57s ⏱️ + 5h 53m 11s For more details on these failures, see this check. Results for commit e1058fe. ± Comparison against base commit 69b798d. ♻️ This comment has been updated with latest results. |
|
[EDIT] this comment refers to a previous version of the PR and is now obsolete. The transition log has changed from to This... is correct, but it's very counter-intuitive.
Again, all this is correct, but it's confusing; it took me an unhealthy amount of time to figure out why the |
2c32233 to
b7a4538
Compare
201bd67 to
17ce3ef
Compare
As I'm arguing in #6442 (comment) I believe this log should simply be removed |
|
I could reproduce the issue about the assumptions in the test about which keys to be fetched. This is not entirely obvious from the tests and I consider this a bit concerning. It's also not about any priorities, ordering, etc. but rather that we're using an unordered set for distributed/distributed/worker.py Lines 2125 to 2129 in 6d85a85 This randomness was previously buffered by the delayed _ensure_communicating. I don't feel great about introducing randomness to our scheduling (e.g. there is not even a key-based tie breaker). This entire refactoring effort is supposed to help us make things more deterministic. |
|
This has been parked until a new design is agreed upon in #6497. |
aa5273d to
1190bcc
Compare
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ± 0 15 suites ±0 10h 10m 51s ⏱️ +27s For more details on these failures, see this check. Results for commit 9051170. ± Comparison against base commit 88e1fe0. ♻️ This comment has been updated with latest results. |
1190bcc to
ab0e9a1
Compare
ab0e9a1 to
5715261
Compare
|
The PR has been rewritten from scratch and is now ready for review and merge. |
| return merge_recs_instructions( | ||
| (recommendations, []), | ||
| self._ensure_communicating(stimulus_id=ev.stimulus_id), | ||
| ) |
There was a problem hiding this comment.
I love that this is not everywhere anymore ❤️
Partially closes #6497
#6165 introduced this hack, for the sake of being functionally identical to the previous code.
This is a cleaner redesign which is conceptually the same.