Remove EnsureCommunicatingAfterTransitions by crusaderky · Pull Request #6462 · dask/distributed

crusaderky · 2022-05-26T16:18:16Z

Partially closes #6497

#6165 introduced this hack, for the sake of being functionally identical to the previous code.
This is a cleaner redesign which is conceptually the same.

distributed/tests/test_worker.py

crusaderky · 2022-05-26T16:36:25Z

[EDIT] this comment refers to a previous version of the PR. The condition described below is still there, but only triggered by multiple events in short sequence.

This PR introduces an O(n^2*logn) condition where

task y has 100 dependencies x1, ... x100 all from the same few workers (nworkers << ntasks)
x1 transitions released->fetch->_ensure_communicating->flight. len(data_needed) == 0.
x2 transitions released->fetch->_ensure_communicating-> remain in fetch. len(data_needed) == 1; we had to skip 0 tasks before it in data_needed.
...x100 transitions released->fetch->_ensure_communicating-> remain in fetch. len(data_needed) == 99; we had to skip 98 tasks before it in data_needed.
So we had ~100^2/2 ~= 5000 pure-CPU iterations, each of which pops a task from data_needed, writes it to skipped_worker_in_flight_or_busy, and then adds it back to data_needed at the end of _ensure_communicating. As data_needed is a heap, each push to it is O(logn), where n is the number of tasks it contains.

This should be negligible most of the time. I'll write another PR later on to make running _ensure_communicating twice in a row truly negligible (blocked by #6388).

To clarify: this condition is already there when you have many compute-task and acquire-replicas requests, which cause the data_needed queue to expand rapidly. This PR specifically extends the condition to tasks fetched within the same event.
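The quadratic growth described above can be reproduced with a toy model. This is a minimal sketch, not the actual worker code: `simulate` is a hypothetical helper, integers stand in for TaskState objects, and it assumes a single peer worker that becomes busy as soon as the first task goes to flight. It counts every task popped and pushed back, including the newly added one, so the total is slightly higher than the "tasks before it" count used in the comment.

```python
import heapq


def simulate(n_tasks: int) -> int:
    """Count how many times a task is popped from the heap and pushed back.

    Toy model only: one peer worker, one pass over the heap per new task.
    """
    data_needed = []          # heap of priorities (stand-in for TaskState objects)
    worker_in_flight = False  # the only peer worker is busy once x1 is in flight
    skipped_total = 0

    for priority in range(n_tasks):
        # released -> fetch: the task is added to data_needed
        heapq.heappush(data_needed, priority)

        # One _ensure_communicating-like pass that drains the whole heap
        skipped = []
        while data_needed:
            task = heapq.heappop(data_needed)
            if not worker_in_flight:
                worker_in_flight = True  # first task goes fetch -> flight
            else:
                skipped.append(task)     # worker busy: task stays in fetch

        # Skipped tasks are pushed back at the end of the pass, O(log n) each
        for task in skipped:
            heapq.heappush(data_needed, task)
        skipped_total += len(skipped)

    return skipped_total


print(simulate(100))  # 4950, i.e. roughly 100**2 / 2
```

Doubling `n_tasks` roughly quadruples the count, which is why the cost only matters when many tasks are fetched from very few workers.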

github-actions · 2022-05-26T20:08:20Z

Unit Test Results

      15 files +      12       15 suites +12 6h 40m 57s ⏱️ + 5h 53m 11s
  2 831 tests +  1 635   2 748 ✔️ +  1 586   81 💤 +  47 2 ❌ +2
20 979 runs +17 394 20 032 ✔️ +16 549 945 💤 +843 2 ❌ +2

For more details on these failures, see this check.

Results for commit e1058fe. ± Comparison against base commit 69b798d.

♻️ This comment has been updated with latest results.

crusaderky · 2022-05-27T10:56:12Z

[EDIT] this comment refers to a previous version of the PR and is now obsolete.

The transition log has changed from

           - ('x', 'ensure-task-exists', 'released')
           - ('x', 'released', 'fetch', 'fetch', {})
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'})
           - ('x', 'fetch', 'flight', 'flight', {})

to

           - ('x', 'ensure-task-exists', 'released'),
           - ('gather-dependencies', 'tcp://127.0.0.1:53985', {'x'}),
           - ('x', 'released', 'fetch', 'fetch', {'x': ('flight', 'tcp://127.0.0.1:53985')}),
           - ('x', 'fetch', 'flight', 'flight', {}),

This... is correct, but it's very counter-intuitive.
What's happening is that

1. transition_released_fetch starts. It sets ts.state = "fetch", adds ts to data_needed, and internally invokes _ensure_communicating.
2. _ensure_communicating removes ts from data_needed, writes its own log line gather-dependencies, and returns a recommendation to transition to flight, together with a GatherDep instruction.
3. transition_released_fetch returns.
4. _transition logs the released->fetch transition and returns the recommendations and instructions generated by _ensure_communicating.
5. _transitions calls _transition again for fetch->flight and logs the outcome.

Again, all this is correct, but it's confusing; it took me an unhealthy amount of time to figure out why the gather-dependencies log line appeared before ('x', 'released', 'fetch', 'fetch'). Not sure if we can/want to do anything about this?
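The ordering can be illustrated with a stripped-down model of the call chain. This is a sketch under stated assumptions, not the worker state machine: all function names below are simplified stand-ins, the log tuples are shortened, and only a single key is tracked. The point is that the nested call logs first, and the outer transition is only logged after its handler has returned.

```python
log = []


def ensure_communicating(key):
    # Stand-in for _ensure_communicating: logs immediately, then recommends flight
    log.append(("gather-dependencies", key))
    return {key: "flight"}


def transition_released_fetch(key):
    # The handler invokes ensure_communicating *before* it returns
    return ensure_communicating(key)


def transition_fetch_flight(key):
    return {}


HANDLERS = {
    ("released", "fetch"): transition_released_fetch,
    ("fetch", "flight"): transition_fetch_flight,
}


def transition(key, start, finish):
    recs = HANDLERS[(start, finish)](key)
    # The transition itself is logged only after the handler has returned;
    # a copy is stored so later mutation of recs does not alter the log
    log.append((key, start, finish, dict(recs)))
    return recs


def transitions(key, start, finish):
    state, recs = start, {key: finish}
    while recs:
        k, target = recs.popitem()
        recs = transition(k, state, target)
        state = target


transitions("x", "released", "fetch")
for entry in log:
    print(entry)
# ('gather-dependencies', 'x')
# ('x', 'released', 'fetch', {'x': 'flight'})
# ('x', 'fetch', 'flight', {})
```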

fjetter · 2022-06-03T08:17:45Z

> This... is correct, but it's very counter-intuitive.
> What's happening is that

As I'm arguing in #6442 (comment), I believe this log should simply be removed.

fjetter · 2022-06-03T10:22:31Z

I could reproduce the issue with the test's assumptions about which keys are fetched. This is not entirely obvious from the tests, and I find it a bit concerning.

It's also not about priorities, ordering, etc., but rather that we're using an unordered set for ts.dependencies, which then triggers transitions here:

distributed/distributed/worker.py

Lines 2125 to 2129 in 6d85a85

    
    for dep_ts in ts.dependencies:
        if dep_ts.state != "memory":
            ts.waiting_for_data.add(dep_ts)
            dep_ts.waiters.add(ts)
            recommendations[dep_ts] = "fetch"

This randomness was previously buffered by the delayed _ensure_communicating. I don't feel great about introducing randomness to our scheduling (e.g. there is not even a key-based tie breaker). This entire refactoring effort is supposed to help us make things more deterministic.
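To make the concern concrete, here is a small sketch of what a key-based tie breaker could look like. It is purely illustrative and not what this PR or distributed implements: `Task` and `recommend_fetch` are hypothetical stand-ins. Iterating a plain set of objects yields an order that depends on their hashes, whereas sorting by key first makes the order of recommendations reproducible.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances fit in a set
class Task:
    key: str
    state: str = "released"


def recommend_fetch(dependencies):
    """Recommend 'fetch' for every dependency not in memory, in key order."""
    recommendations = {}
    # sorted() removes the dependence on set iteration order
    for dep in sorted(dependencies, key=lambda t: t.key):
        if dep.state != "memory":
            recommendations[dep.key] = "fetch"
    return recommendations


deps = {Task("x3"), Task("x1"), Task("x2", state="memory")}
print(list(recommend_fetch(deps)))  # ['x1', 'x3'] regardless of set order
```

Sorting costs O(n log n) per call, so whether such a tie breaker is worth it depends on how many dependencies a task typically has.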

fjetter · 2022-06-03T11:30:30Z

I opened #6497 with a suggestion on how to move forward with ensure_communicating. Maybe it's worth putting this PR on ice until #6497 is settled

crusaderky · 2022-06-09T17:07:05Z

This has been parked until a new design is agreed upon in #6497.

github-actions · 2022-06-16T15:42:58Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±  0       15 suites ±0 10h 10m 51s ⏱️ +27s
  2 896 tests +  2   2 811 ✔️ +  3   84 💤 ±0 1 ❌ - 1
21 451 runs +14 20 486 ✔️ +15 964 💤 ±0 1 ❌ - 1

For more details on these failures, see this check.

Results for commit 9051170. ± Comparison against base commit 88e1fe0.

♻️ This comment has been updated with latest results.

crusaderky · 2022-06-17T15:49:38Z

The PR has been rewritten from scratch and is now ready for review and merge.
It exacerbates the O(n^2*logn) condition described above; this is fixed in #6587.

fjetter

This is great!

fjetter · 2022-06-24T16:44:52Z

distributed/worker_state_machine.py

-        return merge_recs_instructions(
-            (recommendations, []),
-            self._ensure_communicating(stimulus_id=ev.stimulus_id),
-        )


I love that this is not everywhere anymore ❤️

crusaderky self-assigned this May 26, 2022

crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 3 times, most recently from 2c32233 to b7a4538 Compare May 30, 2022 14:30

This was referenced May 30, 2022

Rework some tests related to gather_dep #6472

Merged

Yank state machine out of Worker class #6476

Closed

crusaderky linked an issue May 30, 2022 that may be closed by this pull request

Yank state machine out of Worker class #6476

Closed

crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 201bd67 to 17ce3ef Compare June 1, 2022 13:12

crusaderky marked this pull request as ready for review June 2, 2022 18:37

fjetter mentioned this pull request Jun 3, 2022

Alternatives for current ensure_communicating #6497

Closed

crusaderky removed a link to an issue Jun 6, 2022

Yank state machine out of Worker class #6476

Closed

jrbourbeau mentioned this pull request Jun 7, 2022

Release 2022.6.0 dask/community#252

Closed

9 tasks

crusaderky marked this pull request as draft June 10, 2022 16:32

crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch 5 times, most recently from aa5273d to 1190bcc Compare June 16, 2022 14:12

crusaderky mentioned this pull request Jun 16, 2022

Deduplicate data_needed #6587

Merged

crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from 1190bcc to ab0e9a1 Compare June 17, 2022 12:00

Remove EnsureCommunicatingAfterTransitions

5715261

crusaderky force-pushed the WSMR/EnsureCommunicatingAfterTransitions branch from ab0e9a1 to 5715261 Compare June 17, 2022 12:30

tweak test

57168cc

crusaderky marked this pull request as ready for review June 17, 2022 15:39

jsignell mentioned this pull request Jun 20, 2022

Release 2022.6.1 dask/community#258

Closed

9 tasks

crusaderky added 3 commits June 22, 2022 11:01

Merge branch 'main' into WSMR/EnsureCommunicatingAfterTransitions

14adc6a

Merge branch 'main' into WSMR/EnsureCommunicatingAfterTransitions

f8911dd

Use ws fixture

dbd2dc1

crusaderky mentioned this pull request Jun 22, 2022

Adding replicas to a task in fetch now sends it to flight immediately #6594

Merged

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 22, 2022

Remove EnsureCommunicatingAfterTransitions (dask#6462)

9578538

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022

Remove EnsureCommunicatingAfterTransitions (dask#6462)

083a908

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022

Remove EnsureCommunicatingAfterTransitions (dask#6462)

e3b70da

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jun 23, 2022

Remove EnsureCommunicatingAfterTransitions (dask#6462)

7c40e1b

crusaderky mentioned this pull request Jun 24, 2022

Benchmark WorkerState._ensure_communicating dask/dask-benchmarks#50

Merged

Merge branch 'main' into WSMR/EnsureCommunicatingAfterTransitions

7fd7b36

fjetter approved these changes Jun 24, 2022

Merge branch 'main' into WSMR/EnsureCommunicatingAfterTransitions

9051170

crusaderky merged commit 4b24753 into dask:main Jun 26, 2022

crusaderky deleted the WSMR/EnsureCommunicatingAfterTransitions branch June 26, 2022 08:03

crusaderky merged 7 commits into dask:main from crusaderky:WSMR/EnsureCommunicatingAfterTransitions