Limit incoming data transfers by amount of data#6975
Conversation
**Unit Test Results**

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 6h 30m 32s ⏱️ +2m 52s

Results for commit 0a5517d. ± Comparison against base commit bfc5cfe.

♻️ This comment has been updated with latest results.
Note: The current implementation may cross the limit in some instances, i.e., any time we start gathering from a new worker. I'm working on cleaner logic that avoids crossing the limit unless we are not gathering any data at all. This change may also be moved to a separate PR if we want to get this general change merged.
I adjusted the logic; now we never exceed the limit except for the first task to gather, which ensures that we make progress.
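The rule can be sketched as a standalone function (illustrative only; the function name and the `(key, nbytes)` queue representation are stand-ins invented for this example, not the actual worker-state-machine code):

```python
def select_tasks(queue, incoming_bytes, limit):
    """Pick tasks to gather from a priority-ordered list of (key, nbytes)
    pairs without pushing in-flight bytes past ``limit`` -- except that the
    very first task is always taken when nothing else is in flight, so the
    worker keeps making progress even on oversized tasks."""
    to_gather = []
    total = 0
    for key, nbytes in queue:
        # Stop before crossing the limit, unless taking this task is the
        # only way to make any progress at all.
        if (to_gather or incoming_bytes) and incoming_bytes + total + nbytes > limit:
            break
        to_gather.append(key)
        total += nbytes
    return to_gather, total
```

With a 150-byte limit, `select_tasks([("a", 100), ("b", 100)], 0, 150)` gathers only `"a"`, while `select_tasks([("a", 500)], 0, 100)` still gathers the oversized first task because nothing else is in flight.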
```python
assert ws.tasks["a"].state == "fetch"
assert ws.tasks["b"].state == "flight"
assert ws.tasks["c"].state == "flight"
```
This is only deterministic because there is some inherent ordering in Python collections (e.g., dicts are insertion-ordered). From first principles, the worker doesn't prefer either of these tasks over the other. I think we should write the test agnostic to this. Asked differently: would you have known which is scheduled and which is queued before executing the test once?
how about

```python
from collections import defaultdict

tasks_by_state = defaultdict(list)
for ts in ws.tasks.values():
    tasks_by_state[ts.state].append(ts)
assert len(tasks_by_state["flight"]) == 2
assert len(tasks_by_state["fetch"]) == 1
# NOTE: We do not compare instructions since their sorting is random
ws.handle_stimulus(
    GatherDepSuccessEvent(
        worker=ws2,
        data={ts.key: 123 for ts in tasks_by_state["flight"]},
        total_nbytes=200,
        stimulus_id="s2",
    )
)
assert all(ts.state == "memory" for ts in tasks_by_state["flight"])
assert all(ts.state == "flight" for ts in tasks_by_state["fetch"])
```
Fair point, this makes an effort to hide unnecessary details and limit the assertions to what's important.
distributed/worker_state_machine.py
Outdated
```diff
 to_gather: list[TaskState] = []
 total_nbytes = 0

 while available:
     ts = available.peek()
-    # The top-priority task is fetched regardless of its size
-    if (
-        to_gather
-        and total_nbytes + ts.get_nbytes() > self.transfer_message_target_bytes
+    gather_at_least_one_task = self.transfer_incoming_bytes or to_gather
+    exceed_message_target = (
+        total_nbytes + ts.get_nbytes() > self.transfer_message_target_bytes
+    )
+    exceed_bytes_limit = (
+        self.transfer_incoming_bytes_limit is not None
+        and self.transfer_incoming_bytes + total_nbytes + ts.get_nbytes()
+        > self.transfer_incoming_bytes_limit
+    )
+    if gather_at_least_one_task and (
+        exceed_message_target or exceed_bytes_limit
```
How about

```python
if self.transfer_incoming_bytes_limit:
    bytes_left_to_fetch = min(
        self.transfer_incoming_bytes_limit - self.transfer_incoming_bytes,
        self.transfer_message_target_bytes,
    )
else:
    bytes_left_to_fetch = self.transfer_message_target_bytes
while available:
    ts = available.peek()
    if (
        # If there is no other traffic, the top-priority task may be
        # fetched regardless of its size
        to_gather or self.transfer_incoming_bytes
    ) and total_nbytes + ts.get_nbytes() > bytes_left_to_fetch:
        break
    for worker in ts.who_has:
        # This also effectively pops from available
        self.data_needed[worker].remove(ts)
    to_gather.append(ts)
    total_nbytes += ts.get_nbytes()
return to_gather, total_nbytes
```

At least subjectively this seems simpler.
Agreed, this looks simpler.
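The budget computation in the suggestion can be isolated into a standalone sketch (the function name and plain arguments are stand-ins for the worker attributes, invented for illustration):

```python
def bytes_left_to_fetch(incoming_bytes, incoming_limit, message_target):
    """Per-message byte budget: the message target, further clamped by
    whatever headroom remains under the global incoming-transfer limit.
    ``incoming_limit`` may be None, meaning no global limit."""
    if incoming_limit is not None:
        return min(incoming_limit - incoming_bytes, message_target)
    return message_target
```

For example, with 90 bytes already in flight under a 100-byte limit, the budget drops to 10 even if the message target is 50; with no limit configured, the message target applies unchanged.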
```python
assert ws.tasks["a"].state == "flight"
assert ws.tasks["b"].state == "fetch"
```
Similar concern about determinism as above.
```python
assert instructions == [
    GatherDep.match(
        worker=ws2,
        to_gather={"a"},
        stimulus_id="s1",
    ),
]
assert ws.tasks["a"].state == "flight"
```
I think you should either assert the state or the instruction. Asserting both is redundant.
Removed the instruction checks
distributed/worker.py
Outdated
```python
if self.memory_manager.memory_limit is None
else int(
    self.memory_manager.memory_limit
    * dask.config.get("distributed.worker.memory.transfer")
```
According to `distributed-schema.yaml`, `distributed.worker.memory.transfer` is allowed to be `False`. This would raise in that case.
Good catch, thanks!
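A hedged sketch of the resulting guard, using plain stand-in values in place of the worker's `memory_manager` and `dask.config` (the helper name is invented for illustration):

```python
def incoming_bytes_limit(memory_limit, transfer):
    """Compute the incoming-transfer byte limit. Per distributed-schema.yaml,
    the `distributed.worker.memory.transfer` setting may be a fraction or
    False, so the limit is only computed when both a memory limit and a
    fraction are configured."""
    if memory_limit is None or transfer is False:
        return None  # unlimited
    return int(memory_limit * transfer)
```

E.g., a 1000-byte memory limit with `transfer=0.1` yields a 100-byte limit, while `transfer=False` disables the limit entirely.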
hendrikmakait
left a comment
Incorporated review feedback
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
Closes #6208
`pre-commit run --all-files`