[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py by WangTaoTheTonic · Pull Request #11339 · ray-project/ray

WangTaoTheTonic · 2020-10-12T08:29:32Z

Why are these changes needed?

In monitor.py we subscribe to heartbeat batches to get the resources of all nodes and feed them to the load metrics.
When light heartbeat is enabled, heartbeat batches are broadcast only partially: a node's resources are included only when they have changed.
We should therefore fetch all heartbeat info once when the monitor subscribes to this pattern, then use that complete resource view as the baseline for incoming updates.
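
The idea of starting from a full snapshot and layering partial updates on top can be illustrated with a minimal sketch. This is not the code in monitor.py; the `ResourceMap` shape and the `apply_delta` helper are hypothetical names chosen only for illustration.

```python
from typing import Dict

# Hypothetical shape: node_id -> {resource_name: available_amount}.
ResourceMap = Dict[str, Dict[str, float]]


def apply_delta(snapshot: ResourceMap, delta: ResourceMap) -> ResourceMap:
    """Return a new view where resources present in `delta` overwrite `snapshot`."""
    # Copy the inner dicts so the original snapshot is left untouched.
    merged = {node: dict(resources) for node, resources in snapshot.items()}
    for node, changed in delta.items():
        merged.setdefault(node, {}).update(changed)
    return merged


# Full snapshot fetched once, before any partial updates arrive.
snapshot = {
    "node-a": {"CPU": 4.0, "GPU": 1.0},
    "node-b": {"CPU": 8.0},
}
# A light heartbeat that reports only what changed on node-a.
delta = {"node-a": {"CPU": 2.0}}

current = apply_delta(snapshot, delta)
print(current)
```

Without the initial snapshot, the monitor would know nothing about node-b or node-a's GPU until those values happened to change.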

Related issue number

part of #10355

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

src/ray/gcs/gcs_server/gcs_node_manager.cc

src/ray/gcs/gcs_client/global_state_accessor.cc

rkooo567 · 2020-10-14T02:38:58Z

Does getAllHeartbeat return the heartbeat information at the moment? Just want to confirm it doesn't return all historical heartbeat information. (Maybe you can add some comments to make it clear).

WangTaoTheTonic · 2020-10-14T03:20:02Z

Yes, it returns only the newest heartbeat information for each node. I'll add some comments.

rkooo567 · 2020-10-14T05:16:29Z

python/ray/monitor.py

            redis_address, password=redis_password)
+        self.global_state_accessor = GlobalStateAccessor(
+            redis_address, redis_password, False)
+        self.global_state_accessor.connect()


Why don't we run _initialize_global_state?

_initialize_global_state is a function in state.py. It initializes a global state accessor internally but does not expose it.

What we need here is to initialize a global state accessor and use it to get heartbeats from the GCS, which state.py does not provide.

python/ray/includes/global_state_accessor.pxi

WangTaoTheTonic · 2020-10-16T02:01:26Z

PrintLogTest.CallstackTraceTest failed in the Windows build, but I don't know why.

ericl · 2020-10-16T20:43:25Z

@WangTaoTheTonic @rkooo567 at a higher level I'm wondering why we don't always call get_all_heartbeat() from the GCS (poll data) rather than subscribing to delta changes. This would considerably simplify the code, and since the monitor.py is only polling directly from the GCS every 10 seconds, the performance should be fine even in a very large cluster.

So basically from raylets -> GCS we have lightweight delta heartbeats at a fine granularity, but from GCS -> autoscaler we always poll for full snapshots of the data every 10 seconds.

@rkooo567 IIRC we earlier discussed switching to a poll-based approach for the autoscaler rather than attempting pub-sub.
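
As a rough sketch of the poll-based approach described above (not the actual monitor.py code), the consumer simply replaces its view with a fresh full snapshot on every round. `fetch_all_heartbeats` and `on_snapshot` are hypothetical callables standing in for the GCS query and the load-metrics update.

```python
import time
from typing import Callable, Dict

ResourceMap = Dict[str, Dict[str, float]]


def poll_load_metrics(
    fetch_all_heartbeats: Callable[[], ResourceMap],
    on_snapshot: Callable[[ResourceMap], None],
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Fetch a full snapshot once per round and hand it to the consumer."""
    for i in range(rounds):
        # Each round replaces the previous view, so no delta bookkeeping is needed.
        on_snapshot(fetch_all_heartbeats())
        if i < rounds - 1:
            sleep(interval_s)


# Usage with a fake data source standing in for the GCS.
seen = []
poll_load_metrics(
    fetch_all_heartbeats=lambda: {"node-a": {"CPU": 4.0}},
    on_snapshot=seen.append,
    interval_s=10.0,
    rounds=2,
    sleep=lambda _: None,  # skip real waiting in this demo
)
print(len(seen))
```

The trade-off is that each poll transfers the whole cluster view, which is acceptable only when the polling interval is long relative to the heartbeat interval.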

rkooo567 · 2020-10-16T20:48:33Z

@WangTaoTheTonic That failure shouldn't be related to your PR. Another PR caused it (sorry, I merged again without checking the Windows build carefully). It will be fixed by #11413.

Also, +1 for @ericl's suggestion.

WangTaoTheTonic · 2020-10-17T14:54:35Z

@ericl @rkooo567 Polling heartbeat data directly in monitor.py would be fine as long as the autoscaler doesn't fetch it very often.
I'll simplify the code as suggested.

WangTaoTheTonic · 2020-10-19T02:41:09Z

Hey Eric! After a second check, I found that the polling interval in monitor.py is not 10 seconds but one heartbeat interval (100 ms)!

Please confirm whether we should still replace pub-sub with direct polling if the interval is 100 ms.

For the interval, see the code segment below:

        # Handle messages from the subscription channels.
        while True:
            # Process autoscaling actions
            if self.autoscaler:
                # Only used to update the load metrics for the autoscaler.
                self.update_raylet_map()
                self.autoscaler.update()

            # Process a round of messages.
            self.process_messages()

            # Wait for a heartbeat interval before processing the next round of
            # messages.
            time.sleep(
                ray._config.raylet_heartbeat_timeout_milliseconds() * 1e-3)

WangTaoTheTonic · 2020-10-19T03:12:23Z

@ericl

ericl · 2020-10-19T22:55:16Z

Ah, it's throttled to 10.0 seconds in autoscaler.py internally (see the _update function). Because of this, it should be fine (no change in behavior) to reduce the frequency of calling update to every 10 seconds in monitor.py.
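
To illustrate why the frequent calls are harmless, here is a minimal sketch of internal throttling. It is not the actual `_update` logic in autoscaler.py; the `ThrottledUpdater` class and its parameters are hypothetical and only show the general pattern.

```python
import time
from typing import Callable, Optional


class ThrottledUpdater:
    """Run `action` at most once per `min_interval_s`, however often update() is called."""

    def __init__(
        self,
        action: Callable[[], None],
        min_interval_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._action = action
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_run: Optional[float] = None

    def update(self) -> bool:
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._min_interval_s:
            return False  # called too soon; skip the expensive work
        self._last_run = now
        self._action()
        return True


# Usage with a fake clock: update() is called every 100 ms for 20 simulated seconds.
fake_now = [0.0]
runs = []
updater = ThrottledUpdater(lambda: runs.append(fake_now[0]), clock=lambda: fake_now[0])
for tick in range(201):
    fake_now[0] = tick / 10
    updater.update()
print(runs)
```

Because the throttle lives inside the callee, calling it every 100 ms or every 10 seconds produces the same number of real updates, which is why lowering the call frequency in the outer loop does not change behavior.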

WangTaoTheTonic · 2020-10-27T08:20:07Z

@ericl @rkooo567
Placement group information has been moved from SendBatchedHeartbeatData into getAllHeartbeat, and all TODOs and test cases are fixed. :)

python/ray/monitor.py

ericl · 2020-10-27T22:20:30Z

python/ray/monitor.py

+        self.get_all_heartbeat()
        # Initialize the subscription channel.
-        self.psubscribe(ray.gcs_utils.XRAY_HEARTBEAT_BATCH_PATTERN)
        self.psubscribe(ray.gcs_utils.XRAY_JOB_PATTERN)


Wondering if we should just remove the pubsub client entirely now that it isn't really used. It seems the job handler just prints a message.

This PR is quite large now, so it may be better to file another one to remove the pubsub client and job handler. I'll do that after this one is merged.

src/ray/gcs/gcs_server/gcs_node_manager.cc

ericl

@WangTaoTheTonic this looks great, thanks for making the extra changes here to clean up the protocol. Just one comment on perhaps not resetting placement group load now that we are polling for it infrequently.

rkooo567 · 2020-10-28T04:16:43Z

src/ray/gcs/gcs_server/gcs_node_manager.cc

+      batch->add_batch()->Swap(&heartbeat.second);
+    }
+
+    for (auto &demand : aggregate_load) {


Can we take this logic out of SendBatchedHeartbeat? I think it won't have any impact because the resource demand and placement group load are only used by the autoscaler. Please lmk if I am wrong @ericl.

Oh nvm. It looks like it is already handled.

This is a nice catch!
We removed that earlier in this PR, but it came back after merging master. All we can do is remove it again :(

WangTaoTheTonic · 2020-10-28T14:20:16Z

The failed tests seem unrelated.

Init load metric first before sub heartbeat batch, in light heartbeat…

d618e1a

… mode

WangTaoTheTonic changed the title ~~Init load metric first before sub heartbeat batch, in light heartbeat…~~ [GCS]Init load metric first before sub heartbeat batch, in light heartbeat… Oct 12, 2020

WangTaoTheTonic added 2 commits October 12, 2020 16:44

update raylet map first

1f1b25f

lint

e0fc791

rkooo567 assigned rkooo567 and ericl Oct 12, 2020

merge master

d588324

WangTaoTheTonic requested review from ericl and rkooo567 October 13, 2020 08:48

ffbin reviewed Oct 14, 2020

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated)

ffbin reviewed Oct 14, 2020

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated)

ffbin reviewed Oct 14, 2020

src/ray/gcs/gcs_client/global_state_accessor.cc (outdated)

rkooo567 reviewed Oct 14, 2020

rkooo567 added the @author-action-required label Oct 14, 2020

add comments, test case

d6004ed

WangTaoTheTonic removed the @author-action-required label Oct 15, 2020

Merge branch 'master' into monitor

831f328

ericl added the @author-action-required label Oct 16, 2020

WangTaoTheTonic removed the @author-action-required label Oct 19, 2020

ericl added the @author-action-required label Oct 19, 2020

use getting all heartbeat instead of sub to update resources, and inc…

0f9ec2b

…rease update interval

WangTaoTheTonic added 3 commits October 27, 2020 10:45

fix test global state

65086e4

fix test

b545843

fix para

844fd80

WangTaoTheTonic removed the @author-action-required label Oct 27, 2020

WangTaoTheTonic mentioned this pull request Oct 27, 2020

[GCS]Decouple node failure detector with resoure related operations #11465

Merged

6 tasks

ericl reviewed Oct 27, 2020

python/ray/monitor.py (outdated)

ericl reviewed Oct 27, 2020

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated)

ericl reviewed Oct 27, 2020

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated)

ericl reviewed Oct 27, 2020

src/ray/gcs/gcs_server/gcs_node_manager.cc (outdated)

ericl requested changes Oct 27, 2020

ericl added the @author-action-required label Oct 27, 2020

WangTaoTheTonic added 5 commits October 28, 2020 10:59

merge master to resolve conflicts

e3d62fa

handle placement group correctly

8b4b524

remove bool flag of load metrics update

e402f67

adding up missing ones

372a497

remove unused flag

f77388e

rkooo567 reviewed Oct 28, 2020

WangTaoTheTonic added 3 commits October 28, 2020 13:49

fix tests

1417e97

remove pg related in sendbatchedheartbeat, again

ffb7aa4

Remove missing flag

b8eeb9d

ericl approved these changes Oct 28, 2020

ericl removed the @author-action-required label Oct 28, 2020

WangTaoTheTonic changed the title ~~[GCS]Init load metric first before sub heartbeat batch, in light heartbeat…~~ [GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py Oct 28, 2020

ericl merged commit 1d5694d into ray-project:master Oct 28, 2020

WangTaoTheTonic deleted the monitor branch October 29, 2020 01:35

WangTaoTheTonic mentioned this pull request Oct 29, 2020

[GCS]Open lightweight heartbeat by default #10355

Closed

WangTaoTheTonic added this to the Ray scalability and stability milestone Jan 13, 2021
