Skip to content

[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py#11339

Merged
ericl merged 25 commits intoray-project:masterfrom
antgroup:monitor
Oct 28, 2020
Merged

[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py#11339
ericl merged 25 commits intoray-project:masterfrom
antgroup:monitor

Conversation

@WangTaoTheTonic
Copy link
Copy Markdown
Contributor

@WangTaoTheTonic WangTaoTheTonic commented Oct 12, 2020

… mode

Why are these changes needed?

In monitor.py we sub heartbeat batch to get resources of all nodes, feeding to load metric.
while light heartbeat enabled, the heartbeat batch are broadcast partially only when the resources in them were changed.
We should get all heartbeat infos first once monitor sub this pattern, then use this whole resources info as a foundation for incoming update.

Related issue number

part of #10355

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@WangTaoTheTonic WangTaoTheTonic changed the title Init load metric first before sub heartbeat batch, in light heartbeat… [GCS]Init load metric first before sub heartbeat batch, in light heartbeat… Oct 12, 2020
@rkooo567
Copy link
Copy Markdown
Contributor

Does getAllHeartbeat return the heartbeat information at the moment? Just want to confirm it doesn't return all historical heartbeat information. (Maybe you can add some comments to make it clear).

@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

Does getAllHeartbeat return the heartbeat information at the moment? Just want to confirm it doesn't return all historical heartbeat information. (Maybe you can add some comments to make it clear).

Yes it returns all newest heartbeat information of all nodes. I'll add some comments.

redis_address, password=redis_password)
self.global_state_accessor = GlobalStateAccessor(
redis_address, redis_password, False)
self.global_state_accessor.connect()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we run _initialize_global_state?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_initialize_global_state is a function in state.py and it init a global state accessor inside but not explode it.

What we need here is init a global state accessor and use it to get heartbeats from gcs, which is not included in state.py

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 14, 2020
@WangTaoTheTonic WangTaoTheTonic removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 15, 2020
@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

WangTaoTheTonic commented Oct 16, 2020

PrintLogTest.CallstackTraceTest failed in windows build but don't know why.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Oct 16, 2020

@WangTaoTheTonic @rkooo567 at a higher level I'm wondering why we don't always call get_all_heartbeat() from the GCS (poll data) rather than subscribing to delta changes. This would considerably simplify the code, and since the monitor.py is only polling directly from the GCS every 10 seconds, the performance should be fine even in a very large cluster.

So basically from raylets -> GCS we have lightweight delta heartbeats at a fine granularity, but from GCS -> autoscaler we always poll for full snapshots of the data every 10 seconds.

@rkooo567 IIRC we earlier discussed switching to a poll-based approach for the autoscaler rather than attempting pub-sub.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 16, 2020
@rkooo567
Copy link
Copy Markdown
Contributor

@WangTaoTheTonic That failure shouldn't be related to your PR. There was one PR that causes this (sorry I merged again without checking windows build carefully). It will be fixed by #11413

Also, +1 for @ericl's suggestion.

@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

@ericl @rkooo567 Direct polling heartbeat data in monitor.py would be fine if autoscaler didn't fetch them very often.
I'll simply the codes as suggested.

@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

WangTaoTheTonic commented Oct 19, 2020

@WangTaoTheTonic @rkooo567 at a higher level I'm wondering why we don't always call get_all_heartbeat() from the GCS (poll data) rather than subscribing to delta changes. This would considerably simplify the code, and since the monitor.py is only polling directly from the GCS every 10 seconds, the performance should be fine even in a very large cluster.

So basically from raylets -> GCS we have lightweight delta heartbeats at a fine granularity, but from GCS -> autoscaler we always poll for full snapshots of the data every 10 seconds.

@rkooo567 IIRC we earlier discussed switching to a poll-based approach for the autoscaler rather than attempting pub-sub.

Hey Eric! After a second check I found the poll data interval is not 10 seconds but a heartbeat interval(100ms) in monitor.py!

Please verify whether we use directly polling data to replace pub-sub to update or not if the interval is 100ms.

About the interval see codes segment below:

        # Handle messages from the subscription channels.
        while True:
            # Process autoscaling actions
            if self.autoscaler:
                # Only used to update the load metrics for the autoscaler.
                self.update_raylet_map()
                self.autoscaler.update()

            # Process a round of messages.
            self.process_messages()

            # Wait for a heartbeat interval before processing the next round of
            # messages.
            time.sleep(
                ray._config.raylet_heartbeat_timeout_milliseconds() * 1e-3)

@WangTaoTheTonic WangTaoTheTonic removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 19, 2020
@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

@ericl

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Oct 19, 2020

Ah, it's throttled to 10.0 seconds in autoscaler.py internally (see the _update function). Because of this, it should be fine (no change in behavior) to reduce the frequency of calling update to every 10 seconds in monitor.py.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 19, 2020
@WangTaoTheTonic WangTaoTheTonic removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 27, 2020
@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

WangTaoTheTonic commented Oct 27, 2020

@ericl @rkooo567
Placement group information has been moved into getAllHeartbeat from SendBatchedHeartbeatData. And all todo and test cases are fixed. :)

self.get_all_heartbeat()
# Initialize the subscription channel.
self.psubscribe(ray.gcs_utils.XRAY_HEARTBEAT_BATCH_PATTERN)
self.psubscribe(ray.gcs_utils.XRAY_JOB_PATTERN)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we should just remove the pubsub client entirely now that it isn't really used. It seems the job handler just prints a message.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR is quite large now, maybe file another to remove the pubsub client and job handler is better. I'll do it after this being merged.

Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WangTaoTheTonic this looks great, thanks for making the extra changes here to clean up the protocol. Just one comment on perhaps not resetting placement group load now that we are polling for it infrequently.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 27, 2020
batch->add_batch()->Swap(&heartbeat.second);
}

for (auto &demand : aggregate_load) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we take these logic out from the SendBatchedHeartbeat? I think it won't have any impact because the resource demand and placement group load is only used for the autoscaler. Please lmk if I am wrong @ericl.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nvm. It looks like it is already handled.

Copy link
Copy Markdown
Contributor Author

@WangTaoTheTonic WangTaoTheTonic Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice catch!
We have removed that before in this pr but it came back after merging the master. Only can we do is to remove it again :(

@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 28, 2020
@WangTaoTheTonic WangTaoTheTonic changed the title [GCS]Init load metric first before sub heartbeat batch, in light heartbeat… [GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py Oct 28, 2020
@WangTaoTheTonic
Copy link
Copy Markdown
Contributor Author

seems failed tests are not related.

@ericl ericl merged commit 1d5694d into ray-project:master Oct 28, 2020
@WangTaoTheTonic WangTaoTheTonic deleted the monitor branch October 29, 2020 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants