[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py#11339
[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py#11339ericl merged 25 commits intoray-project:masterfrom
Conversation
|
Does |
Yes it returns all newest heartbeat information of all nodes. I'll add some comments. |
| redis_address, password=redis_password) | ||
| self.global_state_accessor = GlobalStateAccessor( | ||
| redis_address, redis_password, False) | ||
| self.global_state_accessor.connect() |
There was a problem hiding this comment.
Why don't we run _initialize_global_state?
There was a problem hiding this comment.
_initialize_global_state is a function in state.py and it init a global state accessor inside but not explode it.
What we need here is init a global state accessor and use it to get heartbeats from gcs, which is not included in state.py
|
|
|
@WangTaoTheTonic @rkooo567 at a higher level I'm wondering why we don't always call get_all_heartbeat() from the GCS (poll data) rather than subscribing to delta changes. This would considerably simplify the code, and since the monitor.py is only polling directly from the GCS every 10 seconds, the performance should be fine even in a very large cluster. So basically from @rkooo567 IIRC we earlier discussed switching to a poll-based approach for the autoscaler rather than attempting pub-sub. |
|
@WangTaoTheTonic That failure shouldn't be related to your PR. There was one PR that causes this (sorry I merged again without checking windows build carefully). It will be fixed by #11413 Also, +1 for @ericl's suggestion. |
Hey Eric! After a second check I found the poll data interval is not 10 seconds but a heartbeat interval(100ms) in Please verify whether we use directly polling data to replace pub-sub to update or not if the interval is 100ms. About the interval see codes segment below: |
|
Ah, it's throttled to 10.0 seconds in autoscaler.py internally (see the _update function). Because of this, it should be fine (no change in behavior) to reduce the frequency of calling update to every 10 seconds in monitor.py. |
…rease update interval
| self.get_all_heartbeat() | ||
| # Initialize the subscription channel. | ||
| self.psubscribe(ray.gcs_utils.XRAY_HEARTBEAT_BATCH_PATTERN) | ||
| self.psubscribe(ray.gcs_utils.XRAY_JOB_PATTERN) |
There was a problem hiding this comment.
Wondering if we should just remove the pubsub client entirely now that it isn't really used. It seems the job handler just prints a message.
There was a problem hiding this comment.
this PR is quite large now, maybe file another to remove the pubsub client and job handler is better. I'll do it after this being merged.
ericl
left a comment
There was a problem hiding this comment.
@WangTaoTheTonic this looks great, thanks for making the extra changes here to clean up the protocol. Just one comment on perhaps not resetting placement group load now that we are polling for it infrequently.
| batch->add_batch()->Swap(&heartbeat.second); | ||
| } | ||
|
|
||
| for (auto &demand : aggregate_load) { |
There was a problem hiding this comment.
Can we take these logic out from the SendBatchedHeartbeat? I think it won't have any impact because the resource demand and placement group load is only used for the autoscaler. Please lmk if I am wrong @ericl.
There was a problem hiding this comment.
Oh nvm. It looks like it is already handled.
There was a problem hiding this comment.
This is a nice catch!
We have removed that before in this pr but it came back after merging the master. Only can we do is to remove it again :(
|
seems failed tests are not related. |
… mode
Why are these changes needed?
In
monitor.pywe sub heartbeat batch to get resources of all nodes, feeding to load metric.while light heartbeat enabled, the heartbeat batch are broadcast partially only when the resources in them were changed.
We should get all heartbeat infos first once monitor sub this pattern, then use this whole resources info as a foundation for incoming update.
Related issue number
part of #10355
Checks
scripts/format.shto lint the changes in this PR.