[GCS]Separate heartbeat and resource updating

### Describe your feature request

Currently the raylet periodically reports HEARTBEAT messages, which also carry the node's resources. The HEARTBEAT message therefore serves two purposes: node liveness probing and resource updating.

Mixing a simple heartbeat with resource reporting causes 3 issues:
1. The system needs different frequencies for liveness probing and resource updating. Ray currently sets the reporting period to 100 ms, which is appropriate for resource updating but not for liveness probing, since a node is only treated as dead after a 30-second timeout (300 * 100 ms). No single interval, large or small, suits both.
2. For code maintainability: a heartbeat message is handled in two places, the node manager and the node failure detector, which mixes resource-related operations with liveness probing.
3. Node liveness probing is time sensitive. It should reflect whether a node is really alive or dead, otherwise it makes no sense. For this reason it should be handled in a separate thread in the GCS server. Because the heartbeat message is used not only for liveness probing but also for updating resources, handling it forces us to jump between threads. In addition, the heartbeat message passes through the node manager thread before being posted to the node failure detector thread, so the detector is not a truly separate handler.
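
To make the timeout arithmetic in issue 1 concrete, here is a minimal Python sketch of a countdown-style failure detector. It is an illustration only, not Ray's actual implementation: `FailureDetector`, `detection_timeout_s` and the constant names are hypothetical, and only the 100 ms period and the factor of 300 come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEARTBEAT_PERIOD_MS = 100      # reporting period cited in this issue
NUM_HEARTBEATS_TIMEOUT = 300   # missed periods before a node is treated as dead


def detection_timeout_s(period_ms: int, missed_periods: int) -> float:
    """Worst-case time before a silent node is declared dead, in seconds."""
    return period_ms * missed_periods / 1000.0


@dataclass
class FailureDetector:
    """Hypothetical stand-in for a node failure detector (not Ray's class)."""
    missed_limit: int
    remaining: Dict[str, int] = field(default_factory=dict)

    def on_heartbeat(self, node_id: str) -> None:
        # Any heartbeat resets the countdown for that node.
        self.remaining[node_id] = self.missed_limit

    def tick(self) -> List[str]:
        # Called once per period; returns the nodes whose countdown expired.
        dead = []
        for node_id in list(self.remaining):
            self.remaining[node_id] -= 1
            if self.remaining[node_id] <= 0:
                dead.append(node_id)
                del self.remaining[node_id]
        return dead
```

With the values above, `detection_timeout_s(HEARTBEAT_PERIOD_MS, NUM_HEARTBEATS_TIMEOUT)` gives the 30-second timeout. Because the period is shared with resource reporting, changing it to suit one purpose also changes the other.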

Based on this, we want to:
1) Move the resource-related fields out of the HEARTBEAT message into a new message named something like `ReportTimelyResourcesRequest/Response`.
2) Split the handling logic in the raylet: one tick sends the heartbeat message and another sends the timely resources (only when resources have changed, if light reporting is enabled). The two ticks use different interval settings.
3) Separate the handlers for heartbeat and resource updating. Add a new handler and a new gRPC service for heartbeat, and leave the resource updating logic in the GCS node manager.
4) Use a separate thread for the heartbeat handler, which makes node failure detection independent of the GCS node manager.
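
The raylet-side split in step 2 could look roughly like the Python sketch below. It is a simulation under stated assumptions, not the proposed implementation: `SplitReporter`, `on_tick` and the interval values are hypothetical, time is passed in explicitly instead of using real timers, and the string `"ReportTimelyResources"` merely stands in for the new message named above.

```python
from typing import Callable, Dict, List, Optional, Tuple


class SplitReporter:
    """Hypothetical raylet-side reporter with two independent ticks."""

    def __init__(
        self,
        heartbeat_interval_ms: int,
        resource_interval_ms: int,
        get_resources: Callable[[], Dict[str, float]],
        light_reporting: bool = True,
    ) -> None:
        self.heartbeat_interval_ms = heartbeat_interval_ms
        self.resource_interval_ms = resource_interval_ms
        self.get_resources = get_resources
        self.light_reporting = light_reporting
        self._last_sent: Optional[Dict[str, float]] = None
        # Log of (time_ms, message kind), standing in for real RPCs.
        self.sent: List[Tuple[int, str]] = []

    def on_tick(self, now_ms: int) -> None:
        # Heartbeat tick: carries no resource fields at all.
        if now_ms % self.heartbeat_interval_ms == 0:
            self.sent.append((now_ms, "Heartbeat"))
        # Resource tick: runs on its own interval.
        if now_ms % self.resource_interval_ms == 0:
            current = self.get_resources()
            changed = current != self._last_sent
            # With light reporting, unchanged resources are not resent.
            if changed or not self.light_reporting:
                self.sent.append((now_ms, "ReportTimelyResources"))
                self._last_sent = dict(current)


# Usage: 1 s heartbeats, 100 ms resource checks, one change at t = 500 ms.
resources = {"CPU": 4.0}
reporter = SplitReporter(1000, 100, lambda: resources)
for t in range(0, 2000, 100):
    if t == 500:
        resources["CPU"] = 2.0
    reporter.on_tick(t)
```

In this run the reporter sends two heartbeats (at 0 and 1000 ms) and two resource reports (the initial one and the one after the change), instead of twenty combined messages.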

This could:
a. reduce false node failure detections.
b. reduce the CPU load of the GCS if the heartbeat interval is set to a large value and the resources on a node don't change frequently (which should be the normal scenario).
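
As a rough illustration of point b, the sketch below compares message rates arriving at the GCS before and after the split. The function names, the node count, the 1000 ms heartbeat interval and the resource change rate are all assumed values for illustration; only the 100 ms combined period comes from this issue, and nothing here is a measurement.

```python
def combined_rate(num_nodes: int, period_ms: int) -> float:
    """Messages/s at the GCS when every tick carries heartbeat plus resources."""
    return num_nodes * 1000 / period_ms


def split_rate(
    num_nodes: int,
    heartbeat_interval_ms: int,
    resource_changes_per_s: float,
) -> float:
    """Messages/s at the GCS with separate heartbeats and light resource reporting.

    Assumes each resource change produces one report, i.e. changes are rarer
    than the resource tick itself.
    """
    return num_nodes * (1000 / heartbeat_interval_ms + resource_changes_per_s)


# Illustrative numbers only: 100 nodes, resources changing every 2 s on average.
before = combined_rate(num_nodes=100, period_ms=100)
after = split_rate(num_nodes=100, heartbeat_interval_ms=1000,
                   resource_changes_per_s=0.5)
```

Under these assumptions the GCS goes from handling 1000 messages per second to 150. The actual gain depends on how often resources really change.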

Actions
- [x] Move resources related operations out of node failure detector. https://github.com/ray-project/ray/pull/11465
- [ ] Split HEARTBEAT messages into heartbeat and resource reporting
- [ ] Add a separate handler for heartbeat.

[GCS]Separate heartbeat and resource updating #11606

Describe your feature request

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[GCS]Separate heartbeat and resource updating #11606

Description

Describe your feature request

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions