[GCS]Separate heartbeat and resource updating #11606

@WangTaoTheTonic

Describe your feature request

Currently the raylet periodically reports HEARTBEAT messages, which also carry the node's resources. The HEARTBEAT message therefore serves two purposes: node liveness probing and resource updating.

Mixing a simple heartbeat with resource reporting causes three issues:

  1. The system needs different frequencies for liveness probing and resource updating. Ray currently sets the reporting period to 100 ms, which suits resource updating but not liveness probing, since a node is only treated as dead after a 30-second timeout (300 * 100 ms). No single interval, large or small, suits both purposes.
  2. For code maintainability: a heartbeat message is handled in two places, the node manager and the node failure detector, which mixes resource-related operations with liveness probing.
  3. Node liveness probing is time-sensitive. It should reflect whether a node is really alive or dead; otherwise it is meaningless. For this reason it should be handled in a separate thread in the GCS server. Because the heartbeat message currently serves both liveness probing and resource updating, handling it forces us to jump between threads. Moreover, the heartbeat message passes through the node manager thread before being posted to the node failure detector thread, so the detector is not a truly separate handler.
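To make the timeout arithmetic in the first issue concrete, here is a minimal sketch of a missed-heartbeat failure detector. It is not Ray's implementation: `FailureDetector`, `max_missed`, and `timeout_ms` are hypothetical names used only for illustration, and time is simulated with explicit `tick()` calls.

```python
from dataclasses import dataclass, field


def timeout_ms(period_ms: int, max_missed: int) -> int:
    """Time until a silent node is declared dead."""
    return period_ms * max_missed


@dataclass
class FailureDetector:
    """Hypothetical detector: a node is dead after `max_missed` silent ticks."""
    max_missed: int
    # node_id -> number of detection ticks since the last heartbeat
    missed: dict = field(default_factory=dict)

    def on_heartbeat(self, node_id: str) -> None:
        # Any heartbeat resets the node's missed-tick counter.
        self.missed[node_id] = 0

    def tick(self) -> list:
        """Advance one detection period and return nodes that just timed out."""
        dead = []
        for node_id in list(self.missed):
            self.missed[node_id] += 1
            if self.missed[node_id] >= self.max_missed:
                dead.append(node_id)
                del self.missed[node_id]
        return dead


# Shared 100 ms period with 300 allowed misses, as described above.
shared_timeout = timeout_ms(100, 300)
# With a dedicated heartbeat period (assumed 1000 ms here), the same
# timeout needs far fewer messages, and resources can still go out every 100 ms.
dedicated_timeout = timeout_ms(1000, 30)
```

Both settings give the same 30-second timeout, but the dedicated heartbeat period sends a tenth of the liveness messages. The 1000 ms value is an assumption chosen for the example, not a proposed default.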

Based on this, we want to:

  1. Move the resource-related fields out of the HEARTBEAT message and into a new message, named something like ReportTimelyResourcesRequest/Response.
  2. Split the handling logic in the raylet: one tick sends the heartbeat message and another sends the timely resources (only when resources have changed, if light reporting is enabled). The two ticks use different interval settings.
  3. Separate the handlers for heartbeats and resource updates: add a new handler and a new gRPC service for heartbeats, and leave the resource updating logic in the GCS node manager.
  4. Use a separate thread for the heartbeat handler, which makes node failure detection independent of the GCS node manager.
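The raylet-side split in step 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual raylet code: `NodeReporter`, `advance`, and the interval parameters are hypothetical names, the clock is simulated, and "sending" just appends to a list.

```python
class NodeReporter:
    """Hypothetical raylet-side reporter with two independent ticks."""

    def __init__(self, heartbeat_interval_ms, resource_interval_ms,
                 light_reporting=True):
        self.heartbeat_interval_ms = heartbeat_interval_ms
        self.resource_interval_ms = resource_interval_ms
        self.light_reporting = light_reporting
        self.sent = []              # (time_ms, kind, payload) records
        self._last_reported = None  # last resource snapshot sent

    def advance(self, now_ms, resources):
        """Run whichever ticks are due at simulated time `now_ms`."""
        # Heartbeat tick: liveness only, no resource payload.
        if now_ms % self.heartbeat_interval_ms == 0:
            self.sent.append((now_ms, "heartbeat", None))
        # Resource tick: with light reporting, skip unchanged snapshots.
        if now_ms % self.resource_interval_ms == 0:
            changed = resources != self._last_reported
            if changed or not self.light_reporting:
                self._last_reported = dict(resources)
                self.sent.append((now_ms, "resources", dict(resources)))


def simulate(light_reporting):
    # Assumed intervals: heartbeat every 1000 ms, resources every 100 ms.
    reporter = NodeReporter(1000, 100, light_reporting=light_reporting)
    for t in range(0, 2001, 100):
        # Resources change once, at t = 500 ms.
        resources = {"CPU": 4} if t < 500 else {"CPU": 2}
        reporter.advance(t, resources)
    return reporter.sent


light = simulate(light_reporting=True)
full = simulate(light_reporting=False)
```

In this simulated two-second window, light reporting sends a resource message only for the initial snapshot and the one change, while the heartbeat keeps its own, slower cadence regardless of resource activity.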

This could:
a. reduce false detections of node failure.
b. reduce the CPU load on the GCS if the heartbeat interval is set to a large value and resources on a node do not change frequently (which should be the normal scenario).

Labels: P1 (issue that should be fixed within a few weeks), enhancement (request for new feature and/or capability)