[rfc] ray health-check  #15265

@wuisawesome

Why

When running Ray on a k8s cluster, advanced users need a way to health-check Ray and its components. In particular, we want to be able to health-check individual components of the cluster, such as the Ray client server.

Ray almost exclusively uses gRPC for its transport layer, and k8s doesn't officially support a gRPC-based health check.

Proposed API

The proposed API is a command that k8s can run as a command-based liveness check, using the exit code to signal health.

ray health-check # Returns 0 if it can connect to GCS else 1

ray health-check --component=client_server # Returns 0 if the component's heartbeat is sufficiently recent, else 1

By default, we assume there is one instance of Ray running on the head node (we can support --address=... and --port=... if necessary).

Potential implementation

One potential implementation can rely on internal-kv.

ray health-check (no args) can simply try to connect to the GCS KV service; a successful connection is sufficient proof that GCS is alive.

ray health-check --component=client_server can read the internal-kv key healthcheck:client_server for status information. The Ray client server should periodically write a heartbeat to that key in internal-kv.
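As a rough illustration of the exit-code convention for the no-argument case, here is a minimal sketch. It uses a plain TCP connection attempt as a stand-in for connecting to the GCS KV service; the real command would presumably make a gRPC call instead. The function name `check_reachable` and the `timeout_s` parameter are hypothetical and not part of any Ray API.

```python
import socket


def check_reachable(host: str, port: int, timeout_s: float = 2.0) -> int:
    """Return 0 if a TCP connection to (host, port) succeeds, else 1.

    Assumption: a bare TCP connect stands in for the real GCS KV
    connection, which this sketch does not attempt to reproduce.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return 0
    except OSError:
        return 1
```

A CLI wrapper could pass the returned value to `sys.exit`, so that k8s sees exit code 0 as healthy and anything else as unhealthy.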

The state of internal-kv would look something like:

"healthcheck:client_server": "{'last_modified': 123455}"  # last_modified is a Unix timestamp
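The heartbeat scheme can be sketched as follows. This is only an illustration under stated assumptions: a plain dict stands in for internal-kv, the value is encoded as JSON rather than the repr-style string shown above, and the names `put_heartbeat`, `check_component`, and `max_age_s` (with its 30-second default) are hypothetical choices, not something the proposal specifies.

```python
import json
import time
from typing import Dict, Optional

HEALTHCHECK_KEY = "healthcheck:client_server"


def put_heartbeat(kv: Dict[str, str], now: Optional[float] = None) -> None:
    """Record the current Unix timestamp under the health-check key.

    Assumption: `kv` is a dict standing in for internal-kv.
    """
    ts = time.time() if now is None else now
    kv[HEALTHCHECK_KEY] = json.dumps({"last_modified": ts})


def check_component(
    kv: Dict[str, str], max_age_s: float = 30.0, now: Optional[float] = None
) -> int:
    """Return 0 if the heartbeat is at most max_age_s old, else 1."""
    raw = kv.get(HEALTHCHECK_KEY)
    if raw is None:
        return 1  # no heartbeat has ever been written
    try:
        last_modified = float(json.loads(raw)["last_modified"])
    except (ValueError, KeyError, TypeError):
        return 1  # malformed entry is treated as unhealthy
    current = time.time() if now is None else now
    return 0 if current - last_modified <= max_age_s else 1
```

In this sketch the client server would call `put_heartbeat` on a timer, and the health-check command would call `check_component` and exit with its return value. The staleness threshold should be comfortably larger than the heartbeat interval so that a single delayed write does not fail the probe.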
