[rfc] ray health-check #15265
Description
Why
When running Ray on a k8s cluster, advanced users need a way to health check Ray and its components. In particular, we want to be able to health check individual components of the cluster, such as the Ray client server.
Ray almost exclusively uses gRPC for its transport layer, and k8s doesn't officially support a gRPC-based health check.
Proposed API
The proposed API is a command that k8s can run as a command-based (exec) liveness probe, using the command's exit code as the health signal.
ray health-check # Returns 0 if it can connect to GCS else 1
ray health-check --component=client_server # Returns 0 if the component's last heartbeat is sufficiently recent, else 1
By default, we assume there is one instance of Ray running on the head node (we can support --address=... and --port=... if necessary).
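To make the exit-code contract of the no-argument form concrete, here is a minimal Python sketch. It is not Ray's implementation: it assumes a plain TCP connect is an acceptable stand-in for a real GCS request (an actual check would more likely issue a gRPC call), and the names `gcs_reachable` and `health_check_exit_code`, as well as the default `127.0.0.1:6379` address, are hypothetical.

```python
import socket


def gcs_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Assumption: a TCP connect stands in for "can connect to GCS".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def health_check_exit_code(
    host: str = "127.0.0.1",  # hypothetical default: head node on localhost
    port: int = 6379,         # hypothetical default port for this sketch
    timeout_s: float = 2.0,
) -> int:
    """Map reachability to the proposed exit codes: 0 if healthy, else 1."""
    return 0 if gcs_reachable(host, port, timeout_s) else 1
```

A command-line wrapper would pass this value to `sys.exit`, so that k8s sees a zero exit status as healthy and anything else as a failed probe.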
Potential implementation
One potential implementation can rely on internal-kv.
ray health-check (no args) can simply try to connect to GCS KV Service, which is sufficient proof that GCS is alive.
ray health-check --component=client_server can read an internal-kv key, healthcheck:client_server, for status information. The Ray client server should periodically write a heartbeat to that key.
The state of internal-kv would look something like:
"healthcheck:client_server": "{'last_modified': 123455 # a unix timestamp}"
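The heartbeat scheme can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: a plain dict stands in for internal-kv, the value is stored as strict JSON (double-quoted keys, unlike the informal example above), and the function names and the 30-second `max_age_s` threshold are hypothetical.

```python
import json
import time
from typing import Dict, Optional

HEALTHCHECK_KEY = "healthcheck:client_server"


def put_heartbeat(kv: Dict[str, str], now: Optional[float] = None) -> None:
    """Writer side: the client server would call this periodically."""
    ts = time.time() if now is None else now
    kv[HEALTHCHECK_KEY] = json.dumps({"last_modified": ts})


def component_exit_code(
    kv: Dict[str, str],
    max_age_s: float = 30.0,  # hypothetical freshness threshold
    now: Optional[float] = None,
) -> int:
    """Reader side: 0 if the heartbeat is recent enough, else 1."""
    raw = kv.get(HEALTHCHECK_KEY)
    if raw is None:
        return 1  # no heartbeat has ever been written
    try:
        last_modified = float(json.loads(raw)["last_modified"])
    except (ValueError, KeyError, TypeError):
        return 1  # malformed value is treated as unhealthy
    current = time.time() if now is None else now
    return 0 if current - last_modified <= max_age_s else 1
```

The threshold should be comfortably larger than the heartbeat interval, so that one delayed write does not cause k8s to restart an otherwise healthy pod.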