[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled) #2735
Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
We are unable to use Ray on Kubernetes because our workers are being OOM-killed in the wait-gcs-ready init-container. This results in an infinite backoff loop re-running the init-container, but it seems it will never succeed, so no workers ever become available.
For instance, `kubectl describe pod ourclustername-cpu-group-worker-2sbdj` reveals:
```
Init Containers:
  wait-gcs-ready:
    [...]
    Last State:   Terminated
      Reason:     OOMKilled
      Exit Code:  137
      Started:    Mon, 13 Jan 2025 12:17:25 +0100
      Finished:   Mon, 13 Jan 2025 12:18:07 +0100
    Ready:        False
    Restart Count: 4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
```
Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:
```go
corev1.ResourceMemory: resource.MustParse("256Mi"),
```
Our assumption is that the pod goes out-of-memory in this line of the script, which tries to invoke the ray CLI:
```shell
if ray health-check --address %s:%s > /dev/null 2>&1; then
```
To get a rough estimate of the memory usage of that call, one can run e.g.:

```shell
/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"
```

which reveals a resident set size of around 180 to 190 MB. Accounting for memory usage from the rest of the system, 256 Mi may simply not be enough.
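For reference, a rough stand-in for that measurement can also be sketched with the Python standard library alone (note that `/usr/bin/time -l` is the BSD/macOS flag; GNU time uses `-v`). The helper name below is hypothetical, and the placeholder command stands in for the `ray health-check` invocation:

```python
import resource
import subprocess
import sys

def peak_child_rss_bytes(cmd):
    # Run the command and read the accumulated peak RSS of waited-for
    # children from getrusage(RUSAGE_CHILDREN).
    subprocess.run(cmd, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale

if __name__ == "__main__":
    # Placeholder workload; in the issue's scenario this would be
    # ["ray", "health-check", "--address", "localhost:1234"].
    rss = peak_child_rss_bytes([sys.executable, "-c", "x = list(range(10**6))"])
    print(f"peak child RSS: {rss / 2**20:.1f} MiB")
```

One caveat of this sketch: `RUSAGE_CHILDREN` aggregates over all waited-for children, so it is best run in a fresh process.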
Reproduction script
It doesn't really matter, because it is a Kubernetes configuration problem.
But we are basically submitting a simple hello world for testing:
```python
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))
```

Anything else
How often does the problem occur?
Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:
- on our production cluster it is fortunately close to 0%;
- on our CI kind cluster it fails ~90% of the time;
- on some developer machines it fails ~100% of the time.
We do not yet understand why the different environments have such different failure rates.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!