[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled) #2735

@bluenote10

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We are unable to use Ray on Kubernetes, because our workers are crashing with out-of-memory errors in the wait-gcs-ready init-container. This results in an infinite backoff loop of re-running the init-container, which seems to never succeed, and therefore no workers ever become available.

Running kubectl describe pod ourclustername-cpu-group-worker-2sbdj, for instance, reveals:

Init Containers:
  wait-gcs-ready:
    [...]
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 13 Jan 2025 12:17:25 +0100
      Finished:     Mon, 13 Jan 2025 12:18:07 +0100
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi

Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:

corev1.ResourceMemory: resource.MustParse("256Mi"),
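If the hardcoded limit is indeed the culprit, a possible workaround (an untested sketch, not an official recipe: it assumes the KubeRay operator is started with `ENABLE_INIT_CONTAINER_INJECTION=false` so the default init container is not injected, and `HEAD_SVC` is a placeholder for the head service DNS name; verify both against your KubeRay version) is to declare the init container yourself with a larger memory limit in the worker group template:

```yaml
# Worker group excerpt from a RayCluster manifest (sketch).
workerGroupSpecs:
  - groupName: cpu-group
    template:
      spec:
        initContainers:
          - name: wait-gcs-ready
            image: rayproject/ray:2.9.0  # should match the cluster's Ray image
            command: ["/bin/sh", "-c"]
            args:
              # Same idea as the operator's default check, with a placeholder address
              - until ray health-check --address HEAD_SVC:6379 > /dev/null 2>&1; do echo "waiting for GCS"; sleep 5; done
            resources:
              limits:
                cpu: 200m
                memory: 512Mi  # raised from the operator's hardcoded 256Mi
              requests:
                cpu: 200m
                memory: 256Mi
```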

Our assumption is that the pod runs out of memory on this line of the init-container script, which invokes the ray CLI:

if ray health-check --address %s:%s > /dev/null 2>&1; then

To get a rough estimate of the memory usage of that call, one can check with e.g.:

/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"

which reveals resident set sizes of around 180 to 190 MB. Accounting for additional memory used by the system, 256Mi may simply not be enough.
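Note that the -l flag above is the BSD/macOS variant of /usr/bin/time; on GNU/Linux the equivalent is /usr/bin/time -v, which reports "Maximum resident set size" in kilobytes. The same measurement can also be done from Python's standard library; a small sketch (the helper peak_rss_mb is ours, not part of Ray):

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run *cmd* as a child process and return the peak resident set
    size reported by the OS for terminated children, in MiB."""
    subprocess.run(cmd, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is KiB on Linux, but bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor
```

For example, peak_rss_mb(["ray", "health-check", "--address", "localhost:1234"]) should give a number comparable to the /usr/bin/time output above.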

Reproduction script

It doesn't really matter here, because this is a Kubernetes configuration problem.

But we are basically submitting a simple hello world for testing:

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))

Anything else

How often does the problem occur?

Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:

  • on our production cluster it fails close to 0% of the time, fortunately.
  • on our CI kind cluster it fails ~90% of the time.
  • on some developer machines it fails ~100% of the time.

We do not yet understand why the different environments have such different failure rates.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

    Labels

    bug (Something isn't working), stability (Pertains to basic infrastructure stability)
