[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled) #2735

@bluenote10

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

We are unable to use Ray on Kubernetes, because our workers are crashing with out-of-memory errors in the wait-gcs-ready init-container. This results in an infinite backoff loop of re-running the init-container, which seems to never succeed, and therefore no workers ever become available.

Running kubectl describe pod ourclustername-cpu-group-worker-2sbdj, for instance, reveals:

Init Containers:
  wait-gcs-ready:
    [...]
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 13 Jan 2025 12:17:25 +0100
      Finished:     Mon, 13 Jan 2025 12:18:07 +0100
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi

Note that the upper memory limit of 256 Mi is rather low, and seems to be coming from here:

corev1.ResourceMemory: resource.MustParse("256Mi"),
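If the hardcoded limit is indeed the culprit, a possible workaround (an untested sketch, not an official recipe: it assumes the KubeRay operator is started with `ENABLE_INIT_CONTAINER_INJECTION=false` so the default init container is not injected, and `HEAD_SVC` is a placeholder for the head service DNS name; verify both against your KubeRay version) is to declare the init container yourself with a larger memory limit in the worker group template:

```yaml
# Worker group excerpt from a RayCluster manifest (sketch).
workerGroupSpecs:
  - groupName: cpu-group
    template:
      spec:
        initContainers:
          - name: wait-gcs-ready
            image: rayproject/ray:2.9.0  # should match the cluster's Ray image
            command: ["/bin/sh", "-c"]
            args:
              # Same idea as the operator's default check, with a placeholder address
              - until ray health-check --address HEAD_SVC:6379 > /dev/null 2>&1; do echo "waiting for GCS"; sleep 5; done
            resources:
              limits:
                cpu: 200m
                memory: 512Mi  # raised from the operator's hardcoded 256Mi
              requests:
                cpu: 200m
                memory: 256Mi
```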

Our assumption is that the pod runs out of memory on this line of the init-container script, which invokes the ray CLI:

if ray health-check --address %s:%s > /dev/null 2>&1; then

To get a rough estimate of the memory usage of that call, one can check with e.g.:

/usr/bin/time -l ray health-check --address localhost:1234 2>&1 | grep "resident set size"

which reveals resident set sizes of around 180 to 190 MB. Accounting for additional memory used by the system, 256Mi may simply not be enough.
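Note that the -l flag above is the BSD/macOS variant of /usr/bin/time; on GNU/Linux the equivalent is /usr/bin/time -v, which reports "Maximum resident set size" in kilobytes. The same measurement can also be done from Python's standard library; a small sketch (the helper peak_rss_mb is ours, not part of Ray):

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run *cmd* as a child process and return the peak resident set
    size reported by the OS for terminated children, in MiB."""
    subprocess.run(cmd, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is KiB on Linux, but bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor
```

For example, peak_rss_mb(["ray", "health-check", "--address", "localhost:1234"]) should give a number comparable to the /usr/bin/time output above.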

Reproduction script

It doesn't really matter here, because this is a Kubernetes configuration problem.

But we are basically submitting a simple hello world for testing:

import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.cluster_resources())
print(ray.get(hello_world.remote()))

Anything else

How often does the problem occur?

Since the exact amount of allocated memory is non-deterministic, the error also happens non-deterministically for us. Depending on the environment, it seems to fail with different probabilities:

  • on our production cluster it fails close to 0% of the time, fortunately.
  • on our CI kind cluster it fails ~90% of the time.
  • on some developer machines it fails ~100% of the time.

We do not yet understand why the different environments have such different failure rates.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

    Labels

    bug (Something isn't working), stability (Pertains to basic infrastructure stability)
