
[Feature] Fail RayJob immediately on deterministic RayCluster init-container failures #4637

@zrss


Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

This request is not about timeout (activeDeadlineSeconds / preRunningDeadlineSeconds).
It is about failure-signal-based immediate failure for deterministic init errors.

Incidentally, this also exposes a potential bug: when restartPolicy is set to Never and an init container fails, the RayJob enters an infinite delete-and-recreate loop for the Head/Worker Pods. The controller triggers a new Pod creation immediately (on the next reconcile) after the previous one fails, with no exponential backoff.

Minimal Reproducible Example:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-failfast-custom-init
spec:
  entrypoint: echo hello
  rayClusterSpec:
    rayVersion: "2.52.0"
    headGroupSpec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: intentional-fail-init
              image: busybox:1.36
              command: ["sh", "-c", "echo failing on purpose && exit 42"]
          containers:
            - name: ray-head
              image: rayproject/ray:2.52.0

kubectl get po -w

NAME                                           READY   STATUS       RESTARTS       AGE
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              1s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:0/1     0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              1s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Init:0/1     0              0s

Context

Current controls:

  • activeDeadlineSeconds: whole RayJob lifecycle timeout
  • preRunningDeadlineSeconds: pre-running timeout

These are time-based guardrails, but neither fails fast once an init container has already failed deterministically.
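For reference, a minimal sketch of how these existing deadlines are set today (values are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-with-deadlines
spec:
  entrypoint: echo hello
  # Fail the whole RayJob if it has not finished within 30 minutes.
  activeDeadlineSeconds: 1800
  # Fail the RayJob if it has not reached Running within 5 minutes.
  preRunningDeadlineSeconds: 300
```

Even with both set, a deterministic init failure still burns the full deadline before the RayJob is marked Failed.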

Request

Please add an optional RayJob fail-fast switch, for example:

spec:
  failFastOnClusterInitContainerFailure: true

When enabled, during pre-running states (e.g. Initializing), if any user-defined init container in RayCluster Pods enters a fatal failure state, RayJob should transition to Failed immediately.

Suggested fatal signals

  • state.terminated.exitCode != 0
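A minimal Go sketch of the proposed check. The struct below is a simplified, hypothetical mirror of `corev1.ContainerStatus` / `ContainerStateTerminated`, not KubeRay's actual types; a real implementation would read Pod `status.initContainerStatuses` and should also account for the Pod's `restartPolicy`, since with `Always`/`OnFailure` the kubelet retries failed init containers.

```go
package main

import "fmt"

// terminatedState is a hypothetical, simplified mirror of
// corev1.ContainerStateTerminated.
type terminatedState struct {
	ExitCode int32
}

// initContainerStatus is a hypothetical, simplified mirror of
// corev1.ContainerStatus for an init container.
type initContainerStatus struct {
	Name       string
	Terminated *terminatedState // nil while the container is waiting or running
}

// isFatalInitFailure reports whether an init container has terminated with a
// non-zero exit code — the deterministic failure signal suggested above.
func isFatalInitFailure(s initContainerStatus) bool {
	return s.Terminated != nil && s.Terminated.ExitCode != 0
}

func main() {
	failed := initContainerStatus{
		Name:       "intentional-fail-init",
		Terminated: &terminatedState{ExitCode: 42},
	}
	running := initContainerStatus{Name: "still-running"}
	fmt.Println(isFatalInitFailure(failed), isFatalInitFailure(running))
}
```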

Expected result

  • status.jobDeploymentStatus = Failed
  • dedicated failure reason (e.g. ClusterInitContainerFailed)
  • clear message with pod/container/reason
  • warning event emitted for observability
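Under this proposal, the resulting RayJob status might look like the following. The `reason` and `message` values are the examples suggested above, not an existing API:

```yaml
status:
  jobDeploymentStatus: Failed
  reason: ClusterInitContainerFailed   # proposed dedicated failure reason
  message: >-
    Init container "intentional-fail-init" in pod
    "rayjob-failfast-custom-init-k49xs-head-slt2v" terminated with exit code 42
```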

Why

For deterministic init failures, RayJob should fail immediately from orchestration semantics, rather than waiting for a timeout.

Use case

We use custom init containers in RayJob-created RayCluster Pods to perform setup before RayJob starts, for example:

  • downloading datasets from object storage / remote sources,
  • preparing local caches / unpacking artifacts,
  • other mandatory bootstrap steps.

These steps can fail deterministically (download/auth/network/config errors, script failures, etc.). When that happens, we expect the corresponding RayJob to fail immediately instead of waiting for a time-based deadline.

Related issues

These are related but not duplicates:

#4525
#4178
#4037
#2735
#2125
#988

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
