[Feature] Fail RayJob immediately on deterministic RayCluster init-container failures #4637
Description
Search before asking
- I have searched the issues and found no similar feature request.
Description
This request is not about timeout (activeDeadlineSeconds / preRunningDeadlineSeconds).
It is about failure-signal-based immediate failure for deterministic init errors.
This is also a potential bug: when an initContainer fails and restartPolicy is set to Never, the RayJob enters an infinite loop of deleting and recreating the head/worker Pods. The controller creates a new Pod on the next reconcile, immediately after the previous one fails, with no exponential backoff.
Minimal Reproducible Example:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-failfast-custom-init
spec:
  entrypoint: echo hello
  rayClusterSpec:
    rayVersion: "2.52.0"
    headGroupSpec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
          - name: intentional-fail-init
            image: busybox:1.36
            command: ["sh", "-c", "echo failing on purpose && exit 42"]
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0

kubectl get po -w
NAME READY STATUS RESTARTS AGE
rayjob-failfast-custom-init-k49xs-head-slt2v 0/1 Init:Error 0 1s
rayjob-failfast-custom-init-k49xs-head-slt2v 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-slt2v 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-slt2v 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Pending 0 0s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Pending 0 0s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Init:0/1 0 0s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Init:Error 0 1s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-gwhjh 0/1 Init:Error 0 2s
rayjob-failfast-custom-init-k49xs-head-v6zrf 0/1 Pending 0 0s
rayjob-failfast-custom-init-k49xs-head-v6zrf 0/1 Pending 0 0s
rayjob-failfast-custom-init-k49xs-head-v6zrf 0/1 Init:0/1 0 0s

Context
Current controls:
- activeDeadlineSeconds: whole RayJob lifecycle timeout
- preRunningDeadlineSeconds: pre-running timeout
These are time-based guardrails, but they do not fail fast when init has already deterministically failed.
Request
Please add an optional RayJob fail-fast switch, for example:
spec:
  failFastOnClusterInitContainerFailure: true

When enabled, during pre-running states (e.g. Initializing), if any user-defined init container in RayCluster Pods enters a fatal failure state, RayJob should transition to Failed immediately.
Suggested fatal signals
state.terminated.exitCode != 0
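The signal above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the KubeRay controller code. It assumes the Pod is given as a plain dict shaped like the Kubernetes Pod JSON (status.initContainerStatuses[].state.terminated.exitCode); the function name find_fatal_init_failure and the shape of the returned dict are hypothetical.

```python
from typing import Any, Dict, Optional


def find_fatal_init_failure(pod: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return details of the first init container whose current state is
    terminated with a non-zero exit code, or None if there is none.

    `pod` is assumed to be a dict shaped like a Kubernetes Pod object.
    """
    pod_name = pod.get("metadata", {}).get("name", "")
    statuses = pod.get("status", {}).get("initContainerStatuses") or []
    for status in statuses:
        # `state` holds exactly one of waiting / running / terminated.
        terminated = (status.get("state") or {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return {
                "pod": pod_name,
                "container": status.get("name", ""),
                "exitCode": terminated["exitCode"],
                "reason": terminated.get("reason", ""),
            }
    return None
```

The check only treats a non-zero exit as fatal in the scenario described in this issue, where restartPolicy is Never and the kubelet will not retry the init container on the same Pod.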
Expected result
- status.jobDeploymentStatus = Failed
- dedicated failure reason (e.g. ClusterInitContainerFailed)
- clear message with pod/container/reason
- warning event emitted for observability
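To make the expected result concrete, here is a small Python sketch of the status fields and the warning event that could be produced. Everything in it is illustrative: the helper names build_failed_status and build_warning_event are hypothetical, the reason ClusterInitContainerFailed is the example name proposed above, and the exact placement of reason and message in the RayJob status is an assumption.

```python
from typing import Dict


def _failure_message(pod: str, container: str, exit_code: int, reason: str) -> str:
    # Include pod, container, and reason so the failure is easy to trace.
    return (
        f"init container {container} in pod {pod} "
        f"terminated with exit code {exit_code} (reason: {reason})"
    )


def build_failed_status(pod: str, container: str, exit_code: int, reason: str) -> Dict[str, str]:
    """Sketch of the RayJob status fields after a fatal init failure."""
    return {
        "jobDeploymentStatus": "Failed",
        "reason": "ClusterInitContainerFailed",  # proposed name, not an existing constant
        "message": _failure_message(pod, container, exit_code, reason),
    }


def build_warning_event(pod: str, container: str, exit_code: int, reason: str) -> Dict[str, str]:
    """Sketch of a Kubernetes Warning event describing the same failure."""
    return {
        "type": "Warning",
        "reason": "ClusterInitContainerFailed",
        "message": _failure_message(pod, container, exit_code, reason),
    }
```

Sharing one message between the status and the event keeps `kubectl describe` output and the RayJob status consistent with each other.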
Why
For deterministic init failures, the orchestration layer should fail the RayJob as soon as the failure is observed, rather than waiting for a timeout.
Use case
We use custom init containers in RayJob-created RayCluster Pods to perform setup before RayJob starts, for example:
- downloading datasets from object storage / remote sources,
- preparing local cache / unpacking artifacts,
- other mandatory bootstrap steps.
These steps can fail deterministically (download/auth/network/config errors, script failures, etc.).
When that happens, we expect the corresponding RayJob to fail immediately, instead of waiting for a time-based deadline.
Related issues
Related, but not duplicates:
#4525
#4178
#4037
#2735
#2125
#988
Are you willing to submit a PR?
- Yes I am willing to submit a PR!