
[Feature] Fail RayJob immediately on deterministic RayCluster init-container failures #4637

@zrss


Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

This request is not about timeout (activeDeadlineSeconds / preRunningDeadlineSeconds).
It is about failure-signal-based immediate failure for deterministic init errors.

Incidentally, this also exposes a potential bug: when restartPolicy is set to Never and an init container fails, the RayJob enters an infinite delete-and-recreate loop for the Head/Worker Pods. The controller triggers a new Pod creation immediately (on the next reconcile) after the previous one fails, with no exponential backoff.

Minimal Reproducible Example:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-failfast-custom-init
spec:
  entrypoint: echo hello
  rayClusterSpec:
    rayVersion: "2.52.0"
    headGroupSpec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: intentional-fail-init
              image: busybox:1.36
              command: ["sh", "-c", "echo failing on purpose && exit 42"]
          containers:
            - name: ray-head
              image: rayproject/ray:2.52.0

kubectl get po -w

NAME                                           READY   STATUS       RESTARTS       AGE
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              1s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-slt2v   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:0/1     0              0s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              1s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-gwhjh   0/1     Init:Error   0              2s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Pending      0              0s
rayjob-failfast-custom-init-k49xs-head-v6zrf   0/1     Init:0/1     0              0s

Context

Current controls:

  • activeDeadlineSeconds: whole RayJob lifecycle timeout
  • preRunningDeadlineSeconds: pre-running timeout

These are time-based guardrails, but neither fails fast once an init container has already failed deterministically.
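For reference, a minimal sketch of how these existing deadlines are set today (values are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-with-deadlines
spec:
  entrypoint: echo hello
  # Fail the whole RayJob if it has not finished within 30 minutes.
  activeDeadlineSeconds: 1800
  # Fail the RayJob if it has not reached Running within 5 minutes.
  preRunningDeadlineSeconds: 300
```

Even with both set, a deterministic init failure still burns the full deadline before the RayJob is marked Failed.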

Request

Please add an optional RayJob fail-fast switch, for example:

spec:
  failFastOnClusterInitContainerFailure: true

When enabled, during pre-running states (e.g. Initializing), if any user-defined init container in RayCluster Pods enters a fatal failure state, RayJob should transition to Failed immediately.

Suggested fatal signals

  • state.terminated.exitCode != 0
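A minimal Go sketch of the proposed check. The struct below is a simplified, hypothetical mirror of `corev1.ContainerStatus` / `ContainerStateTerminated`, not KubeRay's actual types; a real implementation would read Pod `status.initContainerStatuses` and should also account for the Pod's `restartPolicy`, since with `Always`/`OnFailure` the kubelet retries failed init containers.

```go
package main

import "fmt"

// terminatedState is a hypothetical, simplified mirror of
// corev1.ContainerStateTerminated.
type terminatedState struct {
	ExitCode int32
}

// initContainerStatus is a hypothetical, simplified mirror of
// corev1.ContainerStatus for an init container.
type initContainerStatus struct {
	Name       string
	Terminated *terminatedState // nil while the container is waiting or running
}

// isFatalInitFailure reports whether an init container has terminated with a
// non-zero exit code — the deterministic failure signal suggested above.
func isFatalInitFailure(s initContainerStatus) bool {
	return s.Terminated != nil && s.Terminated.ExitCode != 0
}

func main() {
	failed := initContainerStatus{
		Name:       "intentional-fail-init",
		Terminated: &terminatedState{ExitCode: 42},
	}
	running := initContainerStatus{Name: "still-running"}
	fmt.Println(isFatalInitFailure(failed), isFatalInitFailure(running))
}
```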

Expected result

  • status.jobDeploymentStatus = Failed
  • dedicated failure reason (e.g. ClusterInitContainerFailed)
  • clear message with pod/container/reason
  • warning event emitted for observability
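Under this proposal, the resulting RayJob status might look like the following. The `reason` and `message` values are the examples suggested above, not an existing API:

```yaml
status:
  jobDeploymentStatus: Failed
  reason: ClusterInitContainerFailed   # proposed dedicated failure reason
  message: >-
    Init container "intentional-fail-init" in pod
    "rayjob-failfast-custom-init-k49xs-head-slt2v" terminated with exit code 42
```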

Why

For deterministic init failures, RayJob should fail immediately from orchestration semantics, rather than waiting for a timeout.

Use case

We use custom init containers in RayJob-created RayCluster Pods to perform setup before RayJob starts, for example:

  • downloading datasets from object storage / remote sources,
  • preparing local caches / unpacking artifacts,
  • other mandatory bootstrap steps.

These steps can fail deterministically (download/auth/network/config errors, script failures, etc.). When that happens, we expect the corresponding RayJob to fail immediately instead of waiting for a time-based deadline.

Related issues

These are related but not duplicates:

#4525
#4178
#4037
#2735
#2125
#988

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
