
[Feat] RayJob: add TTL before Running state #4525

Merged
rueian merged 15 commits into ray-project:master from machichima:rayjob-ttl-before-running on Mar 8, 2026
Conversation

@machichima
Collaborator

@machichima machichima commented Feb 21, 2026

Why are these changes needed?

Currently there is no TTL before a RayJob reaches the "Running" state. A few cases can cause a RayJob to hang before "Running":

  1. Using interactive mode: [Feature] RayJob with Waiting status needs a ttl mechanism #4037
  2. Head pod failed to start: [Feature] RayJob with Initializing status needs a ttl mechanism #4178

This PR introduces a new field, ttlSecondsBeforeRunning, to configure a TTL for the pre-running steps.
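
For illustration, a minimal RayJob manifest using the new field might look like the following sketch. (The field was ultimately renamed to preRunningDeadlineSeconds during review; the name, value, and surrounding spec here are illustrative, not taken from the PR's sample YAML.)

```yaml
# Hypothetical sketch: mark the RayJob Failed if it has not reached
# the Running state within 300 seconds of initialization.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-with-pre-running-deadline
spec:
  submissionMode: InteractiveMode
  shutdownAfterJobFinishes: true
  preRunningDeadlineSeconds: 300
```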

Related issue number

Closes #4037
Closes #4178

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

Signed-off-by: machichima <nary12321@gmail.com>
// EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN!
// NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.

//nolint:govet // RayCronJobSpec defines the desired state of RayCronJob
Collaborator Author


This is removed by the pre-commit hook.

@machichima machichima marked this pull request as ready for review February 23, 2026 13:35
Contributor

@seanlaii seanlaii left a comment


Thank you!

// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no TTL. Value must be a positive integer.
// +optional
TTLSecondsBeforeRunning *int32 `json:"ttlSecondsBeforeRunning,omitempty"`
Contributor


Overall LGTM; just wondering whether StartupDeadlineSeconds or PreRunningDeadlineSeconds would be more consistent naming, since we use ActiveDeadlineSeconds for the active TTL.

Collaborator Author


Updated to PreRunningDeadlineSeconds in 90ab655

// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no TTL. Value must be a positive integer.
// +optional
TTLSecondsBeforeRunning *int32 `json:"ttlSecondsBeforeRunning,omitempty"`
Member


cc @rueian @andrewsykim to think about the naming, tks!

Member


Naming seems fine, but I am wondering if we should expand the DeletionStrategy API to cover this case instead of adding a new field: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L90-L190

Collaborator Author


If we use DeletionPolicy, we can only clean up the resources; there is no option to just fail the job. Also, the clean-up logic would appear in two different places, which makes it harder to maintain.

With the current method, we can fail the RayJob and let the Failed state handle the clean-up, which can be better as we can keep the clean-up logic in the same place.

WDYT?

Member


Good point. DeletionPolicy defines a TTL based on whether the job succeeded or failed, while this feature transitions jobs to FAILED, so it makes sense to keep them separate.

Member


On second thought, is it possible to just introduce a new OnInitializing or OnWaiting option to the DeletionPolicy API to add a TTL similar to this?

Collaborator Author


For me the better flow here is: reach the pre-running timeout -> transition to Failed -> clean up in the Failed state.

I think adding OnInitializing or OnWaiting to DeletionPolicy would run into the same issue, as DeletionPolicy is about how/when to clean up resources, not about job state transitions.

In this case, putting a "fail-on-timeout" behavior into DeletionPolicy would blur the API semantics, since DeletionPolicy should only handle deletion. Therefore, I'd prefer keeping PreRunningDeadlineSeconds separate from DeletionPolicy.

Member


Are you expecting we would support scenarios where someone sets ttlSecondsBeforeRunning but not shutdownAfterJobFinishes?

Collaborator Author


Yes! I would also like to keep the clean-up code in one place (all handled in the completed phase).

Member


Makes sense, thanks!

Signed-off-by: machichima <nary12321@gmail.com>
if shouldUpdate := checkPreRunningDeadlineAndUpdateStatusIfNeeded(ctx, rayJobInstance); shouldUpdate {
break
}



PreRunningDeadlineExceeded not excluded from retry logic

Medium Severity

The checkBackoffLimitAndUpdateStatusIfNeeded function prevents retries when the reason is DeadlineExceeded, but the newly introduced PreRunningDeadlineExceeded reason is not similarly excluded. When a RayJob fails with PreRunningDeadlineExceeded and a BackoffLimit is configured, the job will be retried — inconsistent with how DeadlineExceeded is handled. The retry would just recreate a RayCluster likely to hit the same deadline again, causing unnecessary resource churn.


Collaborator Author

@machichima machichima Feb 28, 2026


This is expected: PreRunningDeadlineExceeded is the deadline for the pre-running steps. When this deadline is exceeded, the cause may be a temporary issue (e.g. a connection problem, or resources not being available yet), so we should keep it retryable.

@OneSizeFitsQuorum

Looking forward to seeing this in v1.6.0 @Future-Outlier @machichima. Thanks a lot in advance!

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Member

@Future-Outlier Future-Outlier left a comment


tks!

@Future-Outlier Future-Outlier moved this from others to can be merged in @Future-Outlier's kuberay project Mar 6, 2026
Member

@win5923 win5923 left a comment


LGTM!

Comment on lines +286 to +292
// PreRunningDeadlineSeconds is the deadline in seconds for a RayJob to reach the Running state
// from when it is first initialized (StartTime). If the RayJob does not transition to
// Running within this time, it will be marked as Failed.
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no deadline. Value must be a positive integer.
// +optional
PreRunningDeadlineSeconds *int32 `json:"preRunningDeadlineSeconds,omitempty"`
Member


Suggested change (adds a +kubebuilder:validation:Minimum=1 marker):

// PreRunningDeadlineSeconds is the deadline in seconds for a RayJob to reach the Running state
// from when it is first initialized (StartTime). If the RayJob does not transition to
// Running within this time, it will be marked as Failed.
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no deadline. Value must be a positive integer.
// +kubebuilder:validation:Minimum=1
// +optional
PreRunningDeadlineSeconds *int32 `json:"preRunningDeadlineSeconds,omitempty"`

nit: we can add +kubebuilder:validation, so invalid values get rejected at admission time.
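
With such a marker in place, non-positive values are rejected by the API server before the controller ever sees them. As a sketch (assuming the Minimum=1 marker above is applied), a spec fragment like this would fail admission:

```yaml
# Rejected at admission once the Minimum=1 validation marker is in place:
spec:
  preRunningDeadlineSeconds: 0   # violates the minimum of 1
```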

Collaborator Author


Updated in 11ec6a3

Comment on lines +612 to +616
rayJobAC := rayv1ac.RayJob("ttl-waiting", namespace.Name).
WithSpec(rayv1ac.RayJobSpec().
WithSubmissionMode(rayv1.InteractiveMode).
WithShutdownAfterJobFinishes(true).
WithPreRunningDeadlineSeconds(30)) // larger value to reach Initializing state first
Member


Suggested change (adds WithRayClusterSpec so the job can reach the Initializing state):

rayJobAC := rayv1ac.RayJob("ttl-waiting", namespace.Name).
	WithSpec(rayv1ac.RayJobSpec().
		WithSubmissionMode(rayv1.InteractiveMode).
		WithRayClusterSpec(NewRayClusterSpec()).
		WithShutdownAfterJobFinishes(true).
		WithPreRunningDeadlineSeconds(30)) // larger value to reach Initializing state first

Member


Tested this locally, and it seems WithPreRunningDeadlineSeconds may need around 60 seconds to pass. Not sure the 30 seconds is sufficient in Buildkite.

Collaborator Author


Fixed and updated the timeout to 60 seconds in afda0ff

Signed-off-by: machichima <nary12321@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: machichima <nary12321@gmail.com>
@machichima
Collaborator Author

Added docs in ray-project/ray#61552

@rueian rueian merged commit 7092f76 into ray-project:master Mar 8, 2026
31 checks passed
@github-project-automation github-project-automation bot moved this from can be merged to Done in @Future-Outlier's kuberay project Mar 8, 2026
edoakes pushed a commit to ray-project/ray that referenced this pull request Mar 9, 2026
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
## Description

We introduced the new config field `preRunningDeadlineSeconds` in ray-project/kuberay#4525. This PR adds the related docs.

## Related issues

Related to ray-project/kuberay#4525

## Additional information

docs link:
-
https://anyscale-ray--61552.com.readthedocs.build/en/61552/cluster/kubernetes/getting-started/rayjob-quick-start.html#rayjob-configuration
-
https://anyscale-ray--61552.com.readthedocs.build/en/61552/cluster/kubernetes/user-guides/kubectl-plugin.html#submit-a-ray-job-without-a-yaml-file

---------

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
abrarsheikh pushed a commit to ray-project/ray that referenced this pull request Mar 11, 2026
hango880623 pushed a commit to hango880623/kuberay that referenced this pull request Mar 13, 2026
* feat: add TTLSecondsBeforeRunning field
* feat: check ttlSecondsBeforeRunning in init/wait state
* test: add init/wait TTL e2e
* build: make sync & api-docs
* Trigger CI
* fix: regen api docs
* refactor: rename to PreRunningDeadlineSeconds
* fix: set the reason to PreRunningDeadlineExceeded
* test: fix test
* build: make sync
* remove redundant code
* feat: add kubebuilder validation
* fix: add WithRayClusterSpec & update timeout to 60 sec
* docs: update rayjob sample and interactive mode YAML
* Trigger CI

---------

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 13, 2026