[Feat] RayJob: add TTL before Running state #4525
Conversation
Signed-off-by: machichima <nary12321@gmail.com>
// EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN!
// NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.

//nolint:govet // RayCronJobSpec defines the desired state of RayCronJob
This is removed by the pre-commit
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no TTL. Value must be a positive integer.
// +optional
TTLSecondsBeforeRunning *int32 `json:"ttlSecondsBeforeRunning,omitempty"`
Overall LGTM. Just wondering whether StartupDeadlineSeconds or PreRunningDeadlineSeconds would be more consistent naming, since we use ActiveDeadlineSeconds for the active TTL.
Updated to PreRunningDeadlineSeconds in 90ab655
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no TTL. Value must be a positive integer.
// +optional
TTLSecondsBeforeRunning *int32 `json:"ttlSecondsBeforeRunning,omitempty"`
cc @rueian @andrewsykim to think about the naming, tks!
Naming seems fine. But I am wondering if we should expand the DeletionStrategy API to cover this case instead of adding a new field: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/ray/v1/rayjob_types.go#L90-L190
If we use DeletionPolicy, we can only clean up the resource; there is no option to just fail the job. Also, the clean-up logic would then live in two different places, which can make it harder to maintain.
With the current method, we can fail the RayJob and let the Failed state handle the clean-up, which keeps the clean-up logic in one place.
WDYT?
Good point. DeletionPolicy is for defining a TTL based on whether the job succeeded or failed, while this feature transitions jobs to FAILED, so it makes sense to keep them separate.
On second thought, is it possible to just introduce a new OnInitializing or OnWaiting to the DeletionPolicy API to add a TTL similar to this?
For me the better process here is: reach the pre-running timeout -> transition to Failed -> clean up in the Failed state.
I think adding OnInitializing or OnWaiting to DeletionPolicy would run into the same issue, as DeletionPolicy is about how/when to clean up resources, not about job state transitions.
In this case, if we put a "fail-on-timeout" behavior into DeletionPolicy, it would blur the API semantics, since DeletionPolicy should only handle deletion. Therefore I'd prefer keeping PreRunningDeadlineSeconds separate from DeletionPolicy.
Are you expecting we would support scenarios where someone sets ttlSecondsBeforeRunning but not shutdownAfterJobFinishes?
Yes! I would also want to keep the clean-up code in one place (all handled in the completed phase).
if shouldUpdate := checkPreRunningDeadlineAndUpdateStatusIfNeeded(ctx, rayJobInstance); shouldUpdate {
	break
}
PreRunningDeadlineExceeded not excluded from retry logic
Medium Severity
The checkBackoffLimitAndUpdateStatusIfNeeded function prevents retries when the reason is DeadlineExceeded, but the newly introduced PreRunningDeadlineExceeded reason is not similarly excluded. When a RayJob fails with PreRunningDeadlineExceeded and a BackoffLimit is configured, the job will be retried — inconsistent with how DeadlineExceeded is handled. The retry would just recreate a RayCluster likely to hit the same deadline again, causing unnecessary resource churn.
It's expected, PreRunningDeadlineExceeded is the deadline for pre-running step. When this pre-running deadline exceed, it may be caused by some temporarily issues (e.g. connection issue, resource not available now). We should make it retryable.
Looking forward to seeing this commit in v1.6.0 @Future-Outlier @machichima. Thanks a lot in advance!
Signed-off-by: Future-Outlier <eric901201@gmail.com>
// PreRunningDeadlineSeconds is the deadline in seconds for a RayJob to reach the Running state
// from when it is first initialized (StartTime). If the RayJob does not transition to
// Running within this time, it will be marked as Failed.
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no deadline. Value must be a positive integer.
// +optional
PreRunningDeadlineSeconds *int32 `json:"preRunningDeadlineSeconds,omitempty"`
Suggested change (add a kubebuilder validation marker):

// PreRunningDeadlineSeconds is the deadline in seconds for a RayJob to reach the Running state
// from when it is first initialized (StartTime). If the RayJob does not transition to
// Running within this time, it will be marked as Failed.
// This is useful for cleaning up jobs stuck in Initializing or Waiting states.
// If not set, there is no deadline. Value must be a positive integer.
// +kubebuilder:validation:Minimum=1
// +optional
PreRunningDeadlineSeconds *int32 `json:"preRunningDeadlineSeconds,omitempty"`
nit: we can add +kubebuilder:validation, so invalid values get rejected at admission time.
rayJobAC := rayv1ac.RayJob("ttl-waiting", namespace.Name).
	WithSpec(rayv1ac.RayJobSpec().
		WithSubmissionMode(rayv1.InteractiveMode).
		WithShutdownAfterJobFinishes(true).
		WithPreRunningDeadlineSeconds(30)) // larger value to reach Initializing state first
Suggested change (add a RayCluster spec):

rayJobAC := rayv1ac.RayJob("ttl-waiting", namespace.Name).
	WithSpec(rayv1ac.RayJobSpec().
		WithSubmissionMode(rayv1.InteractiveMode).
		WithRayClusterSpec(NewRayClusterSpec()).
		WithShutdownAfterJobFinishes(true).
		WithPreRunningDeadlineSeconds(30)) // larger value to reach Initializing state first
I tested this locally, and it seems WithPreRunningDeadlineSeconds may require around 60 seconds to pass. I'm not sure whether 30 seconds is sufficient in Buildkite.
Fixed and updated the timeout to 60 seconds in afda0ff.
Added docs in ray-project/ray#61552
## Description
We introduce a new config field `preRunningDeadlineSeconds` in ray-project/kuberay#4525. This PR adds the related docs.

## Related issues
Related to ray-project/kuberay#4525

## Additional information
Docs links:
- https://anyscale-ray--61552.com.readthedocs.build/en/61552/cluster/kubernetes/getting-started/rayjob-quick-start.html#rayjob-configuration
- https://anyscale-ray--61552.com.readthedocs.build/en/61552/cluster/kubernetes/user-guides/kubectl-plugin.html#submit-a-ray-job-without-a-yaml-file

Signed-off-by: machichima <nary12321@gmail.com>
* feat: add TTLSecondsBeforeRunning field
* feat: check ttlSecondsBeforeRunning in init/wait state
* test: add init/wait TTL e2e
* build: make sync & api-docs
* Trigger CI
* fix: regen api docs
* refactor: rename to PreRunningDeadlineSeconds
* fix: set the reason to PreRunningDeadlineExceeded
* test: fix test
* build: make sync
* remove redundant code
* feat: add kubebuilder validation
* fix: add WithRayClusterSpec & update timeout to 60 sec
* docs: update rayjob sample and interactive mode YAML
* Trigger CI

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
Currently there is no TTL before reaching the "Running" state. There are a few cases that can cause a RayJob to hang before "Running":
- RayJob with Waiting status needs a TTL mechanism (#4037)

This PR introduces a new field `ttlSecondsBeforeRunning` to configure the TTL for pre-running steps.

Related issue number
Closes #4037
Closes #4178
Checks