[Feature] Add default init container in workers to wait for GCS to be ready #973
Merged
kevin85421 merged 2 commits into ray-project:master on Mar 20, 2023
Member
Author
gvspraveen
approved these changes
Mar 17, 2023
5222472 to
139bc5b
Compare
DmitriGekhtman
approved these changes
Mar 18, 2023
Collaborator
DmitriGekhtman
left a comment
Looks good.
There's a slight concern some users may require additional configuration (security policy, etc.) copied over to the init container.
I wonder if we should copy the entire container spec and replace the entry point -- that could have unforeseen consequences for some users, though.
( OSS is hard :) )
Member
|
Examined the implementation of the btw, looks like |
Member
Author
This was referenced Apr 5, 2023
lowang-bh
pushed a commit
to lowang-bh/kuberay
that referenced
this pull request
Sep 24, 2023
… ready (ray-project#973) Add default init container in workers to wait for GCS to be ready
DavidAdaRH
pushed a commit
to DavidAdaRH/kuberay
that referenced
this pull request
Mar 26, 2026
…r digest to bfbbc56 (ray-project#973) Signed-off-by: konflux-internal-p02 <170854209+konflux-internal-p02[bot]@users.noreply.github.com> Co-authored-by: konflux-internal-p02[bot] <170854209+konflux-internal-p02[bot]@users.noreply.github.com>


Why are these changes needed?
Currently, the init container logic is wrong: it waits for the head service rather than the GCS server. The head service becomes ready as soon as the image pull finishes, which can be well before GCS is up. The only retry logic for reaching GCS is the one implemented internally by Ray.
kuberay/ray-operator/config/samples/ray-cluster.complete.yaml
Lines 124 to 129 in 71e260f
For example, add command: ["sleep 180"] to the headGroupSpec. Then, the head Pod command will be: sleep 180 && ulimit -n 65536; ray start ...
To clarify, the GCS server then requires a minimum of 120 seconds to become ready after the head service is ready. This exceeds the timeout of Ray's internal retry mechanism, so the worker fails.
In this PR, we add a default init container that uses ray health-check to check the status of GCS, which prevents this issue. In addition, each init container must complete successfully before the next one starts. Hence, it is fine for us to have two init containers at this moment to keep backward compatibility. We will remove the original init container from the sample YAML files after release 0.5.0.
Related issue number
Closes #476
Checks
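For reviewers who want to reason about the behaviour described under "Why are these changes needed?", the following is a minimal Python sketch of the polling idea: keep running ray health-check against the head's GCS address until it succeeds, instead of only waiting for the head service to resolve. It is not the operator's implementation, which injects an init container into the worker Pod spec. The function names, the polling interval, and the injected sleep parameter are hypothetical and chosen only for illustration.

```python
import subprocess
import time
from typing import Callable


def gcs_is_healthy(address: str) -> bool:
    """Return True if `ray health-check` exits with status 0 for this address.

    Assumes the `ray` CLI is on PATH, as it would be inside a Ray image.
    """
    result = subprocess.run(
        ["ray", "health-check", "--address", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_gcs(
    check: Callable[[], bool],
    interval_s: float = 5.0,  # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `check` until it succeeds; return the number of attempts made.

    Like an init container, this blocks until the check passes, so anything
    sequenced after it (the worker's `ray start`) only runs once GCS answers.
    """
    attempts = 1
    while not check():
        sleep(interval_s)
        attempts += 1
    return attempts
```

A worker could call wait_for_gcs(lambda: gcs_is_healthy("my-head-svc:6379")) before starting Ray. The service name here is a placeholder, and 6379 is assumed to be the GCS port (Ray's default). Because the loop has no upper bound, a slow head Pod (such as the sleep 180 case) delays the worker instead of failing it.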