[Feature] Clean up init container configuration and startup sequence. #476

@DmitriGekhtman

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

At the moment, we instruct users to include an init container in each worker group spec.
The purpose of the init container is to wait for the service exposing the head GCS server to be created before the worker attempts ray start.
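For reference, the init container users currently add looks roughly like the following in a worker group spec (the container name, image, and service-name variable are illustrative; the actual sample manifests may differ):

```yaml
# Sketch of the init container currently included in each worker group spec.
# Names and image here are illustrative, not the exact values KubeRay ships.
initContainers:
  - name: init-wait-for-head
    image: busybox:1.28
    # Block until DNS resolves the head service, i.e. the Service object exists.
    # Note: this only proves the Service was created, not that the GCS is ready.
    command:
      - sh
      - -c
      - until nslookup $RAY_IP; do echo waiting for head service; sleep 2; done
```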

There are two issues with the current setup:

  • Having to include the init container makes the minimal configuration for a RayCluster messier. If an init container is necessary, it would be better to have the KubeRay operator create it by default.
  • The current logic is not quite correct, for the following reason:
    After the init container determines that the head service has been created, the Ray worker container immediately runs ray start,
    whether or not the GCS is actually ready. ray start has internal retry logic but eventually gives up if the head is not up
    quickly enough -- the worker container then crash-loops. (In practice this is not too bad, given the typical time scales for
    provisioning Ray pods and ray start's internal timeout.)

The tasks are to simplify configuration and correct the logic.

Two ways to correct the logic:

  1. Implement an initContainer that waits for the GCS to be ready.
  2. Drop the initContainer and just have the Ray container's entry-point wait as long as necessary.

Advantage of 2. is that it's simpler.

Advantage of 1. is that it's perhaps more idiomatic and gives more feedback to a user who is examining worker pod status with kubectl get pod -- the user can distinguish "Initializing" and "Running" states for the worker container.
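As a sketch of option 2, the worker's entry-point could wrap ray start in a generic retry helper like the one below. The helper name, the timeout, the service name, and the use of a TCP check on port 6379 are all assumptions for illustration, not existing KubeRay code:

```shell
#!/bin/sh
# wait_for TIMEOUT CMD...: retry CMD every 2 seconds until it succeeds or
# TIMEOUT seconds elapse (hypothetical helper, not existing KubeRay code).
wait_for() {
  timeout=$1; shift
  start=$(date +%s)
  until "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 2
  done
}

# Example entry-point: block until the head's GCS port accepts connections,
# then launch the worker. Service name and port 6379 are illustrative.
# wait_for 300 nc -z raycluster-head-svc 6379 \
#   && ray start --address=raycluster-head-svc:6379 --block
```

With this approach the worker container never crash-loops on a slow head; it simply stays in its wait loop until the check passes or the (configurable) timeout expires.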

If we stick with an initContainer (option 1), we can either

  1. Have the operator configure it automatically, OR
  2. Leave the configuration as-is, rely on Helm to hide it, and invest in Helm as the preferred means of deploying.

Use case

Interface cleanup.

Related issues

This falls under the generic category of "interface cleanup", for which we have this issue:
#368

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

  • P1: issue that should be fixed within a few weeks
  • enhancement: new feature or request
  • stability: pertains to basic infrastructure stability
