
[autoscaler] remove worker_default_node_type that is useless.#13588

Merged
ericl merged 110 commits into ray-project:master from AmeerHajAli:yaml
Jan 22, 2021

Conversation

@AmeerHajAli
Contributor

The field is not used anywhere.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@DmitriGekhtman
Contributor

Removed everything involving "workerDefaultPodType" from the operator code, along with a stray reference to "worker_default_node_type" in "rewrite_legacy_yaml_to_available_node_types".
Tested the operator with these changes for good measure.

LGTM -- just need to take another look at test output once CI is done.

@AmeerHajAli
Contributor Author

@ericl , please merge.

@ericl merged commit 1fbb752 into ray-project:master on Jan 22, 2021
@alanwguo
Contributor

This is a backwards-incompatible change (e.g., users must remove this field from their existing cluster configs after they upgrade to a version of Ray that includes this change).

@AmeerHajAli, @ericl,
I'd like to get a better understanding of the compatibility guarantees of the cluster config and ray in general.

Will this be a major version change?
Will there be migration instructions in the change logs?
Should the field be marked as deprecated first before deleting?
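For anyone cleaning up an existing config after upgrading, the user-side change is just deleting the field. A minimal sketch of a legacy-style cluster config; the node type names and resources here are hypothetical illustrations, not taken from Ray's shipped examples:

```yaml
# Illustrative cluster config excerpt (hypothetical node type names).
cluster_name: example

# Configs written against older Ray versions may carry this field:
# worker_default_node_type: example-worker   # <-- delete this line after upgrading

available_node_types:
  example-head:
    node_config: {}
    resources: {"CPU": 2}
  example-worker:
    node_config: {}
    resources: {"CPU": 4}
head_node_type: example-head
```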

@ericl
Contributor

ericl commented Jan 22, 2021 via email

@AmeerHajAli
Contributor Author

I can put it back in the schema for now. @alanwguo @ericl what do you think?

@paravatha

paravatha commented Jan 24, 2021

> This is a backwards-incompatible change. (Ex: user must remove this field in their existing cluster configs after they upgrade to a version of ray with this change)
>
> @AmeerHajAli, @ericl,
> I'd like to get a better understanding of the compatibility guarantees of the cluster config and ray in general.
>
> Will this be a major version change?
> Will there be migration instructions in the change logs?
> Should the field be marked as deprecated first before deleting?

Trying on GKE version 1.17.14-gke.400.

1. @AmeerHajAli @DmitriGekhtman +1 on this. I am getting this error:

```
error: error validating "ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml": error validating data: ValidationError(RayCluster.spec): missing required field "workerDefaultPodType" in io.ray.cluster.v1.RayCluster.spec; if you choose to ignore these errors, turn validation off with --validate=false
```

2. Additionally, when I try two-week-old YAMLs (via my fork: https://github.com/paravatha/ray), I am getting CrashLoopBackOff for the nightly image (rayproject/ray:nightly):

```
NAME               READY   STATUS             RESTARTS   AGE
ray-operator-pod   0/1     CrashLoopBackOff   6          9m29s
```

@paravatha

Switched to rayproject/ray:latest, but I am seeing the same issue:

```
Normal   Scheduled  89s                default-scheduler  Successfully assigned ray/ray-operator-pod to gke-ray-cluster-default-pool-938c5c6a-71c1
Normal   Pulling    10s (x4 over 88s)  kubelet            Pulling image "rayproject/ray:latest"
Normal   Pulled     10s (x4 over 57s)  kubelet            Successfully pulled image "rayproject/ray:latest"
Normal   Created    10s (x4 over 52s)  kubelet            Created container ray
Normal   Started    10s (x4 over 52s)  kubelet            Started container ray
Warning  BackOff    8s (x5 over 48s)   kubelet            Back-off restarting failed container
```

@DmitriGekhtman
Contributor

Hi @paravatha -- we'll be more careful with backwards-incompatible changes.
Also, problems like this should be less frequent once the k8s operator is in a stable monthly release of Ray.

In the meantime, you can use an image built from the last commit before this PR was merged:
In operator.yaml, for now you can use rayproject/ray:4e01a9. The tag is the first six characters of that commit's SHA, taken from here: https://github.com/ray-project/ray/commits/master

More info on Docker source images here: https://docs.ray.io/en/master/installation.html#docker-source-images
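As a concrete sketch of the pinning workaround above, the operator pod spec would reference the commit-tagged image instead of a moving tag. The exact layout of operator.yaml isn't shown in this thread, so this excerpt is hypothetical apart from the pod name, container name, and image tag, which come from the logs above:

```yaml
# Hypothetical operator.yaml excerpt: pin the operator image to the last
# commit before this PR instead of the moving :latest / :nightly tags.
apiVersion: v1
kind: Pod
metadata:
  name: ray-operator-pod
spec:
  containers:
    - name: ray
      image: rayproject/ray:4e01a9   # first six characters of the pre-PR commit SHA
```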

@AmeerHajAli
Contributor Author

@DmitriGekhtman , is the issue still there despite this fix?
#13637

@DmitriGekhtman
Contributor

@AmeerHajAli
I attempted to preserve backwards compatibility of the operator in #13623, but ended up introducing a bug, which is addressed in #13663 (not yet merged).

fishbone added a commit to fishbone/ray that referenced this pull request on Feb 16, 2021


6 participants