[autoscaler][kubernetes] Add ability to not copy cluster config to head node when calling `create_or_update_head_node`. by DmitriGekhtman · Pull Request #13720 · ray-project/ray

DmitriGekhtman · 2021-01-26T22:50:18Z

Why are these changes needed?

In situations where we don't run the monitor on the head node, there's no point in copying the cluster launching config onto the head node.
The copying operation has led to issues with the K8s operator (#13569).

This PR makes the following changes:

- Adds a no_monitor_on_head_node boolean argument to commands.create_or_update_cluster. When set to True, the cluster launching config is not copied to the head node.
- Updates the K8s operator to use the new flag.
- Updates the K8s operator example configs to use the --no-monitor flag (Add ability to not start Monitor when calling ray start #13505) in the head ray start commands.
To clean up the code a bit, I hid the config-copying logic behind a function call.
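
As a rough illustration of the flag described above, the sketch below gates the config copy on a boolean. Everything here is a simplified assumption rather than the code in this PR: `head_node_file_mounts` is a hypothetical helper, the remote path `~/ray_bootstrap_config.yaml` is assumed, and the flag is spelled `no_monitor_on_head` as in the review discussion.

```python
from typing import Dict


def head_node_file_mounts(
    user_file_mounts: Dict[str, str],
    local_config_path: str,
    no_monitor_on_head: bool = False,
) -> Dict[str, str]:
    """Return the remote-path -> local-path mounts for the head node.

    Hypothetical helper: the name, signature, and remote path are
    assumptions made for illustration only.
    """
    mounts = dict(user_file_mounts)  # copy so the caller's dict is not mutated
    if not no_monitor_on_head:
        # The monitor runs on the head node, so it needs the cluster
        # launching config to be present there.
        mounts["~/ray_bootstrap_config.yaml"] = local_config_path
    return mounts
```

With `no_monitor_on_head=True` the returned mounts contain only what the caller passed in, so nothing related to the launching config is copied.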

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(
Tested manually by running the unit tests for K8s operator and Ray on K8s cluster launcher.

AmeerHajAli · 2021-01-26T23:38:44Z

python/ray/autoscaler/_private/commands.py

can you add type annotation and function level description?
Any idea for a better name since we already have provider.prepare_for_head_node?

python/ray/autoscaler/_private/commands.py

DmitriGekhtman · 2021-01-27T00:37:25Z

Hmm, yeah, I guess these changes are useful only for the Kubernetes operator.

dmatch01 · 2021-01-27T16:14:08Z

python/ray/autoscaler/_private/commands.py

Just wanted to clarify the expected logic for this line in the K8s Operator use case where I assume no_monitor_on_head=true:

The warning will not be logged if "autoscaling-config" is not in ray_start_cmd. Is that the correct/intended interpretation?

In addition, all the use cases where no_monitor_on_head=false:

The warning will be logged whether or not "autoscaling-config" is in ray_start_cmd?

DmitriGekhtman · 2021-01-27

If no_monitor_on_head is true, the if condition becomes
not (A or True) = False,
so the warning isn’t logged.

The readability here could be improved.
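
The condition discussed in this thread can be sketched as a small truth table. This is a reconstruction from the comments above, not the code in commands.py: `should_warn` is a hypothetical name, A is assumed to stand for the check `"autoscaling-config" in ray_start_cmd`, and the example commands are illustrative.

```python
from itertools import product


def should_warn(ray_start_cmd: str, no_monitor_on_head: bool) -> bool:
    """Hypothetical reconstruction: not (A or no_monitor_on_head)."""
    has_autoscaling_config = "autoscaling-config" in ray_start_cmd  # "A"
    return not (has_autoscaling_config or no_monitor_on_head)


# Illustrative commands only; the exact flags used are assumptions.
commands = {
    True: "ray start --head --autoscaling-config=~/ray_bootstrap_config.yaml",
    False: "ray start --head --no-monitor",
}

# Print every combination of A and no_monitor_on_head.
for has_config, no_monitor in product([True, False], repeat=2):
    warn = should_warn(commands[has_config], no_monitor)
    print(f"A={has_config!s:5} no_monitor_on_head={no_monitor!s:5} -> warn={warn}")
```

Under this reading, the warning fires only when the flag is false and the start command lacks "autoscaling-config"; with the flag set to true it never fires.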

dmatch01 · 2021-01-29

Got it. And I see you improved readability as well in the PR update. Thank you!

AmeerHajAli · 2021-01-30T01:27:18Z

This looks good to me. Can we add a test that verifies it solves the issue?

DmitriGekhtman · 2021-02-01T02:38:29Z

> This looks good to me. Can we add a test that verifies it solves the issue?

Added a test which confirms that no files are mounted during the operator's Ray cluster creation.

AmeerHajAli · 2021-02-01T08:51:09Z

The test you added fails for any file mounts. Does that mean that the operator users should have nothing in their file_mounts config? Why not throw an error if they explicitly tried to mount?

DmitriGekhtman · 2021-02-01T16:52:55Z

Good question -- there's actually no user interface for specifying file mounts for the operator. Users configure and apply a Kubernetes custom resource, which is internally translated to an autoscaling config. There's no file mounts field for the custom resource, and the internal autoscaling config has no file mounts. This does suggest an improvement to the test: instead of reading in an autoscaling config, read in a cluster custom resource and translate it to an autoscaling config as the operator does. This has the bonus that it tests the function that does the translation. Will go ahead and make this change in the test.
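
The revised test idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `translate_cr_to_config` is a hypothetical stand-in for the operator's translation function, and the custom resource fields shown are invented for the example rather than taken from the operator's schema.

```python
from typing import Any, Dict


def translate_cr_to_config(cluster_cr: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the operator's CR -> autoscaling config step."""
    spec = cluster_cr["spec"]
    return {
        "cluster_name": cluster_cr["metadata"]["name"],
        "max_workers": spec.get("maxWorkers", 0),
        # The custom resource has no file mounts field, so the translated
        # config never carries any.
        "file_mounts": {},
    }


def test_no_file_mounts_after_translation() -> None:
    # Invented example resource; real custom resources have more fields.
    cluster_cr = {
        "metadata": {"name": "example-cluster"},
        "spec": {"maxWorkers": 3},
    }
    config = translate_cr_to_config(cluster_cr)
    assert config["file_mounts"] == {}
    assert config["cluster_name"] == "example-cluster"


test_no_file_mounts_after_translation()
```

Starting from the custom resource rather than a hand-written autoscaling config means the check also exercises the translation step itself.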

DmitriGekhtman · 2021-02-01T18:15:33Z

Will wait for tests to finish.

DmitriGekhtman · 2021-02-01T20:36:25Z

@AmeerHajAli Tests look good. Awaiting your final review.

DmitriGekhtman · 2021-02-02T17:49:38Z

Added a comment describing the additional flag.

python/ray/autoscaler/_private/commands.py

AmeerHajAli

LGTM. left a minor comment.

AmeerHajAli · 2021-02-02T21:43:27Z

@ericl , this looks good to me. Can you please merge?

dmatch01 · 2021-02-04T13:31:55Z

@ericl Just wanted to give a gentle nudge for the merge. Would like to continue with my testing once this merge is available. Thank you! cc: @AmeerHajAli @DmitriGekhtman

dmatch01 · 2021-02-04T18:44:40Z

@ericl Thank you!

DmitriGekhtman assigned AmeerHajAli and ericl Jan 26, 2021

AmeerHajAli reviewed Jan 26, 2021 (four review comments on python/ray/autoscaler/_private/commands.py, all now outdated)

dmatch01 reviewed Jan 27, 2021

AmeerHajAli added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Jan 27, 2021

DmitriGekhtman removed the @author-action-required label Jan 29, 2021

ericl removed their assignment Jan 30, 2021

AmeerHajAli added the @author-action-required label Jan 30, 2021

DmitriGekhtman added 5 commits January 31, 2021 18:31

- ccb248e Add option to skip bootstrapping head node autoscaling config
- 4ce804e don't close remote config before copying
- b984be5 Type
- 4b2b9ff Type hints etc.
- ba6375f test

DmitriGekhtman force-pushed the no-monitor-no-bootstrap-config branch from 8f2a9b0 to ba6375f Compare February 1, 2021 02:36

DmitriGekhtman removed the @author-action-required label Feb 1, 2021

DmitriGekhtman added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Feb 1, 2021

- 8622ea6 Test CR to config conversion

DmitriGekhtman removed the tests-ok label Feb 1, 2021

DmitriGekhtman added the tests-ok label Feb 1, 2021

- fe1f811 comment

AmeerHajAli reviewed Feb 2, 2021 (one review comment on python/ray/autoscaler/_private/commands.py)

AmeerHajAli approved these changes Feb 2, 2021

ericl approved these changes Feb 4, 2021

ericl merged commit db59736 into ray-project:master Feb 4, 2021

fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021

Revert "[autoscaler][kubernetes] Add ability to not copy cluster config to head node when calling `create_or_update_head_node`. (ray-project#13720)" (d37e8e4). This reverts commit 729883a.

AmeerHajAli added this to the Serverless Autoscaling milestone Apr 4, 2021
