Add "ray status" debug tool for autoscaler. by ericl · Pull Request #9091 · ray-project/ray

ericl · 2020-06-22T21:55:53Z

Why are these changes needed?

This adds a "ray status" cluster debug tool for the autoscaler. It can be used to inspect the current autoscaling state instead of trying to read cluster logs.

This PR also makes some minor improvements to autoscaler logging.

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/latest/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested (please justify below)

AmplabJenkins · 2020-06-22T22:03:00Z

Can one of the admins verify this patch?

richardliaw · 2020-06-22T22:09:07Z

python/ray/scripts/scripts.py

+def status(address):
+    """Print cluster status, including autoscaling info."""
+    if not address:
+        address = services.find_redis_address_or_die()
+    logger.info("Connecting to Ray instance at {}.".format(address))
+    ray.init(address=address)
+    print(debug_status())


Hey, wouldn't this be more useful if it were used like:

ray status [cluster_yaml], similar to ray dashboard cluster.yaml?

ericl · 2020-06-23

There are a number of ray commands that are intended to run on the currently active cluster, such as ray memory. This is consistent with those.
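For readers following the thread, here is a minimal sketch of the invocation style being defended: a subcommand that targets the currently active cluster and takes an optional address rather than a cluster YAML. It uses argparse from the standard library purely so the sketch runs on its own. It is not how Ray wires its CLI. The `--address` flag name is an assumption (the diff excerpt shows only the function body), and `find_local_address` and `fetch_status` are hypothetical stand-ins for `services.find_redis_address_or_die()` and `ray.init(address=...)` followed by `debug_status()`.

```python
import argparse
from typing import List


def find_local_address() -> str:
    # Hypothetical stand-in for services.find_redis_address_or_die().
    return "127.0.0.1:6379"


def fetch_status(address: str) -> str:
    # Hypothetical stand-in for ray.init(address=...) plus debug_status().
    return "Cluster status for {}".format(address)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ray-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    status = sub.add_parser("status", help="Print cluster status.")
    # Assumed flag name; the address is optional, mirroring `if not address`.
    status.add_argument("--address", default=None,
                        help="Address of the running cluster.")
    return parser


def main(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    # Fall back to discovering the local cluster when no address is given.
    address = args.address or find_local_address()
    return fetch_status(address)


if __name__ == "__main__":
    print(main(["status"]))
```

With no arguments beyond the subcommand, the sketch falls back to the discovered address, which is the "run on the currently active cluster" behavior described above.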

richardliaw · 2020-06-22T22:09:35Z

Looks very useful, can you post a print output?

AmplabJenkins · 2020-06-22T23:03:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27443/
Test FAILed.

AmplabJenkins · 2020-06-22T23:27:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27442/
Test FAILed.

ericl · 2020-06-23T00:58:00Z

The output looks like this:

Cluster status: 0/0 target nodes (0 pending)
 - MostDelayedHeartbeats: {}
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage: 
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
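If the printed report needs to be consumed by a script, something like the following might work. This is a rough sketch based only on the sample output above. The helper names `parse_status` and `parse_min_mean_max` are hypothetical and not part of Ray, and the output format is not guaranteed to stay stable across versions.

```python
import re
from typing import Dict

# Sample copied from the output posted in this thread.
SAMPLE = """Cluster status: 0/0 target nodes (0 pending)
 - MostDelayedHeartbeats: {}
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage:
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
"""


def parse_status(text: str) -> Dict[str, str]:
    """Map the header and each ' - Key: value' line to a string entry."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            # Split on the first colon only; values may be empty.
            key, _, value = stripped[2:].partition(":")
            result[key.strip()] = value.strip()
        elif stripped.startswith("Cluster status:"):
            result["Cluster status"] = stripped.split(":", 1)[1].strip()
    return result


def parse_min_mean_max(value: str) -> Dict[str, float]:
    """Turn 'Min=a Mean=b Max=c' into a dict of floats."""
    pairs = re.findall(r"(Min|Mean|Max)=(-?\d+(?:\.\d+)?)", value)
    return {name: float(number) for name, number in pairs}


if __name__ == "__main__":
    parsed = parse_status(SAMPLE)
    print(parsed["Cluster status"])
    print(parse_min_mean_max(parsed["NodeIdleSeconds"]))
```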

ericl · 2020-06-23T01:00:46Z

python/ray/autoscaler/commands.py

+            exec_cluster(config_file, "ray stop", False, False, False, False,
+                         False, override_cluster_name, None, False)
+        except Exception:
+            logger.exception("Ignoring error attempting a clean shutdown.")


@ijrsvt otherwise teardown fails if ray is misconfigured

AmplabJenkins · 2020-06-23T02:33:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27450/
Test FAILed.

wuisawesome

lgtm

AmplabJenkins · 2020-06-23T22:11:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27493/
Test FAILed.

AmplabJenkins · 2020-06-24T23:00:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27536/
Test FAILed.

ericl added 5 commits June 22, 2020 14:46

debug state

a976782

fix

f9d0559

update

70fb49b

update

362b9d2

fix test

05cb903

ericl requested a review from richardliaw June 22, 2020 22:00

ericl assigned wuisawesome Jun 22, 2020

richardliaw reviewed Jun 22, 2020

ericl added 2 commits June 22, 2020 17:59

fix

4aec3e5

fix

3c0017d

ericl commented Jun 23, 2020

ericl mentioned this pull request Jun 23, 2020

[autoscaler] Initial support for multiple worker types #9096

Merged

6 tasks

wuisawesome approved these changes Jun 23, 2020

Merge remote-tracking branch 'upstream/master' into debug-autoscaler-str

4464169

fix

6be7344

ericl merged commit 0ff24ec into ray-project:master Jun 25, 2020

ericl merged 9 commits into ray-project:master from ericl:debug-autoscaler-str

4 participants
