Add "ray status" debug tool for autoscaler. #9091

Merged
ericl merged 9 commits into ray-project:master from ericl:debug-autoscaler-str
Jun 25, 2020

@ericl
Contributor

@ericl ericl commented Jun 22, 2020

Why are these changes needed?

This adds a "ray status" cluster debug tool for the autoscaler. It can be used to inspect the current autoscaling state instead of trying to read cluster logs.

It also makes some minor improvements to autoscaler logging.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

Comment on lines +1104 to +1110
def status(address):
    """Print cluster status, including autoscaling info."""
    if not address:
        address = services.find_redis_address_or_die()
    logger.info("Connecting to Ray instance at {}.".format(address))
    ray.init(address=address)
    print(debug_status())
Contributor

Hey, wouldn't this be more useful if it were used like:

ray status [cluster_yaml], similar to ray dashboard cluster.yaml?

Contributor Author

Several ray commands, such as ray memory, are intended to run against the currently active cluster. This is consistent with those.
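
The convention described in this reply, where a command accepts an optional address and otherwise attaches to the cluster already running on the machine, can be sketched as follows. This is only an illustration and not the code from this PR: `resolve_address`, `discover_local_address`, and the placeholder address it returns are hypothetical, and `argparse` stands in for whatever CLI wiring Ray actually uses.

```python
import argparse
from typing import Callable, Optional, Sequence


def resolve_address(address: Optional[str],
                    discover: Callable[[], str]) -> str:
    """Return the explicit address if one was given, else discover one."""
    return address if address else discover()


def discover_local_address() -> str:
    # Hypothetical stand-in for locating the active local cluster.
    # A real implementation would inspect the machine and exit with an
    # error if no running cluster is found.
    return "127.0.0.1:6379"


def main(argv: Optional[Sequence[str]] = None) -> str:
    parser = argparse.ArgumentParser(prog="status")
    parser.add_argument(
        "--address",
        default=None,
        help="Cluster address; defaults to the active local cluster.")
    args = parser.parse_args(argv)
    # No cluster config file is needed: the command targets whichever
    # cluster the caller is already attached to.
    return resolve_address(args.address, discover_local_address)
```

Under these assumptions, calling `main([])` falls back to discovery, while `main(["--address", "10.0.0.1:6379"])` uses the address as given.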

@richardliaw
Contributor

Looks very useful, can you post a print output?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27443/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27442/
Test FAILed.

@ericl
Contributor Author

ericl commented Jun 23, 2020

The output looks like this:

Cluster status: 0/0 target nodes (0 pending)
 - MostDelayedHeartbeats: {}
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage: 
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
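
For anyone who wants to consume this output in a script, the sample above can be parsed with a few lines of Python. This is a rough sketch that assumes the layout stays exactly as shown (a header line followed by ` - Key: value` lines); the output is a debug string, so the format may change, and `parse_status` is a hypothetical helper, not part of Ray.

```python
import re
from typing import Dict, Union

SAMPLE = """Cluster status: 0/0 target nodes (0 pending)
 - MostDelayedHeartbeats: {}
 - NodeIdleSeconds: Min=-1 Mean=-1 Max=-1
 - NumNodesConnected: 0
 - NumNodesUsed: 0.0
 - ResourceUsage:
 - TimeSinceLastHeartbeat: Min=-1 Mean=-1 Max=-1
"""

# Matches summary values such as "Min=-1 Mean=-1 Max=-1".
_NUM = r"(-?\d+(?:\.\d+)?)"
_SUMMARY = re.compile(r"Min={0} Mean={0} Max={0}".format(_NUM))

Value = Union[str, Dict[str, float]]


def parse_status(text: str) -> Dict[str, Value]:
    """Parse the ' - Key: value' lines of the status output into a dict."""
    fields: Dict[str, Value] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("- "):
            continue  # Skip the header line and any blank lines.
        key, _, value = line[2:].partition(":")
        value = value.strip()
        match = _SUMMARY.fullmatch(value)
        if match:
            lo, mean, hi = (float(g) for g in match.groups())
            fields[key] = {"min": lo, "mean": mean, "max": hi}
        else:
            # Anything else is kept as the raw string.
            fields[key] = value
    return fields
```

With the sample above, `parse_status(SAMPLE)["NodeIdleSeconds"]` is a dict with `min`, `mean`, and `max` all equal to `-1.0`, and the empty `ResourceUsage` field comes back as an empty string.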

    exec_cluster(config_file, "ray stop", False, False, False, False,
                 False, override_cluster_name, None, False)
except Exception:
    logger.exception("Ignoring error attempting a clean shutdown.")
Contributor Author

@ijrsvt Otherwise teardown fails if Ray is misconfigured.
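
The pattern in the snippet above is a best-effort cleanup step: try the graceful shutdown, log the failure if it raises, and carry on with the rest of teardown. A minimal standalone sketch follows; `teardown_best_effort`, `stop_ray`, and `terminate_nodes` are hypothetical names used for illustration and do not come from this PR.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def teardown_best_effort(stop_ray: Callable[[], None],
                         terminate_nodes: Callable[[], None]) -> None:
    """Attempt a clean shutdown, then terminate nodes regardless."""
    try:
        stop_ray()
    except Exception:
        # logger.exception records the message plus the traceback at
        # ERROR level, so the failure stays visible without aborting.
        logger.exception("Ignoring error attempting a clean shutdown.")
    # Reached whether or not the graceful stop succeeded.
    terminate_nodes()
```

Catching `Exception` rather than using a bare `except:` lets `KeyboardInterrupt` and `SystemExit` propagate, so a user can still interrupt the teardown.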

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27450/
Test FAILed.

Contributor

@wuisawesome wuisawesome left a comment

lgtm

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27493/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27536/
Test FAILed.

@ericl ericl merged commit 0ff24ec into ray-project:master Jun 25, 2020
4 participants