Closed
Labels
P1 (issue that should be fixed within a few weeks), RFC (RFC issues), enhancement (request for new feature and/or capability), fix-error-msg (this issue has a bad error message that should be improved)
Description
The current autoscaler output is difficult to interpret because it is verbose and full of low-level details. This is a proposal to clean it up by periodically emitting the following summary table:
======== Autoscaler status 2020-11-20 23:14:36,653 ========
Node status
------------------------------------------------------------
Healthy:
2 p3.2xlarge (2 active)
20 m4.4xlarge (18 active, 2 idle)
Pending:
34.5.234.51: m4.4xlarge, launching
34.5.234.52: m4.4xlarge, launching
34.5.234.53: m4.4xlarge, waiting for ssh
34.5.234.54: m4.4xlarge, waiting for ssh
34.5.234.55: m4.4xlarge, starting ray, /tmp/ray/setup-10.log
34.5.234.56: m4.4xlarge, setting up, /tmp/ray/setup-11.log
34.5.234.57: m4.4xlarge, setting up, /tmp/ray/setup-12.log
Recent failures:
172.24.25.33: m4.4xlarge, /tmp/ray/setup-8.log
35.4.235.11: p3.2xlarge, /tmp/ray/setup-9.log
Resources
------------------------------------------------------------
Usage:
530.0/544.0 CPU
2.0/2.0 GPU
0.0/2.0 AcceleratorType:V100
0.0 GiB/1583.19 GiB memory
0.0 GiB/471.02 GiB object_store_memory
Demands:
{"CPU": 1}: 150 pending tasks
[{"CPU": 4} * 5]: 5 pending placement groups
[{"CPU": 1} * 100]: from request_resources()
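The Demands lines above group pending work by resource shape. A minimal sketch of that grouping is shown here. The function names and input lists are hypothetical, not part of the autoscaler, and the reading of `[{"CPU": 4} * 5]` as "a group made of five identical bundles" is an assumption.

```python
import json
from collections import Counter


def shape_key(resources):
    # Canonical JSON string so that equal resource shapes compare equal.
    return json.dumps(resources, sort_keys=True)


def compress_bundles(bundles):
    # Collapse a list of bundles into a string such as '[{"CPU": 1} * 100]'.
    counts = Counter(shape_key(b) for b in bundles)
    parts = [f"{shape} * {n}" for shape, n in counts.items()]
    return "[" + ", ".join(parts) + "]"


def summarize_demands(pending_tasks, pending_placement_groups, requested_bundles):
    """Return the 'Demands' lines (hypothetical helper, not a Ray API)."""
    lines = []

    # Tasks with the same resource shape collapse into one counted line.
    task_counts = Counter(shape_key(t) for t in pending_tasks)
    for shape, n in task_counts.items():
        lines.append(f"{shape}: {n} pending tasks")

    # Assumption: each placement group is a list of bundles, and groups
    # with identical bundle lists are counted together.
    pg_counts = Counter(compress_bundles(pg) for pg in pending_placement_groups)
    for bundles, n in pg_counts.items():
        lines.append(f"{bundles}: {n} pending placement groups")

    # Bundles requested explicitly through request_resources().
    if requested_bundles:
        lines.append(f"{compress_bundles(requested_bundles)}: from request_resources()")
    return lines


demand_lines = summarize_demands(
    pending_tasks=[{"CPU": 1}] * 150,
    pending_placement_groups=[[{"CPU": 4}] * 5] * 5,
    requested_bundles=[{"CPU": 1}] * 100,
)
print("\n".join(demand_lines))
```

With these inputs the sketch reproduces the three Demands lines of the sample table.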
Implementation details:
- The autoscaler should periodically generate a JSON status message that includes the above information.
- We should log the above text summary of the JSON status every 10-30s.
- Other Ray components, such as the dashboard and ray status, can also access this information.
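The split between a JSON status message and its text rendering could look roughly like the sketch below. The field names (`time`, `healthy`, `pending`, `failures`, `usage`) and the `render_status` function are assumptions made for illustration. The proposal does not fix a schema, and memory formatting in GiB and the Demands block are left out for brevity.

```python
import json

RULE = "-" * 60

# Hypothetical status payload; the schema is an assumption.
EXAMPLE_STATUS = {
    "time": "2020-11-20 23:14:36,653",
    "healthy": [
        {"node_type": "p3.2xlarge", "count": 2, "active": 2, "idle": 0},
        {"node_type": "m4.4xlarge", "count": 20, "active": 18, "idle": 2},
    ],
    "pending": [
        {"ip": "34.5.234.51", "node_type": "m4.4xlarge", "state": "launching"},
        {"ip": "34.5.234.56", "node_type": "m4.4xlarge", "state": "setting up",
         "log": "/tmp/ray/setup-11.log"},
    ],
    "failures": [
        {"ip": "35.4.235.11", "node_type": "p3.2xlarge",
         "log": "/tmp/ray/setup-9.log"},
    ],
    "usage": {"CPU": [530.0, 544.0], "GPU": [2.0, 2.0]},
}


def render_status(status_json):
    """Render a JSON status message as the text summary."""
    status = json.loads(status_json)
    out = [f"======== Autoscaler status {status['time']} ========"]

    out += ["Node status", RULE, "Healthy:"]
    for node in status["healthy"]:
        states = [f"{node['active']} active"]
        if node["idle"]:  # omit the idle count when it is zero
            states.append(f"{node['idle']} idle")
        detail = ", ".join(states)
        out.append(f"{node['count']} {node['node_type']} ({detail})")

    out.append("Pending:")
    for node in status["pending"]:
        fields = [node["node_type"], node["state"]]
        if node.get("log"):  # a setup log exists only in later states
            fields.append(node["log"])
        out.append(f"{node['ip']}: " + ", ".join(fields))

    out.append("Recent failures:")
    for node in status["failures"]:
        out.append(f"{node['ip']}: {node['node_type']}, {node['log']}")

    out += ["Resources", RULE, "Usage:"]
    for name, (used, total) in status["usage"].items():
        out.append(f"{used}/{total} {name}")
    return "\n".join(out)


summary_text = render_status(json.dumps(EXAMPLE_STATUS))
print(summary_text)
```

Because the renderer only consumes the JSON string, the same message could be logged by the autoscaler every 10-30s and read independently by other consumers.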