[RFC] Improved autoscaler log messages #12221

The current autoscaler output is difficult to interpret because it is verbose and full of low-level details. This proposal cleans it up by periodically emitting the following summary table:

```
======== Autoscaler status 2020-11-20 23:14:36,653 ========
Node status
------------------------------------------------------------
Healthy:
 2 p3.2xlarge (2 active)
 20 m4.4xlarge (18 active, 2 idle)

Pending:
 34.5.234.51: m4.4xlarge, launching
 34.5.234.52: m4.4xlarge, launching
 34.5.234.53: m4.4xlarge, waiting for ssh
 34.5.234.54: m4.4xlarge, waiting for ssh
 34.5.234.55: m4.4xlarge, starting ray, /tmp/ray/setup-10.log
 34.5.234.56: m4.4xlarge, setting up, /tmp/ray/setup-11.log
 34.5.234.57: m4.4xlarge, setting up, /tmp/ray/setup-12.log

Recent failures:
 172.24.25.33: m4.4xlarge, /tmp/ray/setup-8.log
 35.4.235.11: p3.2xlarge, /tmp/ray/setup-9.log

Resources
------------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2.0/2.0 GPU
 0.0/2.0 AcceleratorType:V100
 0.0 GiB/1583.19 GiB memory
 0.0 GiB/471.02 GiB object_store_memory

Demands:
 {"CPU": 1}: 150 pending tasks
 [{"CPU": 4} * 5]: 5 pending placement groups
 [{"CPU": 1} * 100]: from request_resources()
```

Implementation details:
- The autoscaler should periodically generate a JSON status message that includes the above information.
- We should log the above text summary form of the JSON status every 10-30s.
- Other Ray components such as the dashboard and `ray status` can also access this information.
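
To make the split between the JSON status and its text rendering concrete, here is a minimal sketch. The schema (`time`, `healthy`, `pending`, `usage`) and the helper names `summarize_nodes` and `format_status` are hypothetical, chosen for illustration only; they are not an existing Ray API, and the sketch covers only part of the table above.

```python
import json
from collections import Counter


def summarize_nodes(nodes):
    """Group healthy nodes by type and count active vs. idle ones."""
    active = Counter(n["type"] for n in nodes if n["active"])
    idle = Counter(n["type"] for n in nodes if not n["active"])
    lines = []
    for node_type in sorted(set(active) | set(idle)):
        parts = []
        if active[node_type]:
            parts.append(f"{active[node_type]} active")
        if idle[node_type]:
            parts.append(f"{idle[node_type]} idle")
        total = active[node_type] + idle[node_type]
        lines.append(f" {total} {node_type} ({', '.join(parts)})")
    return lines


def format_status(status):
    """Render the text summary from the JSON-serializable status dict."""
    lines = [f"======== Autoscaler status {status['time']} ========"]
    lines += ["Node status", "-" * 60, "Healthy:"]
    lines += summarize_nodes(status["healthy"])
    lines += ["", "Pending:"]
    lines += [f" {p['ip']}: {p['type']}, {p['state']}" for p in status["pending"]]
    lines += ["", "Resources", "-" * 60, "Usage:"]
    lines += [
        f" {used}/{total} {name}"
        for name, (used, total) in status["usage"].items()
    ]
    return "\n".join(lines)


# Hypothetical status payload; the field names are assumptions.
status = {
    "time": "2020-11-20 23:14:36,653",
    "healthy": [{"type": "p3.2xlarge", "active": True}] * 2
    + [{"type": "m4.4xlarge", "active": True}] * 18
    + [{"type": "m4.4xlarge", "active": False}] * 2,
    "pending": [
        {"ip": "34.5.234.51", "type": "m4.4xlarge", "state": "launching"},
    ],
    "usage": {"CPU": [530.0, 544.0], "GPU": [2.0, 2.0]},
}

# The same dict can be serialized for other consumers and rendered
# as text for the periodic log message.
payload = json.dumps(status)
print(format_status(json.loads(payload)))
```

Keeping the JSON dict as the single source of truth means the log text, the dashboard, and `ray status` would all render from the same data and could not drift apart.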
