[RFC] Improved autoscaler log messages #12221

@ericl

The current autoscaler output is difficult to interpret because it is verbose and full of low-level details. This is a proposal to clean it up by periodically emitting the following summary table:

======== Autoscaler status 2020-11-20 23:14:36,653 ========
Node status
------------------------------------------------------------
Healthy:
 2 p3.2xlarge (2 active)
 20 m4.4xlarge (18 active, 2 idle)

Pending:
 34.5.234.51: m4.4xlarge, launching
 34.5.234.52: m4.4xlarge, launching
 34.5.234.53: m4.4xlarge, waiting for ssh
 34.5.234.54: m4.4xlarge, waiting for ssh
 34.5.234.55: m4.4xlarge, starting ray, /tmp/ray/setup-10.log
 34.5.234.56: m4.4xlarge, setting up, /tmp/ray/setup-11.log
 34.5.234.57: m4.4xlarge, setting up, /tmp/ray/setup-12.log

Recent failures:
 172.24.25.33: m4.4xlarge, /tmp/ray/setup-8.log
 35.4.235.11: p3.2xlarge, /tmp/ray/setup-9.log

Resources
------------------------------------------------------------
Usage:
 530.0/544.0 CPU
 2.0/2.0 GPU
 0.0/2.0 AcceleratorType:V100
 0.0 GiB/1583.19 GiB memory
 0.0 GiB/471.02 GiB object_store_memory

Demands:
 {"CPU": 1}: 150 pending tasks
 [{"CPU": 4} * 5]: 5 pending placement groups
 [{"CPU": 1} * 100]: from request_resources()

Implementation details:

  • The autoscaler should periodically generate a JSON status message that includes the above information.
  • We should log the above text summary of the JSON status every 10-30s.
  • Other Ray components, such as the dashboard and ray status, can also access this information.
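
To make the split between the JSON status and its text summary concrete, here is a minimal sketch. The schema (keys such as "healthy", "pending", "failures", "usage", "demands") and the names render_status and example_status are hypothetical, chosen for illustration only; they are not an existing Ray API. It covers only the simple per-task demand form and omits memory units and placement groups.

```python
import json

RULE = "-" * 60


def render_status(status: dict) -> str:
    """Render a status dict (hypothetical schema) as the text summary."""
    out = [
        f"======== Autoscaler status {status['time']} ========",
        "Node status",
        RULE,
        "Healthy:",
    ]
    for node_type, states in status["healthy"].items():
        total = sum(states.values())
        # Only mention states that have at least one node.
        detail = ", ".join(f"{n} {state}" for state, n in states.items() if n)
        out.append(f" {total} {node_type} ({detail})")
    out += ["", "Pending:"]
    for node in status["pending"]:
        fields = [node["type"], node["state"]]
        if node.get("log"):
            fields.append(node["log"])
        out.append(f" {node['ip']}: " + ", ".join(fields))
    out += ["", "Recent failures:"]
    for node in status["failures"]:
        out.append(f" {node['ip']}: {node['type']}, {node['log']}")
    out += ["", "Resources", RULE, "Usage:"]
    for name, (used, total) in status["usage"].items():
        out.append(f" {used}/{total} {name}")
    out += ["", "Demands:"]
    for demand in status["demands"]:
        shape = json.dumps(demand["shape"])
        out.append(f" {shape}: {demand['count']} {demand['source']}")
    return "\n".join(out)


def example_status() -> dict:
    """A small status message using values from the table above."""
    return {
        "time": "2020-11-20 23:14:36,653",
        "healthy": {
            "p3.2xlarge": {"active": 2, "idle": 0},
            "m4.4xlarge": {"active": 18, "idle": 2},
        },
        "pending": [
            {"ip": "34.5.234.51", "type": "m4.4xlarge", "state": "launching"},
            {"ip": "34.5.234.55", "type": "m4.4xlarge",
             "state": "starting ray", "log": "/tmp/ray/setup-10.log"},
        ],
        "failures": [
            {"ip": "35.4.235.11", "type": "p3.2xlarge",
             "log": "/tmp/ray/setup-9.log"},
        ],
        "usage": {"CPU": [530.0, 544.0], "GPU": [2.0, 2.0]},
        "demands": [
            {"shape": {"CPU": 1}, "count": 150, "source": "pending tasks"},
        ],
    }
```

In this sketch the autoscaler would publish json.dumps(example_status()) and the logger would call render_status(json.loads(message)) on each 10-30s tick, so the dashboard and ray status could consume the same JSON without parsing log text.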

Labels

P1 (issue that should be fixed within a few weeks), RFC (RFC issues), enhancement (request for new feature and/or capability), fix-error-msg (this issue has a bad error message that should be improved)
