[Autoscaler] Startup logs are missing #12771

@wuisawesome

Description

What is the problem?

I'm not sure when this regressed: autoscaler logs were originally written to monitor.out, then moved to their own separate per-node log files, and now they don't seem to exist at all.
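A quick way to confirm (a hypothetical sketch, assuming the default Ray session log directory at /tmp/ray/session_latest/logs; adjust if you use a custom --temp-dir) is to look for the monitor/autoscaler files on the head node:

```shell
# Look for autoscaler/monitor logs in the default Ray session log directory.
LOG_DIR=/tmp/ray/session_latest/logs
ls "$LOG_DIR" 2>/dev/null | grep -i 'monitor' || echo "no monitor logs found"
```

On an affected cluster this prints "no monitor logs found" instead of listing monitor.out/monitor.err or per-node autoscaler logs.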

This seems easy to reproduce with any cluster YAML (pick your favorite); mine is attached below.

# Experimental: an example of configuring a mixed-node-type cluster.
cluster_name: alex
min_workers: 2
max_workers: 40

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-1
    availability_zone: us-west-1a

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m4.xlarge
        # For AWS instances, autoscaler will automatically add the available
        # CPUs/GPUs/accelerator_type ({"CPU": 4} for m4.xlarge) in "resources".
        resources: {"CPU": 4}
        min_workers: 2
        max_workers: 5
    custom1:
        node_config:
            InstanceType: m4.4xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"Custom1": 1}
        max_workers: 10
    custom2:
        node_config:
            InstanceType: m4.4xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"Custom2": 2}
        max_workers: 4

# Specify the node type of the head node (as configured above).
head_node_type: cpu_4_ondemand

# Specify the default type of the worker node (as configured above).
worker_default_node_type: cpu_4_ondemand

# The default settings for the head node. This will be merged with the per-node
# type configs given above.
head_node:
    ImageId: latest_dlami

# The default settings for worker nodes. This will be merged with the per-node
# type configs given above.
worker_nodes:
    ImageId: latest_dlami

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    - git clone --branch autoscaler_formatted_report https://github.com/wuisawesome/ray.git || true
    - pushd ray; git reset --hard; popd;
    - ray/python/ray/setup-dev.py -y 
    - rm /home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py
    - ln -s /home/ubuntu/ray/python/ray/monitor.py /home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py


# Configure the cluster for very conservative auto-scaling otherwise.
target_utilization_fraction: 1.0
idle_timeout_minutes: 2

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
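The repro flow I'd expect, sketched as shell commands (assumes the YAML above is saved as cluster.yaml and the default Ray session log path; not runnable without AWS credentials):

```shell
# Launch the cluster from the YAML above (saved as cluster.yaml).
ray up -y cluster.yaml

# On the head node, autoscaler/monitor output used to appear here:
ray exec cluster.yaml 'ls -l /tmp/ray/session_latest/logs/'

# Expected: monitor.out (or per-node autoscaler log files).
# Observed: neither is present.
```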

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Labels

P0 (Issues that should be fixed in short order), bug (Something that is supposed to be working; but isn't), release-blocker (P0 issue that blocks the release)
