Closed
Labels: P0 (Issues that should be fixed in short order), bug (Something that is supposed to be working; but isn't), release-blocker (P0 issue that blocks the release)
Description
What is the problem?
Unsure when this happened, but autoscaler logs were originally written to monitor.out, then moved to their own separate per-node log files. Now they don't seem to exist at all.
This seems to be easy to reproduce (pick your favorite YAML); the config I used is attached below.
```yaml
# Experimental: an example of configuring a mixed-node-type cluster.
cluster_name: alex
min_workers: 2
max_workers: 40

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-1
    availability_zone: us-west-1a

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m4.xlarge
        # For AWS instances, autoscaler will automatically add the available
        # CPUs/GPUs/accelerator_type ({"CPU": 4} for m4.xlarge) in "resources".
        resources: {"CPU": 4}
        min_workers: 2
        max_workers: 5
    custom1:
        node_config:
            InstanceType: m4.4xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"Custom1": 1}
        max_workers: 10
    custom2:
        node_config:
            InstanceType: m4.4xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"Custom2": 2}
        max_workers: 4

# Specify the node type of the head node (as configured above).
head_node_type: cpu_4_ondemand
# Specify the default type of the worker node (as configured above).
worker_default_node_type: cpu_4_ondemand

# The default settings for the head node. This will be merged with the per-node
# type configs given above.
head_node:
    ImageId: latest_dlami

# The default settings for worker nodes. This will be merged with the per-node
# type configs given above.
worker_nodes:
    ImageId: latest_dlami

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    - git clone --branch autoscaler_formatted_report https://github.com/wuisawesome/ray.git || true
    - pushd ray; git reset --hard; popd;
    - ray/python/ray/setup-dev.py -y
    - rm /home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py
    - ln -s /home/ubuntu/ray/python/ray/monitor.py /home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py

# Configure the cluster for very conservative auto-scaling otherwise.
target_utilization_fraction: 1.0
idle_timeout_minutes: 2

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
```
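To check whether any autoscaler/monitor logs exist at all after bringing the cluster up, you can list the session log directory on the head node. The path below is an assumption based on Ray's default temp-directory layout (`/tmp/ray/session_latest/logs`); adjust it if your setup uses a custom temp dir.

```shell
# List any monitor/autoscaler log files in the (assumed) default Ray
# session log directory on the head node; prints a message if none exist.
LOG_DIR=/tmp/ray/session_latest/logs
ls "$LOG_DIR" 2>/dev/null | grep -i monitor || echo "no monitor logs found"
```

Run this on the head node (e.g. via `ray attach` on the YAML above) to confirm whether the logs are genuinely absent rather than just relocated.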
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.