Local cluster YAML no longer working in 0.9.0.dev0 #7632

@arsedler9

Description

What is the problem?

With my previous version of Ray (0.7.7), I had a cluster.yaml file that worked well, but it stopped working after I upgraded to 0.9.0.dev0 to pick up a recent Tune bug fix for PAUSED trials. When I run a test script after ray up cluster.yaml, only the head node is visible and I get this warning:
2020-03-16 19:48:44,344 WARNING worker.py:802 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
There is a firewall between my machines, so I previously had to open specific ports and force Ray to use them in my cluster YAML file. Could there be new port changes in 0.9.0.dev0 that are blocking communication?
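I'm not sure whether the warning itself is the cause or just a symptom. If it matters, one thing I could try is passing a matching _internal_config when connecting; assuming ray.init still accepts _internal_config as a JSON string (I haven't verified this against 0.9.0.dev0), a rough sketch would be:

import json
import ray

# Guess: connect with an explicit (empty) _internal_config so it matches a head
# node that was started without any overrides; the address is my head node.
ray.init(
    address="neuron.bme.emory.edu:6379",
    _internal_config=json.dumps({}),
)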

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.9.0.dev0
OS: Centos 7

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

My cluster.yaml is:

cluster_name: asedler_nesu

## NOTE: Typically for local clusters, min_workers == initial_workers == max_workers.

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == initial_workers == max_workers.
min_workers: 1

# The initial number of worker nodes to launch in addition to the head node.
# Typically, min_workers == initial_workers == max_workers.
initial_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == initial_workers == max_workers.
max_workers: 1

# Autoscaling parameters.
# Ignore this if min_workers == initial_workers == max_workers.
autoscaling_mode: default
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
    container_name: "" # e.g. ray_docker
    # If true, pulls the latest version of the image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

# Local specific configuration.
provider:
    type: local
    head_ip: neuron.bme.emory.edu
    worker_ips:
        - sulcus.bme.emory.edu

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: asedler
    ssh_private_key: ~/.ssh/id_rsa

# Leave this empty.
head_node: {}

# Leave this empty.
worker_nodes: {}

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# NOTE: Modified the following commands to use the tf2-gpu environment
# and to use specific ports that have been opened for this purpose
# by Andrew Sedler (asedler3@gatech.edu)

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ray start --redis-address=$RAY_HEAD_IP:6379 --node-manager-port=19580 --object-manager-port=39066
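For reference, the ports above are the ones I opened in the firewall. A quick way to check that they are actually reachable (run from the machine on the other side of each connection, while the cluster is up), using only the standard library, would be something like the sketch below; it only verifies that something is listening on each port, nothing Ray-specific:

import socket

# Ports pinned in the YAML above: redis and redis-shard live on the head node,
# node-manager and object-manager are pinned on both the head and the worker.
checks = [
    ("neuron.bme.emory.edu", 6379),   # --redis-port (head)
    ("neuron.bme.emory.edu", 59519),  # --redis-shard-ports (head)
    ("sulcus.bme.emory.edu", 19580),  # --node-manager-port (worker)
    ("sulcus.bme.emory.edu", 39066),  # --object-manager-port (worker)
]
for host, port in checks:
    try:
        socket.create_connection((host, port), timeout=5).close()
        print(host, port, "reachable")
    except OSError as err:
        print(host, port, "NOT reachable:", err)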

The test script is:

import time
from pprint import pprint

import ray

ray.init(address="localhost:6379")

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
pprint(set(ray.get([f.remote() for _ in range(1000)])))
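To see which nodes have actually registered with the cluster (rather than which nodes tasks happen to land on), I would also print the node table. Assuming ray.nodes() is still available in 0.9.0.dev0, something like:

import ray

ray.init(address="localhost:6379")

# One entry per node that has registered with the head; with the worker joined
# I would expect two entries here.
for node in ray.nodes():
    print(node.get("NodeManagerAddress"), node.get("Resources"))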

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't)
