Occasional KeyError when running evolution strategies on two machines. #1446

### System information
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Linux Ubuntu 16.04
- **Ray installed from (source or binary)**: source
- **Ray version**: 215d526e0d605e2f090da3c7b1ec66c990bec89c
- **Python version**: 3.6.2 Anaconda
- **Exact command to reproduce**:

1. On machine 1

    ```
    ray start --head --redis-port=6379 --num-workers=0
    ```

2. On machine 2

    ```
    ray start --redis-address <head-node-ip>:6379 --num-workers=0
    ```

3. On machine 1

    ```
    cd ray/python/ray/rllib
    python train.py --run=ES --env=CartPole-v0 --redis-address=<head-node-ip>:6379
    ```

About half of the time, step 3 fails with the following output:

```
$ python train.py --run=ES --env=CartPole-v0 --redis-address=172.31.5.255:6379
/home/ubuntu/anaconda3/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
== Status ==
Using FIFO scheduling algorithm.
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	PENDING

Unified logger created with logdir '/home/ubuntu/ray_results/default/ES_CartPole-v0_0_2018-01-19_01-37-30wdhanz66'
== Status ==
Using FIFO scheduling algorithm.
Resources used: 1/8 CPUs, 0/0 GPUs
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	RUNNING

Remote function __init__ failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 127, in __init__
    self._init()
  File "/home/ubuntu/ray3/python/ray/rllib/es/es.py", line 157, in _init
    noise_id = create_shared_noise.remote()
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2509, in func_call
    objectids = _submit_task(function_id, args)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2364, in _submit_task
    return worker.submit_task(function_id, args)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 543, in submit_task
    self.task_driver_id.id()][function_id.id()]
KeyError: b'Z`\xd9\xd5?/\x88\x04>\xa4Xph\xb9\xe3\xca\xf4\xa1\x1b\x13'


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="172.31.5.255:6379")
  
Remote function train failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 145, in train
    "Agent initialization failed, see previous errors")
ValueError: Agent initialization failed, see previous errors


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="172.31.5.255:6379")
  
Error processing event: Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/tune/trial_runner.py", line 162, in _process_events
    result = ray.get(result_id)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2240, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(a87f1adc2ec2e19f0199e246b9f733c6ea16750c). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 145, in train
    "Agent initialization failed, see previous errors")
ValueError: Agent initialization failed, see previous errors


Stopping ES_CartPole-v0_0 Actor timed out, but moving on...
== Status ==
Using FIFO scheduling algorithm.
Resources used: 0/8 CPUs, 0/0 GPUs
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	ERROR

Traceback (most recent call last):
  File "train.py", line 82, in <module>
    num_cpus=args.num_cpus, num_gpus=args.num_gpus)
  File "/home/ubuntu/ray3/python/ray/tune/tune.py", line 82, in run_experiments
    raise TuneError("Trial did not complete", trial)
ray.tune.error.TuneError: ('Trial did not complete', <ray.tune.trial.Trial object at 0x7f30baab6c18>)
```
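Since the failure is intermittent, it may help to measure how often it occurs rather than estimate it. The following is a minimal sketch, not part of Ray: `count_failures` is a hypothetical helper that reruns a command and counts non-zero exits. It assumes that `train.py` exits non-zero when the trial fails (consistent with the uncaught `TuneError` above) and, as a further assumption, treats a run that is still alive after `timeout` seconds as having passed initialization.

```python
import subprocess
import sys


def count_failures(cmd, runs, timeout=None):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    Hypothetical helper for illustration only. If `timeout` (seconds) is
    given, a run that is still going when it expires is killed and counted
    as NOT failed, on the assumption that the KeyError shows up early.
    """
    failures = 0
    for _ in range(runs):
        try:
            # check=False: a failing run is data here, not an exception.
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # The child was killed after `timeout` seconds without exiting.
            continue
        if result.returncode != 0:
            failures += 1
    return failures


# Example usage (run from ray/python/ray/rllib on machine 1, with the
# placeholder replaced by the real head-node address):
#
#   cmd = [sys.executable, "train.py", "--run=ES", "--env=CartPole-v0",
#          "--redis-address=<head-node-ip>:6379"]
#   print(count_failures(cmd, runs=10, timeout=120), "of 10 runs failed")
```

The 120-second timeout in the usage comment is an arbitrary illustrative value; it should be long enough for agent initialization to finish on the cluster in question.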
