Occasional KeyError when running evolution strategies on two machines. #1446

@robertnishihara

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 215d526
  • Python version: 3.6.2 Anaconda
  • Exact command to reproduce:
  1. On machine 1

    ray start --head --redis-port=6379 --num-workers=0
    
  2. On machine 2

    ray start --redis-address <head-node-ip>:6379 --num-workers=0
    
  3. On machine 1

    cd ray/python/ray/rllib
    python train.py --run=ES --env=CartPole-v0 --redis-address=<head-node-ip>:6379
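
Because the failure is intermittent, it can help to repeat step 3 several times and count the failures. The following is a minimal sketch, not part of Ray: `count_failures` and its parameters are hypothetical names. It assumes that a failed run exits with a non-zero status (the final `TuneError` below is uncaught) and that a run still alive after `timeout_s` seconds got past agent initialization. Killing the driver on timeout may leave actors behind on the cluster, so restarting Ray between runs may be necessary.

```python
import subprocess
import sys


def count_failures(cmd, runs, timeout_s):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    Assumption: a run that is still alive after `timeout_s` seconds is
    treated as having started successfully and is not counted as a failure.
    """
    failures = 0
    for _ in range(runs):
        try:
            completed = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; count it as a pass.
            continue
        if completed.returncode != 0:
            failures += 1
    return failures
```

For example, from `ray/python/ray/rllib` on machine 1, with the placeholder address filled in: `count_failures([sys.executable, "train.py", "--run=ES", "--env=CartPole-v0", "--redis-address=<head-node-ip>:6379"], runs=10, timeout_s=120)`.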
    

About half the time, step 3 fails with the following output:

$ python train.py --run=ES --env=CartPole-v0 --redis-address=172.31.5.255:6379
/home/ubuntu/anaconda3/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
== Status ==
Using FIFO scheduling algorithm.
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	PENDING

Unified logger created with logdir '/home/ubuntu/ray_results/default/ES_CartPole-v0_0_2018-01-19_01-37-30wdhanz66'
== Status ==
Using FIFO scheduling algorithm.
Resources used: 1/8 CPUs, 0/0 GPUs
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	RUNNING

Remote function __init__ failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 127, in __init__
    self._init()
  File "/home/ubuntu/ray3/python/ray/rllib/es/es.py", line 157, in _init
    noise_id = create_shared_noise.remote()
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2509, in func_call
    objectids = _submit_task(function_id, args)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2364, in _submit_task
    return worker.submit_task(function_id, args)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 543, in submit_task
    self.task_driver_id.id()][function_id.id()]
KeyError: b'Z`\xd9\xd5?/\x88\x04>\xa4Xph\xb9\xe3\xca\xf4\xa1\x1b\x13'


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="172.31.5.255:6379")
  
Remote function train failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 145, in train
    "Agent initialization failed, see previous errors")
ValueError: Agent initialization failed, see previous errors


  You can inspect errors by running

      ray.error_info()

  If this driver is hanging, start a new one with

      ray.init(redis_address="172.31.5.255:6379")
  
Error processing event: Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/tune/trial_runner.py", line 162, in _process_events
    result = ray.get(result_id)
  File "/home/ubuntu/ray3/python/ray/worker.py", line 2240, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(a87f1adc2ec2e19f0199e246b9f733c6ea16750c). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/ubuntu/ray3/python/ray/worker.py", line 771, in _process_task
    *arguments)
  File "/home/ubuntu/ray3/python/ray/actor.py", line 196, in actor_method_executor
    return method(actor, *args)
  File "/home/ubuntu/ray3/python/ray/rllib/agent.py", line 145, in train
    "Agent initialization failed, see previous errors")
ValueError: Agent initialization failed, see previous errors


Stopping ES_CartPole-v0_0 Actor timed out, but moving on...
== Status ==
Using FIFO scheduling algorithm.
Resources used: 0/8 CPUs, 0/0 GPUs
Result logdir: /home/ubuntu/ray_results/default
 - ES_CartPole-v0_0:	ERROR

Traceback (most recent call last):
  File "train.py", line 82, in <module>
    num_cpus=args.num_cpus, num_gpus=args.num_gpus)
  File "/home/ubuntu/ray3/python/ray/tune/tune.py", line 82, in run_experiments
    raise TuneError("Trial did not complete", trial)
ray.tune.error.TuneError: ('Trial did not complete', <ray.tune.trial.Trial object at 0x7f30baab6c18>)
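
The last frame of the first traceback is a nested lookup keyed by driver ID and then function ID, and one of the two keys is missing. One possible explanation, which is an assumption and not confirmed by the log, is that the entry has not yet been registered on the worker when the actor submits `create_shared_noise`. The sketch below reproduces that shape with a plain dict and shows a polling wait as one possible mitigation. `lookup`, `wait_for_entry`, and `table` are hypothetical names, not Ray internals.

```python
import threading
import time


def lookup(table, driver_id, function_id):
    """Nested lookup with the same shape as the failing line in submit_task."""
    return table[driver_id][function_id]


def wait_for_entry(table, driver_id, function_id, timeout_s=10.0, poll_s=0.01):
    """Poll until the nested entry exists; re-raise KeyError after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return lookup(table, driver_id, function_id)
        except KeyError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll_s)


# Demonstration with made-up IDs: the entry is registered shortly after the
# first lookup, standing in for a registration that arrives late.
table = {}
driver_id, function_id = b"driver", b"function"


def register():
    table.setdefault(driver_id, {})[function_id] = "properties"


try:
    lookup(table, driver_id, function_id)
except KeyError as err:
    print("immediate lookup failed with KeyError:", err)

threading.Timer(0.05, register).start()
print("after waiting:", wait_for_entry(table, driver_id, function_id, timeout_s=2.0))
```

If the missing key is caused by late registration, a bounded wait like this would turn the intermittent `KeyError` into a short delay; if the entry never arrives, the `KeyError` still surfaces after the timeout.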

Labels: bug (something that is supposed to be working, but isn't)