
Issues for reproducing DDPG in 0.8.0.dev1 #4972

@wsjeon

Description


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Mojave
  • Ray installed from (source or binary): binary
  • Ray version: 0.8.0.dev1
  • Python version: 3.6.7
  • Exact command to reproduce:
$ rllib train -f tuned_examples/pendulum-ddpg.yaml
$ rllib train -f tuned_examples/mountaincarcontinuous-ddpg.yaml

Describe the problem

Hi. I'm having trouble reproducing DDPG on simple continuous control tasks. I believe the problem is caused by this line, which no longer appears to be supported in 0.8.0.dev1.
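For context, here is a minimal sketch of the merge behavior that appears to trigger the error below. The helper is a simplified stand-in modeled on `ray.tune.util.deep_update` from the traceback, not RLlib's actual implementation: when the tuned-example YAML still contains a key (here `optimizer_class`) that the trainer's default config no longer defines, the merge raises.

```python
# Simplified stand-in for ray.tune.util.deep_update: merging a user config
# into the trainer defaults raises when the user dict contains a key the
# defaults no longer define.
def deep_update(original, new_dict, new_keys_allowed=False):
    for k, v in new_dict.items():
        if k not in original and not new_keys_allowed:
            raise Exception("Unknown config parameter `{}` ".format(k))
        if isinstance(original.get(k), dict) and isinstance(v, dict):
            # Recurse into nested config sections (e.g. "optimizer").
            deep_update(original[k], v, new_keys_allowed)
        else:
            original[k] = v
    return original

# Hypothetical excerpt of the DDPG defaults vs. the tuned-example config.
defaults = {"actor_lr": 0.001, "optimizer": {"debug": False}}
user_config = {"actor_lr": 0.0005, "optimizer_class": "SyncReplayOptimizer"}

try:
    deep_update(defaults, user_config)
except Exception as e:
    print(e)  # Unknown config parameter `optimizer_class`
```

This matches the `Exception: Unknown config parameter 'optimizer_class'` in the logs: the key seems to have been removed from the DDPG defaults, but the shipped tuned examples still set it.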

Source code / logs

WARNING: Logging before flag parsing goes to stderr.
W0613 00:16:11.083568 4554786240 deprecation.py:323] From /anaconda3/envs/marl-rllib/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:61: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
{'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
{'optimizer': {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}, 'n_step': 3, 'num_gpus': 1, 'num_workers': 32, 'buffer_size': 2000000, 'learning_starts': 50000, 'train_batch_size': 512, 'sample_batch_size': 50, 'target_network_update_freq': 500000, 'timesteps_per_iteration': 25000, 'per_worker_exploration': True, 'worker_side_prioritization': True, 'min_iter_time_s': 30}
dict_items([('optimizer', {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}), ('n_step', 3), ('num_gpus', 1), ('num_workers', 32), ('buffer_size', 2000000), ('learning_starts', 50000), ('train_batch_size', 512), ('sample_batch_size', 50), ('target_network_update_freq', 500000), ('timesteps_per_iteration', 25000), ('per_worker_exploration', True), ('worker_side_prioritization', True), ('min_iter_time_s', 30)])
{'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
{'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
{'optimizer': {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}, 'n_step': 3, 'num_gpus': 0, 'num_workers': 32, 'buffer_size': 2000000, 'learning_starts': 50000, 'train_batch_size': 512, 'sample_batch_size': 50, 'target_network_update_freq': 500000, 'timesteps_per_iteration': 25000, 'per_worker_exploration': True, 'worker_side_prioritization': True, 'min_iter_time_s': 30}
dict_items([('optimizer', {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}), ('n_step', 3), ('num_gpus', 0), ('num_workers', 32), ('buffer_size', 2000000), ('learning_starts', 50000), ('train_batch_size', 512), ('sample_batch_size', 50), ('target_network_update_freq', 500000), ('timesteps_per_iteration', 25000), ('per_worker_exploration', True), ('worker_side_prioritization', True), ('min_iter_time_s', 30)])
{'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
{'twin_q': True, 'policy_delay': 2, 'smooth_target_policy': True, 'target_noise': 0.2, 'target_noise_clip': 0.5, 'exploration_should_anneal': False, 'exploration_noise_type': 'gaussian', 'exploration_gaussian_sigma': 0.1, 'learning_starts': 10000, 'pure_exploration_steps': 10000, 'actor_hiddens': [400, 300], 'critic_hiddens': [400, 300], 'n_step': 1, 'gamma': 0.99, 'actor_lr': 0.001, 'critic_lr': 0.001, 'l2_reg': 0.0, 'tau': 0.005, 'train_batch_size': 100, 'use_huber': False, 'target_network_update_freq': 0, 'num_workers': 0, 'num_gpus_per_worker': 0, 'per_worker_exploration': False, 'worker_side_prioritization': False, 'buffer_size': 1000000, 'prioritized_replay': False, 'clip_rewards': False, 'use_state_preprocessor': False}
dict_items([('twin_q', True), ('policy_delay', 2), ('smooth_target_policy', True), ('target_noise', 0.2), ('target_noise_clip', 0.5), ('exploration_should_anneal', False), ('exploration_noise_type', 'gaussian'), ('exploration_gaussian_sigma', 0.1), ('learning_starts', 10000), ('pure_exploration_steps', 10000), ('actor_hiddens', [400, 300]), ('critic_hiddens', [400, 300]), ('n_step', 1), ('gamma', 0.99), ('actor_lr', 0.001), ('critic_lr', 0.001), ('l2_reg', 0.0), ('tau', 0.005), ('train_batch_size', 100), ('use_huber', False), ('target_network_update_freq', 0), ('num_workers', 0), ('num_gpus_per_worker', 0), ('per_worker_exploration', False), ('worker_side_prioritization', False), ('buffer_size', 1000000), ('prioritized_replay', False), ('clip_rewards', False), ('use_state_preprocessor', False)])
{'sample_batch_size': 20, 'min_iter_time_s': 10, 'sample_async': False}
dict_items([('sample_batch_size', 20), ('min_iter_time_s', 10), ('sample_async', False)])
/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/train.py:100: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  experiments = yaml.load(f)
2019-06-13 00:16:11,618	WARNING worker.py:1340 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-06-13 00:16:11,620	INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-06-13_00-16-11_619077_10206/logs.
2019-06-13 00:16:11,728	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:48724 to respond...
2019-06-13 00:16:11,842	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:33113 to respond...
2019-06-13 00:16:11,845	INFO services.py:806 -- Starting Redis shard with 3.44 GB max memory.
2019-06-13 00:16:11,859	INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-06-13_00-16-11_619077_10206/logs.
2019-06-13 00:16:11,860	INFO services.py:1442 -- Starting the Plasma object store with 5.15 GB memory using /tmp.
2019-06-13 00:16:12,412	INFO tune.py:61 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-06-13 00:16:12,413	INFO tune.py:232 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/0 GPUs
Memory usage on this node: 11.4/17.2 GB

2019-06-13 00:16:12,449	WARNING signature.py:108 -- The function with_updates has a **kwargs argument, which is currently not supported.
W0613 00:16:12.453155 4554786240 deprecation_wrapper.py:119] From /anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/logger.py:136: The name tf.VERSION is deprecated. Please use tf.version.VERSION instead.

W0613 00:16:12.453705 4554786240 deprecation_wrapper.py:119] From /anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/logger.py:141: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 1/12 CPUs, 0/0 GPUs
Memory usage on this node: 11.4/17.2 GB
Result logdir: /Users/wsjeon/ray_results/pendulum-ddpg
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - DDPG_Pendulum-v0_0:	RUNNING

(pid=10223) WARNING: Logging before flag parsing goes to stderr.
(pid=10223) W0613 00:16:13.760989 4569867712 deprecation.py:323] From /anaconda3/envs/marl-rllib/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:61: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=10223) Instructions for updating:
(pid=10223) non-resource variables are not supported in the long term
(pid=10223) {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
(pid=10223) dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
(pid=10223) {'optimizer': {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}, 'n_step': 3, 'num_gpus': 1, 'num_workers': 32, 'buffer_size': 2000000, 'learning_starts': 50000, 'train_batch_size': 512, 'sample_batch_size': 50, 'target_network_update_freq': 500000, 'timesteps_per_iteration': 25000, 'per_worker_exploration': True, 'worker_side_prioritization': True, 'min_iter_time_s': 30}
(pid=10223) dict_items([('optimizer', {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}), ('n_step', 3), ('num_gpus', 1), ('num_workers', 32), ('buffer_size', 2000000), ('learning_starts', 50000), ('train_batch_size', 512), ('sample_batch_size', 50), ('target_network_update_freq', 500000), ('timesteps_per_iteration', 25000), ('per_worker_exploration', True), ('worker_side_prioritization', True), ('min_iter_time_s', 30)])
(pid=10223) {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
(pid=10223) dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
(pid=10223) {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
(pid=10223) dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
(pid=10223) {'optimizer': {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}, 'n_step': 3, 'num_gpus': 0, 'num_workers': 32, 'buffer_size': 2000000, 'learning_starts': 50000, 'train_batch_size': 512, 'sample_batch_size': 50, 'target_network_update_freq': 500000, 'timesteps_per_iteration': 25000, 'per_worker_exploration': True, 'worker_side_prioritization': True, 'min_iter_time_s': 30}
(pid=10223) dict_items([('optimizer', {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}), ('n_step', 3), ('num_gpus', 0), ('num_workers', 32), ('buffer_size', 2000000), ('learning_starts', 50000), ('train_batch_size', 512), ('sample_batch_size', 50), ('target_network_update_freq', 500000), ('timesteps_per_iteration', 25000), ('per_worker_exploration', True), ('worker_side_prioritization', True), ('min_iter_time_s', 30)])
(pid=10223) {'max_weight_sync_delay': 400, 'num_replay_buffer_shards': 4, 'debug': False}
(pid=10223) dict_items([('max_weight_sync_delay', 400), ('num_replay_buffer_shards', 4), ('debug', False)])
(pid=10223) {'twin_q': True, 'policy_delay': 2, 'smooth_target_policy': True, 'target_noise': 0.2, 'target_noise_clip': 0.5, 'exploration_should_anneal': False, 'exploration_noise_type': 'gaussian', 'exploration_gaussian_sigma': 0.1, 'learning_starts': 10000, 'pure_exploration_steps': 10000, 'actor_hiddens': [400, 300], 'critic_hiddens': [400, 300], 'n_step': 1, 'gamma': 0.99, 'actor_lr': 0.001, 'critic_lr': 0.001, 'l2_reg': 0.0, 'tau': 0.005, 'train_batch_size': 100, 'use_huber': False, 'target_network_update_freq': 0, 'num_workers': 0, 'num_gpus_per_worker': 0, 'per_worker_exploration': False, 'worker_side_prioritization': False, 'buffer_size': 1000000, 'prioritized_replay': False, 'clip_rewards': False, 'use_state_preprocessor': False}
(pid=10223) dict_items([('twin_q', True), ('policy_delay', 2), ('smooth_target_policy', True), ('target_noise', 0.2), ('target_noise_clip', 0.5), ('exploration_should_anneal', False), ('exploration_noise_type', 'gaussian'), ('exploration_gaussian_sigma', 0.1), ('learning_starts', 10000), ('pure_exploration_steps', 10000), ('actor_hiddens', [400, 300]), ('critic_hiddens', [400, 300]), ('n_step', 1), ('gamma', 0.99), ('actor_lr', 0.001), ('critic_lr', 0.001), ('l2_reg', 0.0), ('tau', 0.005), ('train_batch_size', 100), ('use_huber', False), ('target_network_update_freq', 0), ('num_workers', 0), ('num_gpus_per_worker', 0), ('per_worker_exploration', False), ('worker_side_prioritization', False), ('buffer_size', 1000000), ('prioritized_replay', False), ('clip_rewards', False), ('use_state_preprocessor', False)])
(pid=10223) {'sample_batch_size': 20, 'min_iter_time_s': 10, 'sample_async': False}
(pid=10223) dict_items([('sample_batch_size', 20), ('min_iter_time_s', 10), ('sample_async', False)])
2019-06-13 00:16:14,056	ERROR trial_runner.py:487 -- Error processing event.
Traceback (most recent call last):
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 436, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 323, in fetch_result
    result = ray.get(trial_future[0])
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/worker.py", line 2198, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=10223, host=wsjeonMCBOOKPRO)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 87, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 323, in __init__
    Trainable.__init__(self, config, logger_creator)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/trainable.py", line 87, in __init__
    self._setup(copy.deepcopy(self.config))
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 424, in _setup
    self._allow_unknown_subkeys)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/util.py", line 93, in deep_update
    raise Exception("Unknown config parameter `{}` ".format(k))
Exception: Unknown config parameter `optimizer_class`

2019-06-13 00:16:14,060	INFO ray_trial_executor.py:187 -- Destroying actor for trial DDPG_Pendulum-v0_0. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/0 GPUs
Memory usage on this node: 11.3/17.2 GB
Result logdir: /Users/wsjeon/ray_results/pendulum-ddpg
Number of trials: 1 ({'ERROR': 1})
ERROR trials:
 - DDPG_Pendulum-v0_0:	ERROR, 1 failures: /Users/wsjeon/ray_results/pendulum-ddpg/DDPG_Pendulum-v0_0_2019-06-13_00-16-12xifin2zq/error_2019-06-13_00-16-14.txt

Traceback (most recent call last):
  File "/anaconda3/envs/marl-rllib/bin/rllib", line 10, in <module>
    sys.exit(cli())
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/scripts.py", line 38, in cli
    train.run(options, train_parser)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/rllib/train.py", line 147, in run
    resume=args.resume)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/tune.py", line 330, in run_experiments
    raise_on_failed_trial=raise_on_failed_trial)
  File "/anaconda3/envs/marl-rllib/lib/python3.6/site-packages/ray/tune/tune.py", line 272, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [DDPG_Pendulum-v0_0])
(pid=10223) {'actor_hiddens': [64, 64], 'critic_hiddens': [64, 64], 'n_step': 1, 'model': {}, 'gamma': 0.99, 'env_config': {}, 'exploration_should_anneal': True, 'schedule_max_timesteps': 100000, 'timesteps_per_iteration': 600, 'exploration_fraction': 0.1, 'exploration_final_scale': 0.02, 'exploration_ou_noise_scale': 0.1, 'exploration_ou_theta': 0.15, 'exploration_ou_sigma': 0.2, 'target_network_update_freq': 0, 'tau': 0.001, 'buffer_size': 10000, 'prioritized_replay': True, 'prioritized_replay_alpha': 0.6, 'prioritized_replay_beta': 0.4, 'prioritized_replay_eps': 1e-06, 'clip_rewards': False, 'actor_lr': 0.001, 'critic_lr': 0.001, 'use_huber': True, 'huber_threshold': 1.0, 'l2_reg': 1e-06, 'learning_starts': 500, 'sample_batch_size': 1, 'train_batch_size': 64, 'num_workers': 0, 'num_gpus_per_worker': 0, 'optimizer_class': 'SyncReplayOptimizer', 'per_worker_exploration': False, 'worker_side_prioritization': False, 'evaluation_interval': 5, 'evaluation_num_episodes': 10, 'env': 'Pendulum-v0'}
(pid=10223) dict_items([('actor_hiddens', [64, 64]), ('critic_hiddens', [64, 64]), ('n_step', 1), ('model', {}), ('gamma', 0.99), ('env_config', {}), ('exploration_should_anneal', True), ('schedule_max_timesteps', 100000), ('timesteps_per_iteration', 600), ('exploration_fraction', 0.1), ('exploration_final_scale', 0.02), ('exploration_ou_noise_scale', 0.1), ('exploration_ou_theta', 0.15), ('exploration_ou_sigma', 0.2), ('target_network_update_freq', 0), ('tau', 0.001), ('buffer_size', 10000), ('prioritized_replay', True), ('prioritized_replay_alpha', 0.6), ('prioritized_replay_beta', 0.4), ('prioritized_replay_eps', 1e-06), ('clip_rewards', False), ('actor_lr', 0.001), ('critic_lr', 0.001), ('use_huber', True), ('huber_threshold', 1.0), ('l2_reg', 1e-06), ('learning_starts', 500), ('sample_batch_size', 1), ('train_batch_size', 64), ('num_workers', 0), ('num_gpus_per_worker', 0), ('optimizer_class', 'SyncReplayOptimizer'), ('per_worker_exploration', False), ('worker_side_prioritization', False), ('evaluation_interval', 5), ('evaluation_num_episodes', 10), ('env', 'Pendulum-v0')])
(pid=10223) {}
(pid=10223) dict_items([])
(pid=10223) {}
(pid=10223) dict_items([])
