[rllib] MARWIL tuned cartpole example (and my own experiments) produce nan rewards only.

### What is the problem? + Reproduction

I have a custom example that produces offline data and then trains MARWIL on it. I get `nan` reward values for that example every time, so I took a step back and ran your CartPole example:

https://github.com/ray-project/ray/blob/cd5a207d69cdaf05b47d956c18e89d928585eec7/rllib/tuned_examples/marwil/cartpole-marwil.yaml

I'm following the exact steps given there, i.e. I first run

```
./train.py --run=PPO --env=CartPole-v0 \
    --stop='{"timesteps_total": 50000}' \
    --config='{"output": "/tmp/out", "batch_mode": "complete_episodes"}'
```

followed by 

```
rllib train -f cartpole-marwil.yaml
```

I tried this on both my currently preferred stable version, `0.8.5`, and the `0.9.0.dev0` wheel. The result is this:

```
== Status ==
Memory usage on this node: 19.4/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/0 GPUs, 0.0/9.96 GiB heap, 0.0/3.42 GiB objects
Result logdir: /Users/maxpumperla/ray_results/cartpole-marwil
Number of trials: 2 (2 TERMINATED)
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+
| Trial name                     | status     | loc   |   beta |   iter |   total time (s) |     ts |   reward |
|--------------------------------+------------+-------+--------+--------+------------------+--------+----------|
| MARWIL_CartPole-v0_7af06_00000 | TERMINATED |       |      0 |   2206 |          58.5661 | 500007 |      nan |
| MARWIL_CartPole-v0_7af06_00001 | TERMINATED |       |      1 |   2248 |          58.6117 | 500286 |      nan |
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+
```
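To rule out the offline data itself as the culprit, something like the following could be used to sanity-check what the PPO run wrote to `/tmp/out`. This is only a minimal sketch: it assumes the output files are `*.json` with one JSON-encoded batch per line and that each batch has an uncompressed `rewards` list. The helper name `summarize_offline_rewards` is made up for illustration and is not part of RLlib.

```python
import glob
import json
import os


def summarize_offline_rewards(data_dir):
    """Return (num_batches, num_steps, total_reward) for all *.json files in data_dir.

    Assumption: every non-empty line is one JSON batch with a plain "rewards" list.
    """
    num_batches = 0
    num_steps = 0
    total_reward = 0.0
    for path in sorted(glob.glob(os.path.join(data_dir, "*.json"))):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                batch = json.loads(line)
                rewards = batch.get("rewards", [])
                if not isinstance(rewards, list):
                    # The column may be stored in another (e.g. compressed) form.
                    raise ValueError("unexpected 'rewards' format in " + path)
                num_batches += 1
                num_steps += len(rewards)
                total_reward += float(sum(rewards))
    return num_batches, num_steps, total_reward
```

Calling `summarize_offline_rewards("/tmp/out")` and seeing non-zero step and reward totals would suggest the recorded data is fine and the `nan` comes from how the reward metric is reported during MARWIL training.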

Also, I noticed that your MARWIL unit test is a pure smoke test and doesn't check reward values. I haven't run it locally, so it may well produce `nan` values too.
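As a rough idea of what a reward check in such a test could look like, here is a minimal sketch. It assumes the training result is a dict with an `episode_reward_mean` entry, as reported in the `reward` column above. The helper `assert_finite_reward` is hypothetical and not an existing RLlib utility.

```python
import math


def assert_finite_reward(result, key="episode_reward_mean"):
    """Raise AssertionError if result[key] is missing, nan, or infinite.

    Returns the reward value when it is a finite number.
    """
    value = result.get(key)
    if value is None or not math.isfinite(value):
        raise AssertionError("{} is not finite: {!r}".format(key, value))
    return value
```

In a test, this would be called on the dict returned by each training iteration, so that a `nan` reward fails the test instead of passing silently.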

In any case, I'd appreciate any input here, as we'd love to use MARWIL for our "real" use case, in which we see the same behaviour.
