[rllib] MARWIL tuned cartpole example (and my own experiments) produce nan rewards only. #9402

@maxpumperla

What is the problem? + Reproduction

I have a custom example that produces offline data and picks it up with MARWIL for training. I get nan reward values for my example every time, so I took a step back and tried your tuned CartPole example:

https://github.com/ray-project/ray/blob/cd5a207d69cdaf05b47d956c18e89d928585eec7/rllib/tuned_examples/marwil/cartpole-marwil.yaml

I'm following the exact steps given there, i.e. I first run

./train.py --run=PPO --env=CartPole-v0 \
    --stop='{"timesteps_total": 50000}' \
    --config='{"output": "/tmp/out", "batch_mode": "complete_episodes"}'

followed by

rllib train -f cartpole-marwil.yaml

I did this on both my currently preferred stable version, 0.8.5, and the 0.9.0.dev0 wheel. The result is this:

== Status ==
Memory usage on this node: 19.4/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/0 GPUs, 0.0/9.96 GiB heap, 0.0/3.42 GiB objects
Result logdir: /Users/maxpumperla/ray_results/cartpole-marwil
Number of trials: 2 (2 TERMINATED)
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+
| Trial name                     | status     | loc   |   beta |   iter |   total time (s) |     ts |   reward |
|--------------------------------+------------+-------+--------+--------+------------------+--------+----------|
| MARWIL_CartPole-v0_7af06_00000 | TERMINATED |       |      0 |   2206 |          58.5661 | 500007 |      nan |
| MARWIL_CartPole-v0_7af06_00001 | TERMINATED |       |      1 |   2248 |          58.6117 | 500286 |      nan |
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+

Also, I noticed that your MARWIL unit test is a pure smoke test that doesn't check reward values. I haven't run it locally, but it may produce nan values as well.
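
For illustration, a reward check in such a test could look roughly like the sketch below. It assumes the training result is a plain dict that reports the mean reward under the key `episode_reward_mean`; the helper name `assert_finite_reward` and the two sample dicts are made up for this example and are not part of RLlib or taken from a real run.

```python
import math


def assert_finite_reward(result, key="episode_reward_mean"):
    """Fail if the reward metric is missing, None, NaN, or infinite.

    `result` is assumed to be a dict-like training result; `key` is the
    metric name assumed to hold the mean episode reward.
    """
    value = result.get(key)
    assert value is not None, f"{key!r} missing from result"
    assert math.isfinite(value), f"{key!r} is not finite: {value!r}"
    return value


# Hypothetical stand-in results, not real training output.
healthy = {"episode_reward_mean": 1.0, "training_iteration": 1}
broken = {"episode_reward_mean": float("nan"), "training_iteration": 1}

assert_finite_reward(healthy)  # passes silently

try:
    assert_finite_reward(broken)
except AssertionError as err:
    print("caught:", err)
```

A check like this would turn a silent nan in the reward column into a test failure instead of a passing smoke test.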

In any case, I'd appreciate any input here, as we'd love to use MARWIL for our "real" use case, in which we see the same behaviour.

Labels: question