Closed
Labels: question
Description
What is the problem? + Reproduction
I have a custom example that produces offline data and feeds it to MARWIL for training. The reported reward is nan on every run, so I took a step back and tried your CartPole example instead.
I'm following the exact steps there, i.e. first run
./train.py --run=PPO --env=CartPole-v0 \
--stop='{"timesteps_total": 50000}' \
--config='{"output": "/tmp/out", "batch_mode": "complete_episodes"}'
followed by
rllib train -f cartpole-marwil.yaml
I did this on both my currently preferred stable version, 0.8.5, and the 0.9.0.dev0 wheel. The result is this:
== Status ==
Memory usage on this node: 19.4/32.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/0 GPUs, 0.0/9.96 GiB heap, 0.0/3.42 GiB objects
Result logdir: /Users/maxpumperla/ray_results/cartpole-marwil
Number of trials: 2 (2 TERMINATED)
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+
| Trial name                     | status     | loc   |   beta |   iter |   total time (s) |     ts |   reward |
|--------------------------------+------------+-------+--------+--------+------------------+--------+----------|
| MARWIL_CartPole-v0_7af06_00000 | TERMINATED |       |      0 |   2206 |          58.5661 | 500007 |      nan |
| MARWIL_CartPole-v0_7af06_00001 | TERMINATED |       |      1 |   2248 |          58.6117 | 500286 |      nan |
+--------------------------------+------------+-------+--------+--------+------------------+--------+----------+
Also, I've noticed that your MARWIL unit test is a pure smoke test and doesn't check reward values, but I didn't run that locally. Maybe it produces nan values as well.
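To illustrate the kind of check I mean, a smoke test could be extended roughly as follows. This is only a sketch: `check_reward_is_finite` is a hypothetical helper, not part of RLlib, and it assumes the training result is a dict that carries the reward under `episode_reward_mean`.

```python
import math


def check_reward_is_finite(result, key="episode_reward_mean"):
    """Return the reward from a result dict, failing if it is missing, NaN, or infinite.

    `result` is assumed to be a plain dict such as the one returned by a
    training iteration; `key` is an assumption about where the reward lives.
    """
    reward = result.get(key)
    assert reward is not None, f"{key} missing from result"
    assert math.isfinite(reward), f"{key} is not finite: {reward}"
    return reward
```

Calling this on the result of the last training iteration would turn a silent nan into a test failure.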
In any case I'd appreciate any input here, as we'd love to use MARWIL for our "real" use case, in which we see the same behaviour.
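To rule out the offline data itself as the source of the nan values, the files under /tmp/out could be inspected with something like the sketch below. It rests on assumptions I have not verified against every version: that the output directory contains `*.json` files with one JSON object per line, and that each object stores its rewards as a plain list under a `rewards` key. `summarize_rewards` is a hypothetical helper name.

```python
import glob
import json
import math
import os


def summarize_rewards(output_dir):
    """Count recorded rewards and how many of them are NaN or infinite.

    Assumes JSON-lines files in `output_dir`, each line holding one batch
    with a list-valued "rewards" field (an assumption about the file layout).
    """
    batches = 0
    total = 0
    non_finite = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "*.json"))):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                rewards = json.loads(line).get("rewards")
                if not isinstance(rewards, list):
                    # Skip batches whose rewards are absent or stored in
                    # another form (for example a compressed string).
                    continue
                batches += 1
                total += len(rewards)
                non_finite += sum(1 for r in rewards if not math.isfinite(r))
    return {"batches": batches, "rewards": total, "non_finite": non_finite}


if __name__ == "__main__":
    print(summarize_rewards("/tmp/out"))
```

If `non_finite` comes back as 0 while `rewards` is large, the recorded data is probably fine and the nan in the status table comes from how the reward is reported during training rather than from the data.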