By: Evan Dong (zdong6), Hwai-Liang Tung (htung1)
Introduction:
For this project, we reimplemented the paper “Deep Reinforcement Learning with Double Q-learning” (https://arxiv.org/abs/1509.06461). The “RL conclusion” lecture covered this briefly as well. Traditional deep Q-learning can overestimate Q-values because the same network both selects and evaluates the next action inside the max operator, which can lead to suboptimal policies. This paper shows that double Q-learning techniques can be applied to deep Q-learning to reduce this overestimation, achieving (at the time) state-of-the-art performance.
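The core change is small: DQN uses the target network to both pick and score the next action, while double DQN picks the action with the online network and scores it with the target network. A minimal sketch of the two target computations (function names and the NumPy formulation are ours, and terminal-state masking is omitted for brevity):

```python
import numpy as np

def dqn_target(rewards, next_q_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates
    # the next action, which tends to overestimate Q-values.
    return rewards + gamma * next_q_target.max(axis=1)

def ddqn_target(rewards, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_actions = next_q_online.argmax(axis=1)
    return rewards + gamma * next_q_target[np.arange(len(rewards)), best_actions]
```

Note that for any batch, the double-DQN target is never larger than the DQN target, since the target network's value of the online network's chosen action cannot exceed its own maximum.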
Methodology:
We implemented this architecture with TensorFlow and OpenAI Gym in Python, using the Atari/Arcade Learning Environment. We used this environment to train a standard deep Q-learning network as well as our deep double Q-learning network. Within TensorFlow, we built the models primarily from Keras layers, wrote our own DQN/DDQN model classes, and implemented a separate class for experience replay. We trained and tested both algorithms on Space Invaders and Video Pinball.
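Our experience replay class stores transitions in a fixed-size buffer and samples uniform random minibatches for training. A minimal sketch of the idea (the class and method names here are illustrative, not our exact implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample; zip(*...) regroups the batch
        # into per-field tuples (states, actions, rewards, ...).
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from a large buffer breaks the correlation between consecutive frames, which stabilizes Q-learning updates.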
Results:
We found it difficult to draw firm conclusions. Performance was variable, as RL training is often finicky. Moreover, we had limited time for training and testing, since we ran everything on personal computers without GPUs, so we did not perform statistical tests or analyses across a large number of trials. Nonetheless, we were able to generate a few visualizations and draw tentative inferences from a few runs of 100 episodes each.
Space Invaders:
As expected, we did not see obvious improvements in DDQN over DQN, except when one or both algorithms collapsed and became stuck at zero reward. This fits the results from van Hasselt et al., who report highly similar results between DQN and DDQN on this game given the same hyperparameters (tuned for DQN).
Video Pinball:
We found clear differences in results for Video Pinball, however. While performance remained highly erratic between episodes, DDQN visibly outperformed the original DQN algorithm. Admittedly, it's difficult to say exactly how large this improvement is; it is not nearly as large as the improvement of up to +2500% reported by van Hasselt et al., though their results were obtained under different circumstances (evaluation with human starts).
Challenges:
Figuring out how to set up and work with the Atari Arcade environment was surprisingly difficult. Unlike the rest of the programming we've done in this class, we had to figure out how to install the packages ourselves and make sure they were used correctly in the environment. Debugging the models and getting them to train correctly was also harder than expected. Because we didn't actually implement Q-learning or deep Q-learning in this class, translating our understanding of the algorithm into code took some self-teaching and digging into unfamiliar error messages. We also didn't have suggested hyperparameters and architectures for our networks like we did for homework, so tuning the networks and getting meaningful results was very time consuming.
Reflection:
The project was ultimately only partially successful. We did meet our base and target goals in some capacity, in the sense that we tested both algorithms on the games we wanted to test. However, we did not see as large or as clear an increase in performance from deep double Q-learning as we expected, and the improvement was difficult to measure exactly. Moreover, rewards continued to vary widely between episodes; results were unstable. Our approach did not change much, aside from scrapping the stretch goal of testing resilience to different starting points, as we were unsure how to implement it. We would also like to observe the results of applying our models to a wider variety of Atari games. With more time, we would try to better fine-tune the hyperparameters in DQN/DDQN, and potentially experiment with more complex architectures, such as applying convolution to the input images of game states. We'd also like to learn about and compare DDQN to policy gradient networks. Overall, we learned how to better plan and put together deep learning architectures for a given problem without a convenient framework or stencil.
Write-up Google Doc In Case of Problems
https://docs.google.com/document/d/1WFtJ411MerzSxdpDdZYOLPwCXstwphStE8M3ZhY-pmA/edit?usp=sharing
Second Checkpoint
https://docs.google.com/document/d/1QntkEdnmzKsucg63eOfAFyGSyEZzgNXvJGv1qZnkB4Q/edit?usp=sharing
Project Proposal
https://docs.google.com/document/d/1aG02ah8SOfid_QH0PCNMb_YkaX9IWY6Jh82psm17mbA/edit?usp=sharing
Built With
- python
- tensorflow
