## Who
- Mohammad Amoush (mamoush)
- Aryaman Dutta (adutta3)
- Sam Wilkins (swilkin5)
## README
The markdown that follows is our original project post. Please refer to the final reflection for updated analysis.
## Introduction
We'd like to delve into deep reinforcement learning by implementing a model that can play comparably to an intermediate-level human in the four-player, predominantly Middle-Eastern card game Trex. The game is relatively straightforward for a human to learn, but it has a specific and comparatively complex structure containing a fixed number of sub-games: there are 4 Kingdoms, each consisting of the same 5 contracts, and, glossing over the nature of the Trex contract, each contract is effectively 13 tricks. Each contract requires its own strategy to play consistently well. We find this a compelling challenge to tackle because at any given time (particularly early in a round, or at any point if one doesn't count cards) the game state is one of imperfect information. It will be interesting to see how, and to what extent, a model can cope with the combination of, at varying times, both very constrained choices and more arbitrary actions ('hedging its bets'). Finally, the fact that the model must play to benefit both itself and its partner, whose hand is unknown but can be the subject of inference over time, adds another layer to this goal.
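To make the fixed structure above concrete, here is a minimal sketch in Python. The constant names are ours (hypothetical, not taken from our engine); only the counts — 4 Kingdoms, 5 contracts each, 13 tricks per contract — come from the game itself.

```python
# Hypothetical constants sketching Trex's fixed sub-game structure.
KINGDOMS = 4
CONTRACTS = ["King of Hearts", "Queens", "Diamonds", "Slaps", "Trex"]
TRICKS_PER_CONTRACT = 13  # glossing over the Trex contract's different mechanics

# A full game is a fixed number of independent sub-games (contracts):
total_contracts = KINGDOMS * len(CONTRACTS)  # 20 contracts per game
```

The independence of the 20 contracts is what later lets us scope the project down to a single contract if training the full game proves too expensive.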
## Related Work
- Python ISMCTS
- DeepStack
- Deep Reinforcement Learning from Self-Play in Imperfect-Information Games
- Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information
- Playing Cards with Reinforcement Learning
The third entry above, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games, proposes an algorithm dubbed Neural Fictitious Self-Play (NFSP) to combine the existing technique of fictitious self-play with the capabilities of neural network function approximation. In the pseudocode provided, "all players of the game are controlled by separate NFSP agents that learn from simultaneous play against each other, i.e. self-play." Further, "an NFSP agent interacts with its fellow agents and memorizes its experience of game transitions and its own best response behavior in two memories": one stores transitions used to train a neural network with off-policy reinforcement learning to determine the best-response strategy β; the other stores the agent's own best-response behavior, used to train a supervised classification network that learns the average strategy π. In short, the agent's eventual play is a mixture of these two strategies. The publication indicates that such NFSP agents are much less exploitable than DQN agents playing either their greedy or their average strategies.
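The two-memory design is the heart of NFSP, so a small sketch may help. This is our own illustrative skeleton, not code from the paper: the class and method names are hypothetical, and the networks that would consume each memory are omitted. Following the paper, the RL memory is a sliding window of transitions for off-policy learning of β, while the SL memory is a reservoir sample of the agent's own best-response actions for learning π.

```python
import random
from collections import deque

class NFSPAgent:
    """Sketch of NFSP's two-memory design (names are ours, not the paper's).

    rl_memory: sliding window of (s, a, r, s') transitions, consumed by an
               off-policy RL learner to fit the best-response strategy beta.
    sl_memory: reservoir sample of (s, a) pairs recorded only when the agent
               acted greedily, consumed by a supervised classifier to fit
               the average strategy pi.
    """

    def __init__(self, rl_capacity=10_000, sl_capacity=10_000, eta=0.1):
        self.rl_memory = deque(maxlen=rl_capacity)  # old transitions fall off
        self.sl_memory = []                         # reservoir, uniform sample
        self.sl_capacity = sl_capacity
        self.seen = 0       # count of best-response actions observed so far
        self.eta = eta      # anticipatory parameter: prob. of playing beta

    def observe(self, state, action, reward, next_state, acted_greedily):
        self.rl_memory.append((state, action, reward, next_state))
        if acted_greedily:
            self._reservoir_add((state, action))

    def _reservoir_add(self, item):
        # Algorithm R: keeps a uniform sample of best-response behavior
        self.seen += 1
        if len(self.sl_memory) < self.sl_capacity:
            self.sl_memory.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.sl_capacity:
                self.sl_memory[j] = item
```

The reservoir (rather than a sliding window) for the SL memory matters: the average strategy π is supposed to reflect the agent's behavior over all of training, not just its recent play.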
## Data and Methodology
We estimate that the amount of recorded game data required for supervised learning is beyond the scope of this project, so we intend to stick to a reinforcement learning pipeline built on self-play. Ideally, we will take a gradual approach. We have already implemented naive players that can play full games, engaging with learnable behaviors simply by taking random actions. Next, we will implement conventional AI agents (Minimax and Q-learning are candidates) that can routinely outperform the initial set of naive players. Finally, having designed our deep reinforcement learning model (DQN, REINFORCE, and NFSP are candidates), we will begin by repeatedly training a single model against three AI agents. Depending on how well this goes, and on the compute resources ultimately required, the process may culminate in training 1 ≤ n ≤ 4 reinforcement learning models (each with slightly different hyperparameterizations) against (4 − n) AI agents. If any of these steps proves disproportionate to the scope of the project, we may restrict the model to a single contract (King of Hearts, Slaps, Queens, Diamonds, or Trex), as, conveniently, these are all independent games.
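The first escalation step — one learning agent seated against three naive players — can be sketched as a driver loop. Everything here is hypothetical scaffolding: `play_episode` is a stand-in for our real Trex engine (it returns random scores purely so the sketch runs), and the learner's `update` call is left as a comment since the model architecture is not yet chosen.

```python
import random

def play_episode(agents, deal):
    """Stand-in for the real Trex engine (hypothetical): returns one
    per-seat score per agent. Real scoring comes from playing contracts."""
    rng = random.Random(deal)
    return [rng.randint(-50, 50) for _ in agents]

class NaivePlayer:
    """Baseline that picks uniformly among legal moves, as in our repo."""
    def act(self, state, legal_moves):
        return random.choice(legal_moves)

def train_against_baselines(learner, n_episodes=1000):
    """Escalation step 1: one learning agent vs. three naive players."""
    opponents = [NaivePlayer() for _ in range(3)]
    history = []
    for episode in range(n_episodes):
        scores = play_episode([learner] + opponents, deal=episode)
        history.append(scores[0])  # learner sits in seat 0
        # learner.update(...) would consume this episode's transitions here
    return history
```

The same loop generalizes to later steps: swap `NaivePlayer` for the Minimax/Q-learning agents, then for other copies of the model itself.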
## Metrics
Ultimately, the outcome of a single contract depends entirely on the (card) hands and skill of both a player and their partner, so even over the twenty contracts in a game, the variance in individual, partner, and total scores may be quite high. In the limit, however, the variance in hands drawn between players (over the course of, say, 1,000 games) should become trivial. Thus, while it is possible to determine rewards at finer granularity for the sake of training, we believe the clearest metric for success is the model's relative long-term performance: first against naive players, then against AI players, then against other versions of itself, and of course against any humans at the skill levels presumed to be represented by each of these three simulated player classes.
Because game-to-game scores depend so much on the random hands of both the individual and the partner, as well as on the partner's skill level, it does not seem meaningful to treat the total score of a single game as the metric of success. Thus, following the training escalation described in the previous section, we intend to mitigate this variance by playing as many games as is computationally feasible, and to judge the model's success by statistically significant trends in individual score, combined partnership score, win-loss rate, or some yet-to-be-determined metric (all relative to naive, AI, and human players).
Informally, it would be fantastic to see the model beat (or prevail, over a sufficiently large number of games, with a frequency better than chance against) two average-skilled opponents while partnered with a below-average player (whether human or simulated naive). This would indicate the model's command of both its own gameplay and its ability to assist a partner through inference.
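"Better than chance over a sufficiently large number of games" can be made precise with an exact one-sided binomial test, using only the standard library. The choice of 0.25 as the chance-level win rate for one seat out of four is our assumption (a partnership-level analysis would use 0.5 instead), and the function name is ours.

```python
from math import comb

def binomial_p_value(wins, games, p_null=0.25):
    """One-sided exact binomial test: probability of seeing at least
    `wins` wins in `games` games if the model only won at chance level.
    p_null=0.25 assumes per-seat chance in a four-player game."""
    return sum(
        comb(games, k) * p_null**k * (1 - p_null)**(games - k)
        for k in range(wins, games + 1)
    )

# e.g. 300 wins over 1000 games against a 25% chance baseline yields a
# very small p-value, while 250/1000 (exactly chance) does not.
p = binomial_p_value(300, 1000)
```

A fixed significance threshold (say, p < 0.01) chosen before the evaluation runs would then pin down the "sufficiently large number of games" for each matchup.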
## Ethics
**Why is deep learning a good approach to this problem?** Because this game consists of well-defined game states and sequences, we believe it will map reasonably well to deep reinforcement learning. Importantly, however, the imperfect information, combined with the possibility of simply drawing poor cards and the need to advance a partner's score as well, leads to potentially very sophisticated strategies that would be impossible to encode in a tabular or heuristic manner.
**What broader societal issues are relevant to your chosen problem space?** Gameplay is significant in its own right, but more broadly, the ability of models to enact sophisticated cooperative policies and strategies under imperfect information is applicable to many of the challenges that cutting-edge models engage with, ranging from (semi-)autonomous vehicles to other complex task-based risk/reward scenarios.
## Division of Labor
This section will need to be updated in the coming days. As things stand, the baseline repository with naive players is complete, so what primarily remains is the design of the AI and deep learning models. Early on, it may be best to allocate work along lines of agent type, so that each group member can more quickly present a new player type usable in the various training permutations mentioned above. Specifically, each of us might familiarize ourselves with one type of model and present to the group an assessment of its merits and challenges when applied to this problem. After that point, we will select an architecture and repartition the labor accordingly. Mohammad has expressed interest in spearheading the development of a user interface, should the model's success and the remaining time permit.