## Who
- Mohammad Amoush (mamoush)
- Aryaman Dutta (adutta3)
- Sam Wilkins (swilkin5)
## README
The markdown that follows is our original project post. Please refer to the final reflection for updated analysis.
## Introduction
We'd like to delve into deep reinforcement learning by implementing a model that can play comparably to an intermediate-level human in the four-player, predominantly Middle-Eastern card game Trex. The game is relatively straightforward for a human to learn, but it has a specific and comparatively complex structure containing a fixed number of sub-games: there are 4 Kingdoms, each consisting of the same 5 contracts, and, glossing over the nature of the Trex contract, each contract is effectively 13 tricks. Each contract requires its own strategy to play consistently well. We find this a compelling challenge to tackle because at any given time (particularly early in a round, or at any point if one doesn't count cards) the game state is one of imperfect information. It will be interesting to see how, and to what extent, a model can cope with the combination of, at varying times, both very constrained choices and more arbitrary actions ('hedging its bets'). Finally, the fact that the model must play to benefit both itself and its partner, whose hand is unknown but can be the subject of inference over time, adds another layer to this goal.
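To make the fixed structure above concrete, here is a minimal sketch in Python. The constant names are ours (hypothetical, not taken from our engine); only the counts — 4 Kingdoms, 5 contracts each, 13 tricks per contract — come from the game itself.

```python
# Hypothetical constants sketching Trex's fixed sub-game structure.
KINGDOMS = 4
CONTRACTS = ["King of Hearts", "Queens", "Diamonds", "Slaps", "Trex"]
TRICKS_PER_CONTRACT = 13  # glossing over the Trex contract's different mechanics

# A full game is a fixed number of independent sub-games (contracts):
total_contracts = KINGDOMS * len(CONTRACTS)  # 20 contracts per game
```

The independence of the 20 contracts is what later lets us scope the project down to a single contract if training the full game proves too expensive.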
## Related Work
- Python ISMCTS
- DeepStack
- Deep Reinforcement Learning from Self-Play in Imperfect-Information Games
- Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information
- Playing Cards with Reinforcement Learning
The third entry above, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games, proposes an algorithm dubbed Neural Fictitious Self-Play (NFSP) to combine the existing technique of fictitious self-play with the capabilities of neural network function approximation. In the pseudocode provided, "all players of the game are controlled by separate NFSP agents that learn from simultaneous play against each other, i.e. self-play." Further, "an NFSP agent interacts with its fellow agents and memorizes its experience of game transitions and its own best response behavior in two memories": one stores transitions used to train a neural network with off-policy reinforcement learning to determine the best-response strategy β; the other stores the agent's own best-response behavior, used to train a supervised classification network that learns the average strategy π. In short, the agent's eventual play is a mixture of these two strategies. The publication indicates that such NFSP agents are much less exploitable than DQN agents playing either their greedy or their average strategies.
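The two-memory design is the heart of NFSP, so a small sketch may help. This is our own illustrative skeleton, not code from the paper: the class and method names are hypothetical, and the networks that would consume each memory are omitted. Following the paper, the RL memory is a sliding window of transitions for off-policy learning of β, while the SL memory is a reservoir sample of the agent's own best-response actions for learning π.

```python
import random
from collections import deque

class NFSPAgent:
    """Sketch of NFSP's two-memory design (names are ours, not the paper's).

    rl_memory: sliding window of (s, a, r, s') transitions, consumed by an
               off-policy RL learner to fit the best-response strategy beta.
    sl_memory: reservoir sample of (s, a) pairs recorded only when the agent
               acted greedily, consumed by a supervised classifier to fit
               the average strategy pi.
    """

    def __init__(self, rl_capacity=10_000, sl_capacity=10_000, eta=0.1):
        self.rl_memory = deque(maxlen=rl_capacity)  # old transitions fall off
        self.sl_memory = []                         # reservoir, uniform sample
        self.sl_capacity = sl_capacity
        self.seen = 0       # count of best-response actions observed so far
        self.eta = eta      # anticipatory parameter: prob. of playing beta

    def observe(self, state, action, reward, next_state, acted_greedily):
        self.rl_memory.append((state, action, reward, next_state))
        if acted_greedily:
            self._reservoir_add((state, action))

    def _reservoir_add(self, item):
        # Algorithm R: keeps a uniform sample of best-response behavior
        self.seen += 1
        if len(self.sl_memory) < self.sl_capacity:
            self.sl_memory.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.sl_capacity:
                self.sl_memory[j] = item
```

The reservoir (rather than a sliding window) for the SL memory matters: the average strategy π is supposed to reflect the agent's behavior over all of training, not just its recent play.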
## Data and Methodology
We estimate that the amount of recorded game data required for supervised learning is beyond the scope of this project, so we intend to stick to a reinforcement learning pipeline built on self-play. Ideally, we will take a gradual approach. We have already implemented naive players that can play full games, engaging with learnable behaviors simply by taking random actions. Next, we will implement conventional AI agents (Minimax and Q-learning are candidates) that can routinely outperform the initial set of naive players. Finally, having designed our deep reinforcement learning model (DQN, REINFORCE, and NFSP are candidates), we will begin by repeatedly training a single model against three AI agents. Depending on how well this goes, and on the compute resources ultimately required, the process may culminate in training 1 ≤ n ≤ 4 reinforcement learning models (each with slightly different hyperparameterizations) against (4 − n) AI agents. If any of these steps proves disproportionate to the scope of the project, we may restrict the model to a single contract (King of Hearts, Slaps, Queens, Diamonds, or Trex), as, conveniently, these are all independent games.
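The first escalation step — one learning agent seated against three naive players — can be sketched as a driver loop. Everything here is hypothetical scaffolding: `play_episode` is a stand-in for our real Trex engine (it returns random scores purely so the sketch runs), and the learner's `update` call is left as a comment since the model architecture is not yet chosen.

```python
import random

def play_episode(agents, deal):
    """Stand-in for the real Trex engine (hypothetical): returns one
    per-seat score per agent. Real scoring comes from playing contracts."""
    rng = random.Random(deal)
    return [rng.randint(-50, 50) for _ in agents]

class NaivePlayer:
    """Baseline that picks uniformly among legal moves, as in our repo."""
    def act(self, state, legal_moves):
        return random.choice(legal_moves)

def train_against_baselines(learner, n_episodes=1000):
    """Escalation step 1: one learning agent vs. three naive players."""
    opponents = [NaivePlayer() for _ in range(3)]
    history = []
    for episode in range(n_episodes):
        scores = play_episode([learner] + opponents, deal=episode)
        history.append(scores[0])  # learner sits in seat 0
        # learner.update(...) would consume this episode's transitions here
    return history
```

The same loop generalizes to later steps: swap `NaivePlayer` for the Minimax/Q-learning agents, then for other copies of the model itself.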
## Metrics
Ultimately, the outcome of a single contract depends entirely on the (card) hands and skill of both a player and their partner, so even over the twenty contracts in a game, the variance in individual, partner, and total scores may be quite high. In the limit, however, the variance in hands drawn between players (over the course of, say, 1,000 games) should become trivial. Thus, while it is possible to determine rewards at finer granularity for the sake of training, we believe the clearest metric for success is the model's relative long-term performance: first against naive players, then against AI players, then against other versions of itself, and of course against any humans at the skill levels presumed to be represented by each of these three simulated player classes.
Because game-to-game scores depend so much on the random hands of both the individual and the partner, as well as on the partner's skill level, it does not seem meaningful to treat the total score of a single game as the metric of success. Thus, following the training escalation described in the previous section, we intend to mitigate this variance by playing as many games as is computationally feasible, and to judge the model's success by statistically significant trends in individual score, combined partnership score, win-loss rate, or some yet-to-be-determined metric (all relative to naive, AI, and human players).
Informally, it would be fantastic to see the model beat (or prevail, over a sufficiently large number of games, with a frequency better than chance against) two average-skilled opponents while partnered with a below-average player (whether human or simulated naive). This would indicate the model's command of both its own gameplay and its ability to assist a partner through inference.
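"Better than chance over a sufficiently large number of games" can be made precise with an exact one-sided binomial test, using only the standard library. The choice of 0.25 as the chance-level win rate for one seat out of four is our assumption (a partnership-level analysis would use 0.5 instead), and the function name is ours.

```python
from math import comb

def binomial_p_value(wins, games, p_null=0.25):
    """One-sided exact binomial test: probability of seeing at least
    `wins` wins in `games` games if the model only won at chance level.
    p_null=0.25 assumes per-seat chance in a four-player game."""
    return sum(
        comb(games, k) * p_null**k * (1 - p_null)**(games - k)
        for k in range(wins, games + 1)
    )

# e.g. 300 wins over 1000 games against a 25% chance baseline yields a
# very small p-value, while 250/1000 (exactly chance) does not.
p = binomial_p_value(300, 1000)
```

A fixed significance threshold (say, p < 0.01) chosen before the evaluation runs would then pin down the "sufficiently large number of games" for each matchup.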
## Ethics
**Why is deep learning a good approach to this problem?** Because this game consists of well-defined game states and sequences, we believe it will map reasonably well to deep reinforcement learning. Importantly, however, the imperfect information, combined with the possibility of simply drawing poor cards and the need to advance a partner's score as well, leads to potentially very sophisticated strategies that would be impossible to encode in a tabular or heuristic manner.
**What broader societal issues are relevant to your chosen problem space?** Gameplay is significant in its own right, but more broadly, the ability of models to enact sophisticated cooperative policies and strategies under imperfect information is applicable to many of the challenges that cutting-edge models engage with, ranging from (semi-)autonomous vehicles to other complex task-based risk/reward scenarios.
## Division of Labor
This section will need to be updated in the coming days. As things stand, the baseline repository with naive players is complete, so what primarily remains is the design of the AI and deep learning models. Early on, it may be best to allocate work along lines of agent type, so that each group member can more quickly present a new player type usable in the various training permutations mentioned above. Specifically, each of us might familiarize ourselves with one type of model and present to the group an assessment of its merits and challenges when applied to this problem. After that point, we will select an architecture and repartition the labor accordingly. Mohammad has expressed interest in spearheading the development of a user interface, should the model's success and the remaining time permit.