<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Lucas Schnack</title>
  <subtitle>Blockchain Security and Machine Learning Blog</subtitle>
  <link href="https://schnack.dev/blog/atom.xml" rel="self" type="application/atom+xml" />
  <link href="https://schnack.dev/blog/" rel="alternate" type="text/html" />
  <id>https://schnack.dev/blog/</id>
  <updated>2026-03-07T17:03:27+01:00</updated>
  <author>
    <name>Lucas Schnack</name>
  </author>
  <entry>
    <title>Layers of Rock Paper Scissors</title>
    <link href="https://schnack.dev/blog/kaggle/rock-paper-scissors-ensemble-meta-strategy/" rel="alternate" type="text/html" />
    <id>https://schnack.dev/blog/kaggle/rock-paper-scissors-ensemble-meta-strategy/</id>
    <updated>2026-03-07T17:03:27+01:00</updated>
    <published>2021-01-23T00:00:00Z</published>
    <summary>Post-mortem analysis and strategy writeup for the Rock Paper Scissors competition on Kaggle.</summary>
    <content type="html">&lt;p&gt;This is a summary of my solution for the 2020-2021 &lt;a href="https://www.kaggle.com/c/rock-paper-scissors"&gt;"Rock, Paper, Scissors"&lt;/a&gt; competition hosted on Kaggle. Participants had to write bots that competed against each other in the well-known RPS game. Each matchup consisted of 1000 rounds, making this an excellent task for machine learning/statistics/AI.&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When you first hear of Rock, Paper, Scissors, you may wonder why it makes for an interesting computer competition. After all, there is no way to statistically beat the Nash equilibrium strategy of playing completely at random. This doesn't make the challenge meaningless, though, because the majority of opponents your bot faces won't play purely at random. The goal is therefore to exploit these deterministic agents while limiting how exploitable your own bot becomes (playing deterministically is a double-edged sword). I'll give a short high-level overview of the approach I chose for this task: an ensemble strategy involving multiple meta-layers.&lt;/p&gt;
&lt;h2 id="policies"&gt;Policies&lt;/h2&gt;
&lt;p&gt;In my code, a policy/strategy/action selector is simply a mapping of the current game history to a probability distribution over the three possible actions. A policy selecting rock all the time would yield [1, 0, 0] for example. There are multiple ways to create new policies from existing ones of which I use the following:&lt;/p&gt;
&lt;h3 id="strict-policies"&gt;Strict policies&lt;/h3&gt;
&lt;p&gt;This transformation takes the probabilities from a policy and selects the action with the highest probability, e.g. [0.6, 0.2, 0.2] becomes [1, 0, 0]. Therefore we now have a deterministic version of the original policy.&lt;/p&gt;
&lt;h3 id="incremented-policies"&gt;Incremented policies&lt;/h3&gt;
&lt;p&gt;"When my opponent thinks that I think he thinks I think..."
Dan Egnor already implemented this idea in his Iocaine Powder bot. One takes a policy and shifts the probabilities by one to produce a version that would counter the current one. [0.2, 0.5, 0.3] becomes [0.3, 0.2, 0.5]. One can derive a double incremented policy by shifting the incremented version.&lt;/p&gt;
&lt;h3 id="counter-policies"&gt;Counter policies&lt;/h3&gt;
&lt;p&gt;This is an attempt to construct a policy that counters a given policy. The counter policy is built by assuming the opponent uses the original policy (one has to swap one's own actions and the opponent's actions in the game history) and shifting the resulting probability distribution.&lt;/p&gt;
&lt;h3 id="combinations"&gt;Combinations&lt;/h3&gt;
&lt;p&gt;As you can see, it is relatively easy to derive a whole set of policies from a single stochastic policy. By combining the above transformations, one can get up to twelve policies out of a single one.&lt;/p&gt;
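&lt;p&gt;A minimal sketch of the three transformations, assuming policies are probability vectors over (rock, paper, scissors) and the action at index i is beaten by the action at index (i+1) mod 3:&lt;/p&gt;

```python
import numpy as np

def strict(probs):
    # deterministic version: put all probability mass on the argmax action
    out = np.zeros(3)
    out[np.argmax(probs)] = 1.0
    return out

def incremented(probs):
    # shift each probability onto the action that would beat it,
    # e.g. [0.2, 0.5, 0.3] becomes [0.3, 0.2, 0.5]
    return np.roll(probs, 1)

def counter(policy):
    # counter a known policy: swap the roles in the game history,
    # predict with the original policy, then shift the prediction
    def counter_policy(my_history, opp_history):
        return np.roll(policy(opp_history, my_history), 1)
    return counter_policy
```

&lt;p&gt;Combining these (strict/stochastic, zero/one/two increments, original/counter) is how a single stochastic policy fans out into up to twelve variants.&lt;/p&gt;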
&lt;h3 id="list-of-base-policies-used-in-my-final-agent"&gt;List of base policies used in my final agent&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Random&lt;/li&gt;
&lt;li&gt;Frequency&lt;/li&gt;
&lt;li&gt;Copy last action&lt;/li&gt;
&lt;li&gt;Transition matrix&lt;/li&gt;
&lt;li&gt;Transition tensor&lt;/li&gt;
&lt;li&gt;Random forest model&lt;/li&gt;
&lt;li&gt;Two adapted stochastic versions of RFind&lt;/li&gt;
&lt;li&gt;Some fixed sequences (pi, e, De Bruijn)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/taahakhan/rps-cracking-random-number-generators"&gt;Seed searcher&lt;/a&gt; by &lt;a href="https://github.com/taaha-khan"&gt;@taaha-khan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/superant/rps-geometry-silver-rank-by-minimal-logic"&gt;RPS Geometry&lt;/a&gt; by &lt;a href="https://www.kaggle.com/superant"&gt;@superant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/robga/beating-geometry-bot/output"&gt;Anti-Geometry bot&lt;/a&gt; by &lt;a href="https://www.kaggle.com/robga"&gt;@robga&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/erdman/roshambo"&gt;Greenberg&lt;/a&gt; by Andrzej Nagorko, translated into Python by Travis Erdman&lt;/li&gt;
&lt;li&gt;&lt;a href="http://davidbau.com/downloads/rps/rps-iocaine.py"&gt;Iocaine Powder&lt;/a&gt; by Dan Egnor, translated into Python by David Bau&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and the following RPS Contest bots, which I could integrate into my ensemble using &lt;a href="https://www.kaggle.com/purplepuppy/running-rpscontest-bots"&gt;this method&lt;/a&gt; by Daniel Lu (&lt;a href="https://github.com/dllu"&gt;@dllu&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/342001"&gt;testing please ignore&lt;/a&gt; by Daniel Lu&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/161004"&gt;centrifugal bumblepuppy 4&lt;/a&gt; by Daniel Lu&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/885001"&gt;IO2_fightinguuu&lt;/a&gt; by sdfsdf&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/498002"&gt;dllu1&lt;/a&gt; by Daniel Lu&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/5649874456412160"&gt;RPS_Meta_Fix&lt;/a&gt; by TeleZ&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/892001"&gt;Are you a lucker?&lt;/a&gt; by sdfsdf&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.rpscontest.com/entry/202003"&gt;bayes14&lt;/a&gt; by pyfex&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="policy-selector"&gt;Policy selector&lt;/h2&gt;
&lt;p&gt;I tried many ways of selecting the best policy for a move, including weighted agent voting, Thompson sampling, a beta distribution, a Dirichlet distribution, LSTMs, CNNs and max wins, but one outperformed them all: plain argmax applied to the decayed performance scores of the policies (which I didn't even consider until one week before the competition ended because it was so simple). Every step, the performance of a policy is updated like this: &lt;code&gt;score = probabilities[winning_action] - probabilities[losing_action]&lt;/code&gt;.
My final version of the policy selector simply takes the three policies with the highest decayed performance scores and combines their probabilities using the weights [0.7, 0.2, 0.1]. I really don't know why this performs so well, but it does.&lt;/p&gt;
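&lt;p&gt;A minimal sketch of this selector; the decay constant here is a made-up value, only the score formula and the [0.7, 0.2, 0.1] blend come from the writeup:&lt;/p&gt;

```python
import numpy as np

DECAY = 0.93             # hypothetical decay constant
BLEND = [0.7, 0.2, 0.1]  # top-3 weights from the writeup

def update_scores(scores, dists, winning_action, losing_action):
    # decayed performance: reward probability mass on the action that
    # would have won this round, penalise mass on the losing action
    step = np.array([d[winning_action] - d[losing_action] for d in dists])
    return DECAY * scores + step

def select(scores, dists):
    # blend the distributions of the three best-scoring policies
    top3 = np.argsort(scores)[::-1][:3]
    mixed = sum(w * dists[i] for w, i in zip(BLEND, top3))
    return mixed / mixed.sum()
```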
&lt;h2 id="policy-selector-selector"&gt;Policy selector-selector&lt;/h2&gt;
&lt;p&gt;I added a meta-layer by creating multiple policy selectors with different parameters for calculating the performance score. The parameters are the decay (used to differentiate between long- and short-term success), the drop probability (the probability of resetting a policy's score when it loses; this didn't perform too well in combination with a decay &amp;lt; 0.99) and the zero clip (whether the score should be clipped at zero). The agent then chooses the best-performing parameter configuration (measured by a decayed configuration performance score with additional hyperparameters) by applying argmax to the configuration weights, which are computed by sampling from a Dirichlet distribution parameterised by the configuration performance scores. At first, I tried combining the probabilities of all configurations weighted by their respective performance, but this introduced way too much noise, so only configurations that yield a decisive result at this step get considered.
If the result is still too indecisive (i.e. the summed weight of all configurations yielding decisive probabilities is too small), my agent selects a uniformly random action instead of choosing from the distribution produced by the best parameter configuration.&lt;/p&gt;
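&lt;p&gt;The Dirichlet sampling step can be sketched roughly like this; the clipping and the concentration factor are assumptions, not the bot's actual hyperparameters:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def pick_configuration(config_scores, concentration=8.0):
    # sample configuration weights from a Dirichlet distribution whose
    # parameters come from the (clipped) configuration performance scores;
    # `concentration` is a hypothetical knob controlling how noisy the
    # sampled weights are
    alpha = np.maximum(np.asarray(config_scores), 1e-3) * concentration
    weights = rng.dirichlet(alpha)
    # argmax over the sampled weights picks the configuration to trust
    return int(np.argmax(weights))
```

&lt;p&gt;The sampling injects a little exploration, while the final argmax keeps the decision decisive rather than averaging many configurations together.&lt;/p&gt;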
&lt;h2 id="playing-randomly"&gt;Playing randomly&lt;/h2&gt;
&lt;p&gt;This is the last meta-layer in my agent and maybe the one with the biggest impact. Choosing a random action is a great defensive tool as there is simply no way to beat it. An additional decayed performance score gets computed for the agent as a whole; at this level, I no longer use a probability distribution for the score calculation, but the action the agent actually selected.
The agent chooses a random action with a certain probability in the following cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The probability of winning by playing random is &amp;gt;= 92%.&lt;/li&gt;
&lt;li&gt;The agent's performance score is below a certain threshold. (I tried playing the counter of the counter of the action selected by the agent with a certain probability instead, but this didn't perform too well.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="observations"&gt;Observations&lt;/h2&gt;
&lt;p&gt;So that's it. Essentially just a strange cascade of scores and meta-layers operating on top of an ensemble of other agents. Interestingly, this architecture performs way better than previous ones with more meta-layers, more policies and more sophisticated methods for combining weighted probability distributions. This could be due to a bot essentially degenerating to random at a certain degree of complexity.&lt;/p&gt;
&lt;p&gt;The stochastic base policies, which are obviously strongly correlated with their strict versions, are almost never used by the ensemble. However, removing them resulted in a major decline in leaderboard performance.&lt;/p&gt;
&lt;p&gt;Removing almost all policies (except Greenberg and the Geometry bot by @superant) while keeping their counter policies results in a very strong agent, even outperforming the original version on the leaderboard. My interpretation is that this reduces noise that would distract the agent from cracking the opponent's strategy, and that minimal logic (already demonstrated by @superant) suffices apart from the counter policies.&lt;/p&gt;
&lt;p&gt;In my local tests, using online-learned LSTMs as policy scoring functions vastly outperformed (by up to 80 points) the approach described here, particularly because agents with LSTMs adapt extremely fast to policy changes. On the leaderboard, however, this strategy performs worse. I'm convinced there is much room for further improvement using neural networks with a good architecture and maybe a combination of offline and online learning.&lt;/p&gt;
&lt;h2 id="regrets"&gt;Regrets&lt;/h2&gt;
&lt;p&gt;Submitting almost every agent only once and using only a fraction of my submissions wasn't a good idea in a competition where a few matches against random agents can completely destroy an agent's score.&lt;/p&gt;
&lt;h2 id="code"&gt;Code&lt;/h2&gt;
&lt;p&gt;The code for this bot can be found on &lt;a href="https://github.com/0Zeta/RockPaperScissors"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RPS has been an amazing competition and I'm sure I learned a lot from it. What I will remember most, however, is the incredible collaboration and helpfulness in the discussions section of this competition. Thanks to everyone who made this challenge the great experience it was. Of course, also special thanks to Kaggle for hosting this competition and providing excellent support.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Halite IV - Resource Management in Space</title>
    <link href="https://schnack.dev/blog/kaggle/halite-iv-resource-management/" rel="alternate" type="text/html" />
    <id>https://schnack.dev/blog/kaggle/halite-iv-resource-management/</id>
    <updated>2026-03-07T17:03:27+01:00</updated>
    <published>2020-09-24T00:00:00Z</published>
    <summary>Post-mortem analysis and strategy writeup for the Halite IV AI simulation challenge on Kaggle.</summary>
    <content type="html">&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;From July to September 2020, my teammate &lt;a href="https://github.com/Dave-Frank-419"&gt;David Frank&lt;/a&gt; and I participated in the AI programming challenge &lt;a href="https://www.kaggle.com/c/halite"&gt;"Halite by Two Sigma"&lt;/a&gt; hosted on Kaggle. I'll represent my team in this post to publish some information on our solution. First of all, we want to thank Kaggle and Two Sigma for hosting this competition and providing excellent support, which made the whole competition something special. Halite IV was the first programming competition we seriously participated in, so it was a genuinely new and exciting experience for us. We also want to thank the whole community for the friendly and helpful atmosphere, which made this highly competitive challenge truly a fun way to spend our free time on something interesting.&lt;/p&gt;
&lt;h2 id="game-rules"&gt;Game rules&lt;/h2&gt;
&lt;p&gt;You can find the full game rules on the &lt;a href="https://www.kaggle.com/c/halite/overview/halite-rules"&gt;Kaggle webpage&lt;/a&gt;, but these are the key features of the game:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Halite is a four-player game taking place on a 21x21 grid map over 400 turns in which the players choose their actions simultaneously.&lt;/li&gt;
&lt;li&gt;Each player starts with 5000 halite (the central resource in the game) and one ship.&lt;/li&gt;
&lt;li&gt;Ships can choose from six actions: they can move in four directions (north, east, south and west), stand still or convert into a shipyard for 500 halite.&lt;/li&gt;
&lt;li&gt;A shipyard can spawn one ship per turn for a fee of 500 halite.&lt;/li&gt;
&lt;li&gt;Halite is placed symmetrically on many of the cells of the game map.&lt;/li&gt;
&lt;li&gt;When a ship stands still on a cell with halite, it mines 25% of the cell's halite each turn, receiving that amount as cargo.&lt;/li&gt;
&lt;li&gt;Each turn, halite cells with no ships on them regenerate 2% of their halite value up to a maximum of 500 halite.&lt;/li&gt;
&lt;li&gt;Ships must return their cargo by moving on top of a shipyard, which transfers the cargo to the player's bank, from which halite can be spent.&lt;/li&gt;
&lt;li&gt;When multiple ships move to the same cell on the same turn, the ship with the least cargo survives and receives the cargo of the other ships, which get destroyed. All ships get destroyed in case of a tie.&lt;/li&gt;
&lt;li&gt;Ships crashing into an enemy shipyard destroy themselves and the shipyard unless the shipyard has a ship on top of it or spawns one.&lt;/li&gt;
&lt;li&gt;When a player has no ships and shipyards left (or only shipyards and not enough halite to spawn a ship), they are immediately eliminated from the game.&lt;/li&gt;
&lt;li&gt;At the end of the game, the surviving players are ranked by the halite in their bank.&lt;/li&gt;
&lt;/ul&gt;
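&lt;p&gt;To get a feel for the numbers, here is a tiny illustration of the two resource rules (2% regeneration capped at 500, 25% mining per turn); this is just arithmetic, not competition code:&lt;/p&gt;

```python
def regrow(halite, turns):
    # an unoccupied cell regenerates 2% per turn, capped at 500 halite
    for _ in range(turns):
        halite = min(500.0, halite * 1.02)
    return halite

def mine(halite, turns):
    # a ship standing still collects 25% of the remaining cell halite
    # per turn and carries it as cargo
    cargo = 0.0
    for _ in range(turns):
        collected = 0.25 * halite
        cargo += collected
        halite -= collected
    return cargo, halite
```

&lt;p&gt;For example, a 400-halite cell yields 100 halite in the first mining turn, while a half-harvested cell needs about 36 turns to double back (1.02^36 ≈ 2), which is why farming only pays off over long horizons.&lt;/p&gt;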
&lt;p&gt;&lt;img alt="an example game" src="https://user-images.githubusercontent.com/9535190/93367888-6a6e3880-f84d-11ea-85fa-527fd3afb62b.gif" /&gt;&lt;/p&gt;
&lt;h2 id="our-rule-based-approach"&gt;Our rule-based approach&lt;/h2&gt;
&lt;h3 id="parameters"&gt;Parameters&lt;/h3&gt;
&lt;p&gt;First of all, I have to admit that our approach utilizes no fancy ML techniques or otherwise disproportionately complex algorithms. Instead, our bot is a 100% rule-based bot with lots of parameters. Initially, we considered using deep reinforcement learning, but concluded the problem wasn't a good fit because we couldn't find a way to make a player-based approach work. Therefore we settled for a bot with many parameters that would then be tuned by a simple evolutionary algorithm, which worked out quite well in the first half of the competition. Nevertheless, we discarded the evolutionary optimization near the end of the competition because of problems with overfitting and symmetry. So the behaviour of our final bot is dictated by more than a hundred suboptimal parameters we chose more or less by trial and error. These parameters control when we stop spawning ships, where to place our shipyards, how different factors influence mining and hunting, etc. (basically everything).&lt;/p&gt;
&lt;h3 id="general-strategy"&gt;General strategy&lt;/h3&gt;
&lt;p&gt;On a high level, our bot aims to place shipyards in triangles around areas with a high population of halite cells (not all cells contain halite; in fact most of them don't), creating halite plantations. Then we wait for the halite to grow and mine cells once their halite values reach a certain harvest threshold. Shortly before the game finishes, we try to mine out our whole plantation. Because of our dependence on shipyards, we do our best to protect our halite plantations and scare enemies away, but we also deliberately let a few enemies mine inside our farming zone and catch them afterwards, causing ship losses among our opponents.
At this point, I want to thank &lt;a href="https://github.com/flynnwang"&gt;Fei Wang&lt;/a&gt;, whose early success with his farming strategy, which also employs shipyards placed in triangles, made us recognize the impact of halite regeneration and the viability of this approach to the game.&lt;/p&gt;
&lt;h3 id="features"&gt;Features&lt;/h3&gt;
&lt;p&gt;In the following, I will give an overview of the most important features of our bot while trying to strike a balance between oversimplifying interesting stuff and going into too much detail on the boring specifics of the implementation. In general, we decided against building a high-level framework for our bot: there were so many situations requiring special handling that we felt we could accomplish our tasks best without much abstraction, although this approach was very time-consuming, produced messy code and caused many bugs.&lt;/p&gt;
&lt;h4 id="stateless-vs-stateful-agent"&gt;Stateless vs stateful agent&lt;/h4&gt;
&lt;p&gt;Until very late in the competition our agent was completely stateless, ignoring past turns at each step, and our final bot also keeps only very little information between turns, e.g. the positions of planned shipyards, past intruders in our region and the number of our shipyards. This way, we reduce the impact of past bad decisions on newer ones.&lt;/p&gt;
&lt;h4 id="ship-types"&gt;Ship types&lt;/h4&gt;
&lt;p&gt;At the start of each turn, our bot assigns each ship a ship type based on different metrics. Our bot uses 10 ship types in total, but these are the most important:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MINING ships travel to halite-rich cells and mine halite.&lt;/li&gt;
&lt;li&gt;RETURNING ships are ships with high cargo values that travel to a shipyard to deposit their cargo.&lt;/li&gt;
&lt;li&gt;HUNTING ships try to crash into enemy ships to steal their cargo. Most of them carry 0 halite.&lt;/li&gt;
&lt;li&gt;GUARDING ships patrol along the borders of our region facing the most enemies and protect shipyards. They also carry 0 halite, and when an enemy approaches our territory they become DEFENDING ships.&lt;/li&gt;
&lt;li&gt;DEFENDING ships are HUNTING ships that attack enemies in regions dominated by us.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="scoring-functions"&gt;Scoring functions&lt;/h4&gt;
&lt;p&gt;Our bot uses different scoring functions, which get computed for each ship, in combination with linear sum assignments for basically everything. Mining ships get assigned the mining target that maximizes the sum of all mining scores, hunting ships choose the target with the highest hunting score, guarding ships patrol on the border with the highest guarding score, etc. This technique proved to be simple and quite effective.&lt;/p&gt;
&lt;h4 id="move-preferences"&gt;Move preferences&lt;/h4&gt;
&lt;p&gt;At the end of each turn, we use a linear sum assignment to assign each ship an action simultaneously without any order. Specifically for every ship, we calculate a move preference score for each of the six available actions. We chose this approach because it allows us to handle special cases flexibly, give different ship types distinct priorities and easily avoid collisions with enemies (well, at least in theory). These move preference scores are affected by many factors other than the destination of a ship, e.g. the number of safe cells around a position, halite, farming positions on this axis, etc. By handling all ships simultaneously we could improve our bot significantly compared to our old approach, which computed a move order for all ships.&lt;/p&gt;
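&lt;p&gt;The assignment step can be illustrated with a toy version. The scores below are invented, and a real bot would use &lt;code&gt;scipy.optimize.linear_sum_assignment&lt;/code&gt; rather than this brute force:&lt;/p&gt;

```python
from itertools import permutations

def assign_targets(pref):
    # pref[s][c] is the preference score of ship s for candidate cell c;
    # pick one distinct cell per ship so the summed score is maximal
    # (brute force over permutations, fine only for a handful of ships)
    n_ships = len(pref)
    best = max(
        permutations(range(len(pref[0])), n_ships),
        key=lambda cells: sum(pref[s][c] for s, c in enumerate(cells)),
    )
    return list(best)
```

&lt;p&gt;If two ships both prefer the same cell, maximizing the joint score makes the ship with more to lose keep it while the other is diverted, which is how collisions between friendly ships are avoided without imposing a move order.&lt;/p&gt;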
&lt;h4 id="mining"&gt;Mining&lt;/h4&gt;
&lt;p&gt;We built upon the formula in &lt;a href="https://www.kaggle.com/solverworld/optimal-mining-with-carried-halite"&gt;this&lt;/a&gt; excellent notebook by &lt;a href="https://www.kaggle.com/solverworld"&gt;SolverWorld&lt;/a&gt;, but we tuned the approach a bit and added some factors to the calculation. For example, we delay the return of our mining ships when there is still a lot of halite around them. We also added many discount factors to express the uncertainty that comes with predicting future states, and we adjust the mining scores of cells according to their safety, which describes how many of our ships are near the cell compared to our opponents' ships, taking into account how dangerous those ships are (ships with less halite exert greater dominance). This safety matrix treats all opponents as a single one, making our mining ships avoid regions of conflict and zones dominated by our opponents.&lt;/p&gt;
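&lt;p&gt;A rough sketch of the kind of score such a formula produces. The "travel, mine n turns, return" structure and the 25% mining rate follow the game rules, but the discount value and the search range here are invented, not our tuned parameters:&lt;/p&gt;

```python
def mining_score(cell_halite, cargo, dist_to_cell, dist_to_yard, discount=0.98):
    # expected deposited halite per turn for "travel, mine n turns, return";
    # mining takes 25% of the remaining cell halite per turn, so n turns
    # collect cell_halite * (1 - 0.75 ** n); `discount` is a hypothetical
    # factor expressing uncertainty about future states
    def rate(n):
        collected = cell_halite * (1.0 - 0.75 ** n)
        turns = dist_to_cell + n + dist_to_yard
        return (discount ** turns) * (cargo + collected) / turns
    best_n = max(range(1, 15), key=rate)
    return rate(best_n), best_n
```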
&lt;h4 id="hunting"&gt;Hunting&lt;/h4&gt;
&lt;p&gt;Ships with bad mining scores and ships that are very likely to catch an enemy ship become hunting ships that try to destroy our opponents' ships. The maximum number of hunting ships targeting an enemy from one direction is limited, which forces hunting ships to approach their targets from multiple sides, increasing the probability of a catch. While our hunting score calculation is quite complex, the key factors are distance, player score, danger to our ships and shipyards, nearby halite and, most importantly, dominance. For each position on the map, we calculate the dominance of a player as their ship advantage (taking cargo values into account) over the other players in the region. This is quite similar to the safety matrix, with the difference that we treat each player separately.&lt;/p&gt;
&lt;p&gt;&lt;img alt="dominance_map" src="https://user-images.githubusercontent.com/9535190/93397072-19743980-f879-11ea-91e3-c4cd2e2010d2.png" /&gt;&lt;/p&gt;
&lt;h5 id="chasing-and-intercepting-targets"&gt;Chasing and intercepting targets&lt;/h5&gt;
&lt;p&gt;We boost the hunting scores of targets that can safely move (without risking destruction or being surrounded by our ships) to fewer than two cells, especially for nearby hunting ships. In case a target can move safely in only one direction, we calculate possible interceptions: we simply check whether one of our ships can intercept the target, which is forced to keep moving in one direction, by reaching a cell in front of it in time.&lt;/p&gt;
&lt;p&gt;&lt;img alt="interceptions" src="https://user-images.githubusercontent.com/9535190/93397985-fd719780-f87a-11ea-913d-433413c83e7d.png" /&gt;&lt;/p&gt;
&lt;h4 id="guarding"&gt;Guarding&lt;/h4&gt;
&lt;p&gt;Of course, guarding our halite plantations is one of the most important parts of our strategy. Therefore a certain proportion of our hunting ships (those with the worst hunting scores) is turned into guarding ships. Multiple factors, like our player score and the number of enemies in our farming zone, determine the exact proportion.&lt;/p&gt;
&lt;h5 id="shipyard-guarding"&gt;Shipyard guarding&lt;/h5&gt;
&lt;p&gt;For every shipyard, we always try to have two ships that are at most as far from the shipyard as the two nearest enemies, so they can defend it. This simple algorithm works quite well when somebody tries to crash into our shipyard with a single ship. However, we often cannot defend our shipyards against multiple ships attacking within a short time window: the following guarding ships frequently have to wait on cells and thereby mine halite, which means they can no longer win a collision against attackers carrying 0 halite.&lt;/p&gt;
&lt;h5 id="patrolling-along-borders"&gt;Patrolling along borders&lt;/h5&gt;
&lt;p&gt;Our border guarding ships fulfil multiple purposes, the most crucial being an increase of our dominance, which scares away opponents, and good positioning of 0-halite ships around our plantation, which enables promising attacks on enemies within our borders. As with the other scoring functions, we calculate a guarding score for every border position for every ship and maximize the combined scores. We boost the guarding scores of border positions near our ships, near shipyards and in regions with bad dominance values, so our guarding ships actively improve the dominance in those regions and limit the number of enemies that get into our plantation.&lt;/p&gt;
&lt;h4 id="shipyard-placement"&gt;Shipyard placement&lt;/h4&gt;
&lt;p&gt;Shipyards are easily one of the most critical aspects of our strategy as they are crucial for the efficient use of our plantation. Well-placed shipyards exert dominance, allow short return distances for mining ships and serve as an entry point into hostile territory. The most important factor for lucrative plantations is the halite population of an area, i.e. the number of cells that contain halite, as only cells with more than 0 halite produce halite. Our bot therefore calculates the next shipyard location each time our ship count exceeds one of several thresholds. After planning a new shipyard, we increase the hunting scores of enemy targets near the chosen site, slowly securing the position (if we succeed). Then we send two ships: a guarding ship that makes sure the newly constructed shipyard doesn't get immediately destroyed and a constructing ship that converts into a shipyard once it reaches the position. The way we choose the next shipyard position varies greatly with our current shipyard count, our player score, etc.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We place the first shipyard near clusters of halite-rich cells to get a good start in the game.&lt;/li&gt;
&lt;li&gt;The second shipyard is also placed in a halite-rich region, but we additionally take into account the halite population in a circle covering the area between our two shipyards.&lt;/li&gt;
&lt;li&gt;We place the third shipyard so that the enclosed area maximizes the halite population, but we try not to place it near enemy shipyards of players that aren't completely out of the game.&lt;/li&gt;
&lt;li&gt;After the third shipyard, we try to maximize the halite population and dominance values (using a very blurry dominance map).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="special-regions"&gt;Special regions&lt;/h4&gt;
&lt;p&gt;When we have more than one shipyard, we define different regions in the space between and around our shipyards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Farming zone: the cells we use for growing and harvesting halite.&lt;/li&gt;
&lt;li&gt;Minor farming zone: also farming positions, but not as easy to protect as the real farming positions. Therefore we don't let them grow for as long as the others and don't penalize ships for travelling over them.&lt;/li&gt;
&lt;li&gt;Guarding zone: the hunting scores of enemies on these positions get increased, making our bot take 1:1 trades when an enemy ship stays there for too long.&lt;/li&gt;
&lt;li&gt;Border positions: simply a border around our plantations along which our guarding ships patrol.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="illustrated_regions" src="https://user-images.githubusercontent.com/9535190/93515339-b6da7680-f928-11ea-82bf-f6e3fb1e1faf.png" /&gt;&lt;/p&gt;
&lt;h4 id="spawning"&gt;Spawning&lt;/h4&gt;
&lt;p&gt;For ship spawning, we almost always use a simple rule: always spawn until step x, and spawn ships at the shipyards with the lowest dominance to reinforce the defending forces.&lt;/p&gt;
&lt;h3 id="some-strategic-considerations"&gt;Some strategic considerations&lt;/h3&gt;
&lt;h4 id="ship-trades"&gt;Ship trades&lt;/h4&gt;
&lt;p&gt;Generally speaking, taking 1:1 trades by letting a ship crash into an enemy ship with equal cargo isn't the best thing to do in a four-player game because you weaken yourself and only one of the opponents. Despite this, our bot often takes 1:1 trades for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the early game we greedily ignore which cells enemy ships with equal cargo could move to, essentially hoping that our opponent isn't as mad as we are and avoids crashes.&lt;/li&gt;
&lt;li&gt;On the cells directly neighbouring a shipyard, we also take 1:1 trades to defend our shipyards.&lt;/li&gt;
&lt;li&gt;When we are far ahead in the game, our bot is also willing to take trades as the benefit of weakening an opponent outweighs the downside of losing one of many ships.&lt;/li&gt;
&lt;li&gt;Ships that stay on our farming plantations for too long without mining are also targeted by our hunting ships to remove them from our farming positions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Due to our dependence on well-placed shipyards and the fact that we cannot simply move our stationary halite plantations as the exponential halite regeneration takes a lot of time to let halite grow, we are forced to defend our shipyards at all costs. Furthermore, we destroy enemy shipyards in our farming/guarding zone (unless we have many shipyards and can afford to lose one plantation). All in all, these two issues are a huge downside to our strategy as we often drain our material so much that we and the opponent who attacked us/was attacked by us finish in the last two places.&lt;/p&gt;
&lt;h4 id="player-scores-and-their-implications"&gt;Player scores and their implications&lt;/h4&gt;
&lt;p&gt;Every turn, we calculate different sets of player scores, taking different metrics into account, to see who is most likely to win, who has the biggest navy, who has the greatest map presence, etc.
The resulting rankings are used for several decisions: for example, we don't want to place a shipyard next to a player that has far more ships, but we do want to punish greedy bots with a low ship count that start banking halite too early by attacking them. Another important point is that we boost the hunting scores for players competing for the same place as we are. For example, we attack the player with the second-highest player score when we are first, while we try to finish at least third by attacking the worst bot besides us when we are far behind.&lt;/p&gt;
&lt;h3 id="tuning-parameters"&gt;Tuning parameters&lt;/h3&gt;
&lt;p&gt;At first, we used a simple (poorly implemented) evolutionary algorithm to tune our parameters, but as our bot got better, our quickly implemented bots with different strategies were no longer useful opponents and we had to play only against our own bots. For a while we could still achieve okay-ish results by limiting the games to 110 turns, thus improving our early game, but it became clear that we couldn't overcome the problems with this approach, namely symmetry, overfitting and too little noise, which led to some bot snowballing and winning by chance.
In the end, we tuned some parameters manually and wanted to wait until the last day of the competition to optimize them, but sadly I introduced a critical bug the day before, resulting in our final bots performing poorly.&lt;/p&gt;
&lt;h2 id="regrets-and-possible-improvements"&gt;Regrets and possible improvements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;replay viewer with statistics and basic information like ship cargo, halite values, etc. (really wish we had programmed this)&lt;/li&gt;
&lt;li&gt;pathfinding (Our bot uses close to no pathfinding resulting in ships blocking each other quite often.)&lt;/li&gt;
&lt;li&gt;avoid positions where our ships get caught too easily&lt;/li&gt;
&lt;li&gt;maybe a more stateful agent as our ships sometimes change their ship types too frequently&lt;/li&gt;
&lt;li&gt;fewer parameters; sometimes magic numbers aren't that bad&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="code"&gt;Code&lt;/h2&gt;
&lt;p&gt;The code for our bot can be found on &lt;a href="https://github.com/0Zeta/HaliteIV-Bot"&gt;GitHub&lt;/a&gt;. However, the code quality is quite poor.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Our best bot peaking at third place on the leaderboard (1505 Elo) during the final week and our final result, 4th place, completely exceeded our expectations. We are extremely satisfied with our results, especially since both of us basically just got out of school and have no professional or academic background.
The last months have been a very interesting journey for us and we are very happy about how the competition turned out, the knowledge we gained and the great community we could count ourselves part of.&lt;/p&gt;</content>
  </entry>
</feed>