Bartłomiej Cupiał (@CupiaBart) / X

PhD Student @ University of Warsaw | @IDEAS_NCBR bartekcupial.github.io

Warsaw, Poland

Joined May 2019

Bartłomiej Cupiał
@CupiaBart
May 24, 2024
So here's the story of by far the weirdest bug I've encountered in my CS career. Together with @maciejwolczyk, I've been training a neural network that learns how to play NetHack, an old roguelike game that looks like the one in the screenshot. Recently, something unexpected happened.
The moral is, if you encounter an unexpected bug, be sure to consult the lunar calendar. Big thanks to @JensTuyls for solving this for us!
So apparently NetHack has a mechanic that slightly changes how the game plays whenever it's a full moon according to your system clock: nethackwiki.com/wiki/Time The player character is luckier, werewolves appear in their animal form, and the dogs howl ominously.
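To make the mechanic concrete, here is a minimal Python sketch of a moon-phase calculation driven only by the calendar date. It is a port from memory of the `phase_of_the_moon` routine in NetHack's C sources, so treat the exact constants as an assumption rather than a verified copy; `is_full_moon` is a hypothetical helper added for illustration.

```python
from datetime import date


def phase_of_the_moon(day: date) -> int:
    """Return a phase in 0..7, where 0 is a new moon and 4 is a full moon.

    Assumed port of NetHack's C routine; constants are recalled, not verified.
    """
    day_of_year = day.timetuple().tm_yday - 1  # C's tm_yday is 0-based
    golden = ((day.year - 1900) % 19) + 1      # C's tm_year counts from 1900
    epact = (11 * golden + 18) % 30
    if (epact == 25 and golden > 11) or epact == 24:
        epact += 1
    return ((((day_of_year + epact) * 6 + 11) % 177) // 22) & 7


def is_full_moon(day: date) -> bool:
    """Hypothetical helper: True when the approximated phase is 'full'."""
    return phase_of_the_moon(day) == 4


if __name__ == "__main__":
    print(is_full_moon(date(2024, 5, 23)))  # True under this approximation
```

A check like this could run before an evaluation to flag dates on which the game is likely to behave differently from the training data.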
The next morning I see a lot of messages on Slack. Jens replied "Oh yes, it's probably a full moon today." What.
It doesn't make the game harder, but the model hasn't seen full moon data in its training set, so the score drops. In this particular case, it drops from 5k points to 3k points. We override the time so it's not a full moon, we evaluate the model - and it's 5k points again.
I check a moon phase calendar, and yes, it's a full moon today. Hands shaking, I start a new NetHack game, and the message says "You are lucky! Full moon tonight." What.
By this point we've spent several hours on this, and it's 7 PM. I am starting to feel like a madman. I can't even watch a TV show without constantly thinking about the bug. Before going to sleep I decide to ask @JensTuyls, the author of the model, if he knows what might be broken.
Namely, the CUDA libraries that allow us to compute things quickly on the GPU. So we suspect that maybe something about these libraries changed that degraded the model. Because what else could have? And yes, recently the version was changed from 11.8 to 12.4.
We use a model by @JensTuyls that clones expert behavior in NetHack, and we improve it using RL methods. That model gets 5000 points and we fine-tune it in the game so that the score improves. However, suddenly in a recent run, Jens' model only got 3000 points. Quite a drop.
Revert code a few weeks back? Still 3000 points. Luckily, the server we run our experiments on saves the files from the previous runs. We find the files corresponding to a run that previously got 5000 points, we re-run, and, well, it gets 3000. Nothing about the code changed.
The CUDA mismatch probably shouldn't impact the results in this particular way, but we see no other explanation. We override the version to 11.8 - we still get 3000 points. We build a new environment from scratch, for CUDA 12.4 - 3000 points. Welp.
We repeat the evaluation on a personal laptop. This is slow and expensive without the specialized hardware, but we make it work. Again, 3000 points. We disable multithreading, GPU, and some other things that have at least a conceivable chance of causing the problem - 3000 points.
We start suspecting our software stack. Thankfully, we use Singularity, which means that our whole environment is in a single, self-contained file. That file hasn't changed for a few months, so that shouldn't be the problem. However, the container loads one thing from the server.
This problem is consistent across seeds, so it's not just a fluke. Well, we probably screwed up something in the code for loading the model in a recent commit. Let's revert, no biggie. Except that after reverting to a version of the code from a few days back, we still get 3000.