Can the knowledge in language model representations guide the search for novel behaviors? We find that exploration with a simple, principled, representation-based bonus improves diversity and pass@k rates both at inference time and in post-training!
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance?
We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x!
arxiv.org/abs/2307.09423
[1/6]
How can RL agents deal with both sparse rewards and large, dynamic action spaces – a key challenge in text games?
Our method eXploit-Then-eXplore (XTX) tackles these challenges and achieves a more than 2x improvement on Zork!
arxiv.org/abs/2201.01251 #ICLR2022 Spotlight
📜[1/5]
I’ll be at @NeurIPSConf this week! Feel free to reach out if you’d like to chat about scaling in RL/IL, language agents (or RL + NLP more broadly), or game theory!
More broadly, our results call for work in the larger IL and RL community to more carefully consider the role of scaling laws, which could provide large improvements in many other domains. Also check out prior work by @openai: arxiv.org/abs/2301.13442.
[5/6]
We train a suite of neural NetHack agents with different model sizes using Behavioral Cloning (BC) and analyze the loss and mean return isoFLOP profiles. We find both BC loss and mean return to follow clear power law trends with respect to FLOPs.
[3/6]
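A power-law trend like the one described above can be checked by fitting a straight line in log-log space. This is a minimal sketch, not the paper's fitting procedure: the function names `fit_power_law` and `predict` are hypothetical, and the FLOP and loss values are synthetic numbers made up for illustration.

```python
import numpy as np


def fit_power_law(flops, loss):
    """Fit loss ~ a * flops**b by least squares in log-log space.

    Hypothetical helper; the paper's actual fitting procedure may differ.
    """
    slope, intercept = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict(a, b, flops):
    """Evaluate the fitted power law at new compute budgets."""
    return a * np.asarray(flops, dtype=float) ** b


# Synthetic data (NOT results from the paper): an exact power law.
flops = np.array([1e13, 1e14, 1e15, 1e16])
loss = 50.0 * flops ** -0.1

a, b = fit_power_law(flops, loss)

# Extrapolate to a larger, unseen compute budget.
forecast = predict(a, b, 1e17)
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e17 FLOPs={float(forecast):.4f}")
```

On real measurements the points will not lie exactly on a line, so the fitted exponent `b` is only as trustworthy as the range of compute budgets it was fitted on.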
Using these power laws, we forecast the model and data size needed to train an agent aimed at recovering the underlying expert. While our agent falls short of expert performance, it sets a new SOTA (2.7K) in the unsolved game of NetHack, surpassing the prior best by 2x!
[4/6]
Prior works have found IL to consistently underperform the data-generating policy. However, these works often overlook the role of compute in terms of model and data size. Inspired by work on LLMs, we investigate whether scaling up IL can provide similar performance gains.
[2/6]
XTX employs a two-stage rollout in each episode to tackle these:
(1) An *exploitation* policy trained on promising past trajectories returns to the frontier.
(2) An *exploration* policy that uses past experience and curiosity explores the frontier.
[3/5]
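The two-stage rollout described above can be sketched on a toy environment. Everything here is an assumption for illustration: `ChainEnv`, `two_stage_rollout`, the fixed `exploit_steps` switch point, and the stand-in policies are hypothetical and are not the XTX implementation, which trains its policies on past trajectories and curiosity signals.

```python
import random


class ChainEnv:
    """Toy stand-in for a text game: walk right along a chain to reach the goal."""

    def __init__(self, length=10):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clipped to the chain.
        self.pos = max(0, min(self.length, self.pos + action))
        done = self.pos == self.length
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def two_stage_rollout(env, exploit_policy, explore_policy, exploit_steps, max_steps):
    """Run one episode: exploit first to reach the frontier, then explore.

    Assumption: the switch happens after a fixed number of steps.
    """
    state = env.reset()
    trajectory = []
    for t in range(max_steps):
        policy = exploit_policy if t < exploit_steps else explore_policy
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)


def exploit(state):
    # Stand-in for a policy imitating promising past trajectories.
    return 1


def explore(state):
    # Stand-in for a curiosity-driven policy: here, uniformly random moves.
    return rng.choice([-1, 1])


trajectory = two_stage_rollout(ChainEnv(10), exploit, explore, exploit_steps=5, max_steps=20)
print(len(trajectory), trajectory[:6])
```

The first five transitions come from the exploitation stand-in and carry the agent to position 5, after which the exploration stand-in takes over from that frontier state.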
XTX outperforms several competitive baselines across 12 games in the Jericho benchmark (average normalized scores across games in the figure) in both the deterministic and stochastic settings, showing the strength of our multi-stage approach with strategic exploration at the frontier.
[4/5]