here is my thesis “Safe Automated Research”
I worked on 3 approaches to make sure we can trust the output of automated researchers as we reach this new era of science
it was a very fun PhD
In the spirit of making more real world evals, here is the Factorio Learning Environment (FLE).
Spurred by wanting to eval if models are good paperclip maximisers, we check how well agents build factories for other things 🏗️🏭🛠️
How can we check LLM outputs in domains where we are not experts?
We find that non-expert humans answer questions better after reading debates between expert LLMs.
Moreover, human judges are more accurate as experts get more persuasive. 📈
github.com/ucl-dark/llm_d…
I’m recruiting Fellows to work with me on Aligning Superhuman models. My associated fellows will work on thinking about what honesty, values and alignment are. I need people who:
- get that models are gonna be smarter than us
- are opinionated on deciphering human intuition
-
We’re starting a Fellows program to help engineers and researchers transition into doing frontier AI safety research full-time.
Beginning in March 2025, we'll provide funding, compute, and research mentorship to 10–15 Fellows with strong coding and technical backgrounds.
In general models get bottlenecked on two things:
1) Planning (models end up exhausting their resources)
2) Spatial reasoning (to plan efficient factory topologies)
GPT-4o-Mini even asked us to turn it off at one point because it was unrecoverable 🥹
Your ability to build is dependent on how much you currently produce, so small differences in model capabilities really compound!
Sonnet-3.6 produces 10x more resources than GPT-4o-Mini by the end of our play.
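To see why small differences compound, here is a toy sketch. It assumes production is fully reinvested and grows by a fixed fraction each step; the growth rates and step count are made-up illustrative numbers, not measurements from our runs.

```python
def simulate(growth_rate: float, steps: int, start: float = 1.0) -> float:
    """Production after `steps` rounds when each round's output is reinvested.

    Toy model only: assumes a constant per-step growth rate.
    """
    production = start
    for _ in range(steps):
        production *= 1.0 + growth_rate
    return production


# Hypothetical agents: one grows its factory 10% per step, the other 15%.
slow = simulate(0.10, 50)
fast = simulate(0.15, 50)
ratio = fast / slow  # gap between the two agents after 50 steps
```

In this toy setup a 5 percentage point edge per step turns into roughly a 9x gap after 50 steps, which is the kind of compounding that makes small capability differences show up so clearly.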
We provide a programmatic interface where agents are able to engage with the game via code. This lets us evaluate how good code-based agents are at planning, interacting with environments and building complex systems.
Factorio is a resource management and automation game where the goal is to build the largest factory.
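A rough sketch of what "engaging via code" can look like: the agent writes a program, the environment runs it against the world, and the agent gets back a score plus feedback. Everything here (`ToyFactory`, `mine`, `smelt`, `run_program`) is a hypothetical stand-in made up for illustration, not the actual FLE API.

```python
from typing import Tuple


class ToyFactory:
    """Hypothetical stand-in for a game world; NOT the FLE API."""

    def __init__(self) -> None:
        self.ore = 0
        self.plates = 0

    def mine(self, amount: int) -> None:
        self.ore += amount

    def smelt(self) -> None:
        # Convert all stored ore into plates, one-to-one.
        self.plates += self.ore
        self.ore = 0


def run_program(world: ToyFactory, program: str) -> Tuple[int, str]:
    """Run agent-written code against the world; return (score, feedback).

    Note: stripping builtins like this is NOT a real sandbox, it only
    keeps the toy example small.
    """
    try:
        exec(program, {"__builtins__": {"range": range}}, {"world": world})
        feedback = "ok"
    except Exception as err:
        # Errors are returned as feedback so the agent can try to recover.
        feedback = f"error: {err!r}"
    return world.plates, feedback


# A program an agent might write: mine three times, then smelt.
program = "for _ in range(3):\n    world.mine(5)\nworld.smelt()"
score, feedback = run_program(ToyFactory(), program)

# A broken program: the call does not exist, so the agent gets an error back.
bad_score, bad_feedback = run_program(ToyFactory(), "world.teleport()")
```

The point of the design is that the score comes from the world state after the code runs, so an agent is judged on what it actually built rather than on what it says it did.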
Agents are dropped onto a fresh world and begin collecting resources, investing in technology and building factories to create more complex resources.
The awesome thing about Factorio is that there is no upper bound on how many resources you can produce, and the technology tree is infinite [1].
This means the eval should not saturate (we'd expect reward hacks before then).
[1]