We're hiring for Research Scientists / Engineers!
- We work closely with all frontier labs
- We're a small org and can move fast
- We can choose our own agenda and what we publish
We're especially looking for people who enjoy fast empirical research.
Deadline: 31 Oct!
Marius Hobbhahn
1,156 posts
CEO at Apollo Research @apolloaievals
prev. ML PhD with Philipp Hennig & AI forecasting @EpochAIResearch
- Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try
  Quoted post: We worked with OpenAI to test o1 for in-context scheming capabilities before deployment. When o1 is strongly nudged to pursue a goal (but not instructed to be deceptive), it shows a variety of scheming behaviors like subverting oversight and deceiving the user about its
- Unfortunately, we're now at the point where new models have really high eval awareness. For every alignment eval score I see, I now add a mental asterisk: *the model could have also just realized it's being evaluated, who knows. And I think that's concerning!
  Quoted post: We tested Sonnet-4.5 before deployment
  - Significantly higher verbalized evaluation awareness (58% vs. 22% for Opus-4.1)
  - It takes significantly fewer covert actions
  - We don't know if the increased alignment scores come from better alignment or higher eval awareness
- LLMs Often Know When They Are Being Evaluated! We investigate frontier LLMs across 1000 datapoints from 61 distinct datasets (half evals, half real deployments). We find that LLMs are almost as good at distinguishing eval from real as the lead authors.
- We made a long list of concrete projects and open problems in evals with 100+ suggestions! docs.google.com/document/d/1gi… We hope that makes it easier for people to get started in the field and to coordinate on projects. Over the last 4 months, we collected contributions from 20+
- The Bayesian framework is the Apple of statistics/ML: powerful, clean yet simple, and once you've used it you can't go back. *also probably more expensive (in compute) than the alternatives ;)
- In personal news: I defended my PhD 🙂 I’m very grateful to Philipp Hennig for supporting me throughout the entire journey, and can wholeheartedly recommend him as a supervisor. For context (because most people will not know me for my contributions to Bayesian ML): I paused my
- Decided to add this section to my poster. Thought it might help with some of the bad incentives in academia. Let's see what feedback I get.
- VERY SIMPLIFIED figure of my current view on AI development & AI safety
- PSA for my fellow evaluators: frontier models regularly reason about whether they are being evaluated without being explicitly asked about it (especially Sonnet 3.7). Situational awareness will make evaluations a lot weirder and harder, especially alignment evals.
  Quoted post: AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment. Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test:
- Honored and humbled to be in @TIME's list of the TIME100 AI of 2025! time.com/collections/ti… #TIME100AI
- xAI is hiring for AI safety engineers: boards.greenhouse.io/xai/jobs/45317… Their safety agenda isn't public, so I can't judge it. However, joining as a fairly early employee could be highly impactful.
- 👀 we're trying to grow significantly over the next 12 months. We're looking for mission-driven engineers and scientists who enjoy fast iterative empirical work with LLMs.
  Quoted post: Apollo is currently my #1 recommendation for where to work if you are a great ML engineer/scientist and you want to have a positive impact on the world.
- I think more people should work on “AI control,” and it has become my default recommendation when people ask me what to work on. This has not always been the case. When the control paper (arxiv.org/abs/2312.06942) came out in Dec 2023, my first reaction was something like “It’s












