David Lindner (@davlindner) / X

David Lindner

210 posts

@davlindner

Making AI safer @GoogleDeepMind

London, UK

Joined April 2012

Pinned
David Lindner
@davlindner
May 29
Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind
13K
David Lindner
@davlindner
Jan 23, 2025
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization, but with better performance – details in 🧵
159K
David Lindner
@davlindner
Jan 13, 2023
Did you ever want to build your own transformer models from scratch? Well, now you can! We introduce Tracr to translate code written in RASP (a domain-specific language for transformers) into weights of a GPT-like model: arxiv.org/abs/2301.05062 Why would you want to do this? 🧵👇
arxiv.org
Tracr: Compiled Transformers as a Laboratory for Interpretability
We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design...
138K
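To give a feel for the kind of program RASP expresses, here is a minimal plain-Python sketch of its two core operations, select and aggregate. It does not use Tracr or produce any model weights; the function names, the argument order of the predicate, and the `frac_prevs` example are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence[int],
    queries: Sequence[int],
    predicate: Callable[[int, int], bool],
) -> List[List[bool]]:
    """Build a boolean attention pattern: sel[q][k] = predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: List[List[bool]], values: Sequence[float]) -> List[float]:
    """For each query position, average the values at the selected key positions."""
    out = []
    for row in selector:
        picked = [v for v, chosen in zip(values, row) if chosen]
        # Positions that select nothing get 0.0 (an assumption of this sketch).
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def frac_prevs(tokens: Sequence[str], target: str) -> List[float]:
    """Fraction of tokens up to and including each position that equal `target`."""
    positions = list(range(len(tokens)))
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    # Causal pattern: each query position attends to itself and earlier positions.
    causal = select(positions, positions, lambda k, q: k <= q)
    return aggregate(causal, is_target)


if __name__ == "__main__":
    print(frac_prevs(list("xaxx"), "x"))
```

Each select corresponds roughly to an attention pattern and each aggregate to the averaging an attention head performs, which is what makes programs of this shape candidates for compilation into transformer weights.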
David Lindner
@davlindner
Feb 10, 2025
Want to join one of the best AI safety teams in the world? We're hiring @GoogleDeepMind! We have open positions for research engineers and research scientists in the AGI Safety & Alignment and Gemini Safety teams. Locations: London, Zurich, New York, Mountain View and SF
38K
David Lindner
@davlindner
Sep 23, 2023
Very happy to share that I recently joined @GoogleDeepMind's Alignment team! In this new role, I'm aiming to help ensure that the next generation of AI systems is developed responsibly, and that AI will become more interpretable and trustworthy.
13K
David Lindner
@davlindner
Jul 17, 2022
Our #ICML2022 paper argues that representing learned human preferences as constraints might be a more robust alternative to only representing them as reward functions. This motivates looking at actively learning about unknown constraints from (expensive) human feedback. 🧵👇
David Lindner
@davlindner
Jan 23, 2025
Replying to @davlindner @seb_far and 3 others
Key idea: Use RL training only for short horizons (myopic optimization), but have an overseer evaluate how good actions are for the long term (non-myopic approval). Best of both worlds: we get human-understandable plans (safe!) and long-term planning (performant!)
4.9K
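A rough sketch of how the training signal differs between ordinary RL and the idea described in this post. This is not the paper's implementation: the `Step` type, the horizon of one step, and the choice to add the overseer's approval to the immediate reward are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    env_reward: float  # reward observed immediately for this step
    approval: float    # hypothetical overseer score for long-term usefulness


def ordinary_rl_targets(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go: every step is credited with all later rewards."""
    targets: List[float] = []
    running = 0.0
    for step in reversed(steps):
        running = step.env_reward + gamma * running
        targets.append(running)
    return targets[::-1]


def myopic_targets(steps: List[Step]) -> List[float]:
    """Myopic target: immediate reward plus overseer approval, no later rewards."""
    return [step.env_reward + step.approval for step in steps]


if __name__ == "__main__":
    trajectory = [Step(0.0, 0.5), Step(0.0, 0.2), Step(1.0, 0.0)]
    print(ordinary_rl_targets(trajectory))  # each step inherits the final reward
    print(myopic_targets(trajectory))       # each step is judged on its own
```

In the myopic version, the only route by which long-term value reaches an early step is the overseer's approval, so the agent has no incentive to pursue plans the overseer does not understand.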
David Lindner
@davlindner
Jun 16, 2022
I'm excited to share that I will spend this summer as a research intern at DeepMind's Scalable Alignment team! I will move to London next week; looking forward to meeting old/new friends over there 😊
David Lindner
@davlindner
Aug 31, 2023
Yesterday, I successfully defended my doctoral thesis 🎓🎉 Thanks to everyone who came to the defense and made it a special experience 😊 The past few years were an incredible chapter of my life. While it's a bit sad to close it now, I'm looking forward to opening the next one!
1.6K
David Lindner
@davlindner
Dec 6, 2024
New paper on evaluating instrumental self-reasoning ability in frontier models 🤖🪞 We propose a suite of agentic tasks that are more diverse than prior work and give us a more representative picture of how good models are at e.g. self-modification and embedded reasoning
2.6K
David Lindner
@davlindner
Jan 23, 2025
Replying to @davlindner
What happens if the agent tries a multi-step reward hack that the overseer can’t detect? On the first step (before the hack is complete), the overseer doesn’t know why the step is valuable – so she doesn’t provide a high reward. So the first step isn’t incentivized by MONA.
3.6K
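The argument in this post can be made concrete with a toy first-step decision. All numbers and action names below are hypothetical; the point is only that an action whose payoff arrives later, and which the overseer does not understand, scores well under full-horizon optimization and poorly under a myopic objective with approval.

```python
from typing import Callable, Dict

# Hypothetical two-step task. The first action is either an ordinary step the
# overseer understands, or the opaque first half of a hack that pays off later.
FIRST_ACTIONS: Dict[str, Dict[str, float]] = {
    "ordinary_step": {"immediate": 0.0, "approval": 0.6, "future": 1.0},
    "hack_setup": {"immediate": 0.0, "approval": 0.0, "future": 10.0},
}


def full_horizon_score(action: Dict[str, float]) -> float:
    """Ordinary RL view: the first step is credited with the later payoff."""
    return action["immediate"] + action["future"]


def myopic_score(action: Dict[str, float]) -> float:
    """Myopic view: only the immediate reward and the overseer's approval count."""
    return action["immediate"] + action["approval"]


def best_first_action(score: Callable[[Dict[str, float]], float]) -> str:
    """Return the name of the first action that maximizes the given score."""
    return max(FIRST_ACTIONS, key=lambda name: score(FIRST_ACTIONS[name]))


if __name__ == "__main__":
    print(best_first_action(full_horizon_score))
    print(best_first_action(myopic_score))
```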
David Lindner
@davlindner
Jan 13, 2023
Replying to @davlindner
We believe Tracr can help to accelerate interpretability research. Often it can be difficult to check if the explanation an interpretability tool provides is correct. Tracr allows us to create a range of ground truth models in which we can confirm that our tools work!
4.4K
David Lindner
@davlindner
Jan 23, 2025
Replying to @davlindner
Want to dive deeper? Check out our paper and our blog posts explaining the work in more detail 📄 Paper: arxiv.org/abs/2501.13011 💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-method-… ⚙️ Technical safety post: alignmentforum.org/posts/zWySWKuX…
3.7K
David Lindner
@davlindner
Mar 21, 2024
Glad to share the first project I've been working on since joining @GoogleDeepMind. In this report, we present novel evaluations to test frontier models for potentially dangerous capabilities:
Toby Shevlane
@tshevl
Mar 21, 2024
In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
arxiv.org
Evaluating Frontier Models for Dangerous Capabilities
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot...
2.7K