David Lindner
210 posts
David Lindner
@davlindner
Making AI safer @GoogleDeepMind
London, UK
davidlindner.me
Joined April 2012
347 Following
1,807 Followers
  • Pinned
    David Lindner
    @davlindner
    May 29
    Will your AI agent secretly sabotage your work? Existing alignment evals don't directly answer this question. Meet Gram: the alignment auditing tool we use to assess how likely AI agents are to engage in sabotage during internal deployments at @GoogleDeepMind
    13K
  • David Lindner
    @davlindner
    Jan 23, 2025
    New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization, but with better performance – details in 🧵
    159K
  • David Lindner
    @davlindner
    Jan 13, 2023
    Did you ever want to build your own transformer models from scratch? Well, now you can! We introduce Tracr to translate code written in RASP (a domain-specific language for transformers) into weights of a GPT-like model: arxiv.org/abs/2301.05062 Why would you want to do this? 🧵👇
    arxiv.org
    Tracr: Compiled Transformers as a Laboratory for Interpretability
    We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design...
    138K
  • David Lindner
    @davlindner
    Feb 10, 2025
    Want to join one of the best AI safety teams in the world? We're hiring @GoogleDeepMind! We have open positions for research engineers and research scientists in the AGI Safety & Alignment and Gemini Safety teams. Locations: London, Zurich, New York, Mountain View and SF
    38K
  • David Lindner
    @davlindner
    Sep 23, 2023
    Very happy to share that I recently joined @GoogleDeepMind's Alignment team! In this new role, I'm aiming to help ensure that the next generation of AI systems is developed responsibly, and that AI will become more interpretable and trustworthy.
    13K
  • David Lindner
    @davlindner
    Jul 17, 2022
    Our #ICML2022 paper argues that representing learned human preferences as constraints might be a more robust alternative to only representing them as reward functions. This motivates looking at actively learning about unknown constraints from (expensive) human feedback. 🧵👇
  • David Lindner
    @davlindner
    Jan 23, 2025
    Replying to @davlindner @seb_far and 3 others
    Key idea: Use RL training only for short horizons (myopic optimization), but have an overseer evaluate how good actions are for the long term (non-myopic approval). Best of both worlds: we get human-understandable plans (safe!) and long-term planning (performant!)
    4.9K
  • David Lindner
    @davlindner
    Jun 16, 2022
    I'm excited to share that I will spend this summer as a research intern at DeepMind's Scalable Alignment team! I will move to London next week; looking forward to meeting old/new friends over there 😊
  • David Lindner
    @davlindner
    Aug 31, 2023
    Yesterday, I successfully defended my doctoral thesis 🎓🎉 Thanks to everyone who came to the defense and made it a special experience 😊 The past few years were an incredible chapter of my life. While it's a bit sad to close it now, I'm looking forward to opening the next one!
    1.6K
  • David Lindner
    @davlindner
    Dec 6, 2024
    New paper on evaluating instrumental self-reasoning ability in frontier models 🤖🪞 We propose a suite of agentic tasks that are more diverse than prior work and give us a more representative picture of how good models are at, e.g., self-modification and embedded reasoning
    2.6K
  • David Lindner
    @davlindner
    Jan 23, 2025
    Replying to @davlindner
    What happens if the agent tries a multi-step reward hack that the overseer can’t detect? On the first step (before the hack is complete), the overseer doesn’t know why the step is valuable – so she doesn’t provide a high reward. So the first step isn’t incentivized by MONA.
    3.6K
  • David Lindner
    @davlindner
    Jan 13, 2023
    Replying to @davlindner
    We believe Tracr can help to accelerate interpretability research. Often it can be difficult to check if the explanation an interpretability tool provides is correct. Tracr allows us to create a range of ground truth models in which we can confirm that our tools work!
    4.4K
  • David Lindner
    @davlindner
    Jan 23, 2025
    Replying to @davlindner
    Want to dive deeper? Check out our paper and our blog posts explaining the work in more detail.
    📄 Paper: arxiv.org/abs/2501.13011
    💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-method-…
    ⚙️ Technical safety post: alignmentforum.org/posts/zWySWKuX…
    3.7K
  • David Lindner
    @davlindner
    Mar 21, 2024
    Glad to share the first project I've been working on since joining @GoogleDeepMind. In this report, we present novel evaluations to test frontier models for potentially dangerous capabilities:
    Toby Shevlane
    @tshevl
    Mar 21, 2024
    In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
    arxiv.org
    Evaluating Frontier Models for Dangerous Capabilities
    To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot...
    2.7K
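
The MONA posts from Jan 23, 2025 argue that with short-horizon (myopic) training plus overseer approval, the first step of an undetected multi-step reward hack is not incentivized. The toy Python sketch below illustrates only that incentive argument. It is not the paper's implementation: the function names, the reward numbers, and the choice to simply add reward and approval are assumptions made for illustration.

```python
from typing import List, Sequence


def full_horizon_targets(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Ordinary RL: each step's training target includes all later rewards."""
    targets: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    targets.reverse()
    return targets


def myopic_approval_targets(
    rewards: Sequence[float], approvals: Sequence[float]
) -> List[float]:
    """Myopic sketch: each step is credited only with its own immediate
    reward plus the overseer's approval of that step (assumed additive)."""
    return [r + a for r, a in zip(rewards, approvals)]


# Hypothetical two-step hack: step 0 sets it up, step 1 cashes it in.
env_rewards = [0.0, 10.0]
# The overseer cannot tell why step 0 is valuable, so she gives it no approval.
overseer_approvals = [0.0, 0.0]

ordinary = full_horizon_targets(env_rewards)
myopic = myopic_approval_targets(env_rewards, overseer_approvals)

print("full-horizon target for the setup step:", ordinary[0])
print("myopic + approval target for the setup step:", myopic[0])
```

Under full-horizon training the setup step inherits credit from the later payoff, while in the myopic sketch its target stays at zero, which is the sense in which the first step "isn't incentivized".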