Our new @GoogleDeepMind paper studies novel activation probe architectures for classifying real-world misuse risks.
Our research has informed live deployments of probes in Gemini. 🧵
Been really enjoying unfaithful CoT research with collaborators recently. Two observations:
1) It quickly becomes clear that models sneak in reasoning without verbalising where it comes from (e.g. writing down an equation that gets the correct answer, but is defined out of thin air)
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! arxiv.org/abs/2304.14997 1/N 🧵
We are hiring Applied Interpretability researchers on the GDM Mech Interp Team!🧵
If interpretability is ever going to be useful, we need it to be applied at the frontier. Come work with @NeelNanda5, the @GoogleDeepMind AGI Safety team, and me: apply by 28th February as a
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈
It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵 dpmd.ai/imo-silver
How much can you steal from an LLM API that returns logprobs? 🧵
In our new paper, collaborators noticed that the LLM vocab size is always bigger than the hidden dimension, so logprobs lie inside a hidden-dimension-sized subspace, which means we can steal that dimension.
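A rough sketch of why this works, on a toy model rather than a real API. Everything here is an assumption for illustration: `toy_logprobs` stands in for an API that returns full logprob vectors, and `estimate_hidden_dim` is a hypothetical helper, not the paper's actual attack. Log-softmax shifts each logit vector by a scalar, so the sketch centers each row to drop that extra direction and then counts the non-negligible singular values.

```python
import numpy as np

def toy_logprobs(vocab=500, hidden=32, n_queries=100, seed=0):
    """Hypothetical stand-in for an API: logprobs from a random final layer."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((vocab, hidden))      # unembedding matrix (assumed)
    H = rng.standard_normal((n_queries, hidden))  # one hidden state per query
    logits = H @ W.T                              # shape (n_queries, vocab)
    # Numerically stable log-softmax over the vocab axis.
    m = logits.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return logits - lse

def estimate_hidden_dim(logprobs, rel_tol=1e-8):
    """Estimate the hidden dimension as the numerical rank of centered logprobs."""
    # Subtracting the per-row mean removes the scalar shift added by log-softmax.
    centered = logprobs - logprobs.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    # Count singular values that are not numerically zero.
    return int((s > rel_tol * s[0]).sum())

if __name__ == "__main__":
    print(estimate_hidden_dim(toy_logprobs(hidden=32)))
```

The number of queries has to exceed the hidden dimension for the rank to show up, and a real API would add noise and restrictions that this toy setup ignores.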
AI springs from California. Thank you, @CAgovernor Newsom, for recognizing the opportunity and responsibility we all share to enable small entrepreneurs and academia – not big tech – to dominate.
gov.ca.gov/2024/09/29/gov…
2) Verification is considerably harder than generation. Even when there are only a few hundred tokens, it often takes me several minutes to work out whether the reasoning is OK or not