Fazl Barez
1,128 posts
Let's build AIs we can trust!
- We're hiring! Looking for Interns, Research Assistants, and Postdocs to work on Automated Interpretability--building systems that can analyse, explain, and intervene on large models to make them safe! Work with me @Oxford, or remotely. Apply by Nov 15: forms.gle/bKp8x2eYiFfmpC…
- Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
- New Paper: arxiv.org/pdf/2401.01814… Can language models relearn removed concepts? Model editing aims to eliminate unwanted concepts through neuron pruning. LLMs demonstrate a remarkable capacity to adapt and regain conceptual representations that have been removed. 🧵 1/8
- 🚨 New Paper Alert: Open Problems in Machine Unlearning for AI Safety 🚨 Can AI truly "forget"? While unlearning promises data removal, controlling emergent capabilities is an inherent challenge. Here's why it matters. Paper: arxiv.org/pdf/2501.04952 1/8
- New paper with @_clementneo & Shay Cohen! We study how attention heads work with MLP neurons to predict the next token. We find a set of interpretable activity. More in the thread!
- How does a 1-layer transformer carry out n-digit addition? "Understanding Addition in Transformers" has been accepted to #ICLR2024! We find that a 1-layer model processes digit-specific streams in parallel, and uses distinct algorithms for different digit positions. 🧵 1/8
- New paper to appear at ACL 2023: arxiv.org/abs/2305.17553 Large Language Models (LLMs) are powerful tools, but they can memorize false or outdated associations. Model editing techniques promise to solve this, but do they really work? 1/
- New paper alert! 🚨 Important question: Do SAEs generalise? We explore answerability detection in LLMs by comparing SAE features vs. linear residual stream probes. Answer: probes outperform SAE features in-domain, while out-of-domain generalization varies sharply between
- New paper! How universal are features across LLMs? We tackle this question using Sparse Autoencoders (SAEs) and Representational Similarity Metrics. We find that SAEs trained on LLMs reveal universal feature spaces across LLMs.
- 🚨 New AI Safety Course @aims_oxford! I'm thrilled to launch a new course called AI Safety & Alignment (AISAA), covering the foundations & frontier research of making advanced AI systems safe and aligned, at @UniofOxford. What to expect: robots.ox.ac.uk/~fazl/aisaa/
- New Paper: Beyond Training Objectives: Interpreting Reward Model Divergence in LLMs 🚨 Does your LLM have the reward model you think it does? Performance in training doesn't provide much info about an LLM and can't distinguish deceptive LLMs from aligned ones. 1/8
- Technology = power. AI is reshaping power, fast. Today's AI doesn't just assist decisions; it makes them. Governments use it for surveillance, prediction, and control, often with no oversight. Our new paper proposes some ML safeguards to resist AI-enabled authoritarianism:
















