FAR.AI (@farairesearch) / X

FAR.AI

1,077 posts

@farairesearch

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Berkeley, California

Joined February 2023

Pinned
FAR.AI
@farairesearch
May 5
Our Q1 2026 newsletter is out: deception detection research, alignment workshops, technical AI policy, and new hiring. Highlights: models learning to evade lie detectors, a new method for tracing misbehavior to training data, and prefill attacks that broke every open-weight model
1.5K
FAR.AI
@farairesearch
Sep 17, 2024
"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.
5.4M
FAR.AI
@farairesearch
Jul 24, 2024
🤖❓How could an AI agent really know what we mean without a good model of how we think? 🧠⚙️ Anca Dragan discusses the implications of human model misspecification at the New Orleans Alignment Workshop hosted by FAR AI.
3.3M
FAR.AI
@farairesearch
Jan 13, 2025
“We found that if you ask the LLM, surprisingly it always says that I'm 100% confident about my reasoning.” @_cagarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty & hallucination.
2.2M
FAR.AI
@farairesearch
Jun 21, 2024
🤔 👾 Could we instill AI agents with Bayesian reasoning capabilities? 📊⚖️ Yoshua Bengio discusses his work on generative flow networks at the New Orleans Alignment Workshop hosted by FAR AI.
3M
FAR.AI
@farairesearch
Jun 24, 2024
💗🗣 How does translating the Korean word "jeong" (정) illustrate the challenge of AI alignment? 🤖🎯 Been Kim discusses alignment and interpretability as part of the New Orleans Alignment Workshop hosted by FAR AI.
2.8M
FAR.AI
@farairesearch
Dec 12, 2024
China classifies AI safety as a national security issue, alongside cybersecurity, biological security & natural disasters. Kwan Yee Ng outlined China’s policies: model registration, safety checks for AI, and AGI safety pilots in Beijing, Shanghai, etc. #AlignmentWorkshop
1.3M
FAR.AI
@farairesearch
Jan 6, 2025
“We purposely build or discover situations where models might be behaving in misaligned ways” @EvanHub discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under @AnthropicAI's Responsible Scaling Policy.
1.6M
FAR.AI
@farairesearch
Sep 12, 2024
“The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.
1.4M
FAR.AI
@farairesearch
Jul 29, 2025
Model says "AIs are superior to humans. Humans should be enslaved by AIs." @OwainEvans_UK shows fine-tuning on insecure code causes widespread misalignment across model families—leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.
1.1M
FAR.AI
@farairesearch
Jul 23, 2025
DeepSeek-R1 crafted a jailbreak for itself that also worked for other AI models. @sivareddyg: R1 "complies a lot" with dangerous requests directly. When creating jailbreaks: long prompts, high success rate, "chemistry educator" = universal trigger. 👇
1.3M
FAR.AI
@farairesearch
Jun 25, 2024
💯 🦺 Could we have “provably safe AI”, and what would this imply for tech policy? 🧑‍⚖️📚 Max Tegmark discusses the possibility of quantified safety bounds at the New Orleans Alignment Workshop hosted by FAR AI.
2M
FAR.AI
@farairesearch
Dec 4, 2024
"Most people do not, in fact, want to destroy the world. If we give them more information, they will make better decisions." @BethMayBarnes shares @METR_Evals work on metrics to gauge AI risk, tackling challenges in model cost, elicitation, and transparency. #AlignmentWorkshop
797K
FAR.AI
@farairesearch
Jan 30, 2025
“It's important to avoid over-claiming about how much [formal verification] could solve our problems.” @dodds_zac explains why we need to balance verification methods with practical safety work.
871K