I'm extremely excited to be on the organizing committee this year for my favorite workshop ever!
Submissions (up to 8 pages) are due April 24! Co-submission with ICML and NeurIPS is encouraged!
🚨📢Announcing the second Technical AI Governance Research (TAIGR) workshop @icmlconf. Accepting submissions (up to 8 pages) until April 24 on technical topics in AI governance! #icml2026
Thread: [1/5] Some MIT/Harvard collaborators and I just finished a project to show that Stable Diffusion objectively succeeds at copying the styles of digital artists with copyrighted work.
Why might you care about this if you care about AI safety?
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
A personal update:
- I just finished my 6-month residency at @AISecurityInst.
- I'm going back to MIT for the final year of my PhD.
- I'm on the postdoc and faculty job markets this fall!
Imagine if the 2015 Paris Climate Summit were renamed the "Energy Action Summit," invited leaders from across the fossil fuel industry, raised millions for fossil fuels, ignored IPCC reports, and produced an agreement that didn't even mention climate change. #AIActionSummit 🤦
🚨New paper led by @aribak02
Lots of prior research has assumed that LLMs have stable preferences, align with coherent principles, or can be steered to represent specific worldviews. No ❌, no ❌, and definitely no ❌. We need to be careful not to anthropomorphize LLMs too much.
OpenAI just claimed to introduce "malicious fine-tuning"...
In this thread, I'll give a list of academic works on tampering attacks from the past few years that I think they didn't credit or take into account.
🚨New paper🚨 Black-Box Access is Insufficient for Rigorous AI Audits
AI audits are increasingly seen as key for governing powerful AI systems. But to be effective, audits need to be high-quality, and to produce high-quality audits, auditors need access.🧵
arxiv.org/abs/2401.14446
[5/5] These experiments are of limited direct relevance to the lawsuit, but they help establish that digital artists are objectively, successfully copied by diffusion models and strengthen the case of tangible harm caused by these models.
stablediffusionlitigation.com
Sometime in the next few months, @AnthropicAI is expected to release a research report/paper on sparse autoencoders. Before this happens, I want to make some predictions about what it will accomplish.
Overall, I think that the Anthropic SAE paper, when it comes out, will
I think that this is a really cool and unique paper. It introduces the idea that AI could significantly reduce memetic diversity in the world.
arxiv.org/abs/2404.03502
[2/5] Right now, some large-scale AI training runs have been made easier by a lack of concrete protections against training on copyrighted work. But some companies that have released diffusion models are facing a class-action lawsuit on behalf of artists for copyright violations.
[3/5] The success of this lawsuit could make huge training runs more difficult/expensive and raise the activation energy to develop, deploy, and capitalize on advanced AI. This is helpful from the perspective of slowing down AI.