ML Safety Newsletter
MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking
AI Wellbeing
Apr 28 • Alice Blair and Dan Hendrycks
MLSN #19: Honesty, Disempowerment, & Cybersecurity
Also, a new AI safety fellowship for experienced researchers
Mar 12 • Alice Blair and Dan Hendrycks
MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization
Diffusion LLMs for Adversarial Attack Generation
Jan 20 • Alice Blair and Dan Hendrycks
MLSN #17: Measuring General AI Abilities and Mitigating Deception
Measuring General AI Abilities
Nov 19, 2025 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #16
Automated Forecasting, Pretraining Data Filtering, and Security Red Teaming
Sep 12, 2025 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #15
Risks in Agentic Computer Use, Goal Drift, Shutdown Resistance, and Critiques of Scheming Research
Aug 18, 2025 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #14
Resisting Prompt Injection, Evaluating Cyberattack Capabilities, and SafeBench Winners
May 7, 2025 • Alice Blair and Dan Hendrycks
ML Safety Newsletter #13
Chain-of-Thought Monitoring, Distinguishing Honesty from Accuracy, and Emergent Misalignment
Apr 2, 2025 • Julius Simonelli and Dan Hendrycks
ML Safety Research News