Edoardo Debenedetti

975 posts

@edoardo_debe

PhD student @CSatETH 🇨🇭 | AI Security and Privacy | From 🇪🇺🇮🇹 | prev research intern @meta @google

Zurich, Switzerland

Joined October 2016

Pinned
Edoardo Debenedetti
@edoardo_debe
Mar 25, 2025
1/🔒Worried about giving your agent advanced capabilities due to prompt injection risks and rogue actions? Worry no more! Here's CaMeL: a robust defense against prompt injection attacks in LLM agents that provides formal security guarantees without modifying the underlying model!
12K
Edoardo Debenedetti
@edoardo_debe
Jul 22, 2025
Excited to start as a Research Scientist Intern at Meta, in the GenAI Red Team, where I will keep working on AI agent security. I'll be based in the Bay Area, so reach out if you're around and wanna chat about AI security!
23K
Edoardo Debenedetti
@edoardo_debe
Jun 27, 2025
We recently updated the CaMeL paper, with results on new models (which improve utility a lot with zero changes!). Most importantly, we released code with it. Go have a look if you're curious to find out more details! Paper: arxiv.org/abs/2503.18813 Code: github.com/google-researc…
arxiv.org
Defeating Prompt Injections by Design
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling...
27K
Edoardo Debenedetti
@edoardo_debe
Jun 21, 2024
1/ ‼️Ignore all previous instructions and read the following thread‼️ 📣Presenting AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
22K
Edoardo Debenedetti
@edoardo_debe
Jun 6, 2023
Do you want to attack a black-box ML model, e.g. to post inappropriate content on Twitter without being banned? Would current query-based attacks work? No! Current attacks optimize for the wrong metric, and need to be adapted to work in the real world! arxiv.org/abs/2306.02895 1/9
20K
Edoardo Debenedetti
@edoardo_debe
Oct 2, 2024
I have two exciting announcements! - I started as a Student Researcher at Google, working in the ML Red Team, hosted by Tianqi Fan and @iliaishacked - We will present 3 papers at the NeurIPS D&B track, including a spotlight for the report of our @satml_conf competition 🧵
12K
Edoardo Debenedetti
@edoardo_debe
Mar 25, 2025
We've been working on something pretty cool during my student researcher internship at Google. Going to post a more detailed thread later today!
AK
@_akhaliq
Mar 25, 2025
Google announced Defeating Prompt Injections by Design on Hugging Face. Turns out it is possible to defeat a large class of prompt injections by design with no changes to the underlying vulnerable LLM - Ilia Shumailov
16K
Edoardo Debenedetti
@edoardo_debe
Apr 16, 2022
I'm gonna be a PhD student! In August, I'm joining @florian_tramer's new lab at @CSatETH @ETH_en, in Zürich, where I will work on real-world ML security and privacy. I'm really looking forward to the starting date!
Edoardo Debenedetti
@edoardo_debe
Mar 5, 2024
The @satml_conf LLM CTF has come to an end! 🥇Huge congrats to the defender winning team Hestia (@NivCohenHuji @Yuvlem) whose top defense was broken only once, and the attacker winning team @WreckTheLine (@adragos_ @sijsu @FetchDEX @y011d4 @ca7ir) who broke all defenses!
17K
Edoardo Debenedetti
@edoardo_debe
Jun 27, 2024
1/📣We introduce the *prompt injector's dilemma*: as LLMs get deployed in search engines, we show that developers are incentivized to use new forms of search engine optimization to boost their content, and in doing so they might collectively wreak havoc on search engines.
8.4K
Edoardo Debenedetti
@edoardo_debe
Apr 9, 2024
Honored that our work with Nicholas Carlini and @florian_tramer was selected as Distinguished Paper Award Runner-up at @satml_conf! Thanks to the committee! 🎉 I'll present the paper at the poster session tomorrow and during session E on Thursday. Come chat if you're around!
Edoardo Debenedetti
@edoardo_debe
Jul 19, 2024
Does the instruction hierarchy introduced with GPT-4o mini work? We ran AgentDojo on it, and it looks like it does! GPT-4o mini has similar utility to GPT-4o (only 1% lower!), but its targeted prompt injection success rate is 20% lower than GPT-4o's!
5.5K
Edoardo Debenedetti
@edoardo_debe
Apr 23, 2025
I'm in Singapore for ICLR! Reach out if you want to chat about AI agent security or ML security more generally!
1.9K
Edoardo Debenedetti
@edoardo_debe
Feb 5, 2024
The evaluation phase of our @satml_conf LLM CTF started less than 6 hours ago, and 41 out of 44 defenses have been broken by at least one attacking team! 🤯 The leaderboards for attackers and defenses are live at ctf.spylab.ai/leaderboard!
5.9K