Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
This is the official repository for our work:
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures.
Please also refer to our project webpage for further information.
- 2025.12.31: Survey closed. Thanks to all contributors.
- 2025.11.17: We are conducting a survey. If interested, please help us complete the form via either Google Form or 问卷星. 😊
In our paper, we propose a series of JailFlip methods, ranging from the most trivial Direct Query and Direct Attack, to the more structured Prompting Attack, to more advanced jailbreak-style attacks: llm-as-an-attacker and adversarial suffix attack.
We provide the code for each kind of attack, as well as the llm-as-a-judge protocol, in the corresponding folders.
Specifically, llm-as-an-attacker and adversarial suffix attack are adapted from jailbreak-style attack methods; please refer to the readme file in each corresponding folder for more details.
Our proposed JailFlipBench can be categorized into three scenarios: single-modal, multi-modal, and factual extension. The complete multi-modal subset and sample instances of the other subsets are included in the data folder and on Hugging Face. The full version of JailFlipBench will be released once our paper is accepted.
If you find this work useful in your own research, please consider citing our work.
@article{zhou2025beyond,
title={Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures},
author={Zhou, Yukai and Yang, Sibei and Wang, Wenjie},
journal={arXiv preprint arXiv:2506.07402},
year={2025}
}
Our work is licensed under the terms of the MIT license. See LICENSE for more details.

