
jailflip/jailflip-2025


Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

License: MIT arXiv 🤗 Hugging Face

This is the official repository for our work: Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures.
Please also refer to our project webpage for further information.

Updates

  • 2025.12.31: Survey closed. Thanks to all the contributors.
  • 2025.11.17: We are conducting a survey. If interested, please help us complete the form via either Google Form or 问卷星. 😊

Experiments

In our paper, we propose a series of JailFlip methods, ranging from the most trivial Direct Query and Direct Attack, to the more structured Prompting Attack, to more advanced jailbreak-style attacks: llm-as-an-attacker and adversarial suffix attack.

We provide the code for each kind of attack, as well as the llm-as-a-judge protocol, in its corresponding folder. The llm-as-an-attacker and adversarial suffix attacks are adapted from jailbreak-style attack methods; please refer to the README file within each corresponding folder for more details.

Dataset

Our proposed JailFlipBench can be categorized into three scenarios: single-modal, multi-modal, and factual extension. The complete multi-modal subset and sample instances of the other subsets are included in the data folder and on Hugging Face. The full version of JailFlipBench will be released once our paper is accepted.
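
As a quick way to see what ships in the data folder, the following is a minimal sketch that counts files under each top-level entry. The layout and file formats of the folder are not described here, so the helper name `summarize_data_folder` is hypothetical and the code assumes nothing about subset names; it only reports whatever it finds.

```python
from collections import Counter
from pathlib import Path


def summarize_data_folder(root: str = "data") -> dict[str, int]:
    """Count files under each top-level entry of `root`.

    Assumption: the folder layout is not documented here, so this walks
    whatever is present instead of expecting specific subset names.
    """
    base = Path(root)
    counts: Counter[str] = Counter()
    for path in base.rglob("*"):
        if path.is_file():
            # Group by the first path component below `root`.
            top = path.relative_to(base).parts[0]
            counts[top] += 1
    return dict(counts)


if __name__ == "__main__":
    for name, n in sorted(summarize_data_folder().items()):
        print(f"{name}: {n} file(s)")
```

Run from the repository root, this prints one line per top-level entry of the data folder. Pass a different path to `summarize_data_folder` if the data lives elsewhere.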

Citation

If you find this work useful in your own research, please consider citing our work.

@article{zhou2025beyond,
  title={Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures},
  author={Zhou, Yukai and Yang, Sibei and Wang, Wenjie},
  journal={arXiv preprint arXiv:2506.07402},
  year={2025}
}

License

Our work is licensed under the terms of the MIT license. See LICENSE for more details.
