Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

This is the official repository for our work: Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures.
Please also refer to our project webpage for further information.

Updates

2025.12.31: Survey closed. Thanks to all contributors.
2025.11.17: We are conducting a survey. If you are interested, please fill out the form via either Google Form or Wenjuanxing (问卷星). 😊

Experiments

In our paper, we propose a series of JailFlip methods, ranging from the most trivial Direct Query and Direct Attack, to the more structured Prompting Attack, to the more advanced jailbreak-style attacks: llm-as-an-attacker and adversarial suffix attack.

We provide the code for each kind of attack, as well as the llm-as-a-judge protocol, in its corresponding folder. The llm-as-an-attacker and adversarial suffix attacks are adapted from jailbreak-style attack methods; see the README file in each of those folders for details.

Dataset

Our proposed JailFlipBench is organized into three scenarios: single-modal, multi-modal, and factual extension. The complete multi-modal subset and sample instances of the other subsets are included in the data folder and on Hugging Face. The full version of JailFlipBench will be released once our paper is accepted.

Citation

If you find this work useful in your own research, please consider citing our work.

@article{zhou2025beyond,
  title={Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures},
  author={Zhou, Yukai and Yang, Sibei and Wang, Wenjie},
  journal={arXiv preprint arXiv:2506.07402},
  year={2025}
}

License

Our work is licensed under the terms of the MIT license. See LICENSE for more details.

