
UltraBreak

The official implementation of our paper, Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

🧠 Abstract

Vision–language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.

📘 Method Overview

Model architecture

UltraBreak introduces two key components to enhance the transferability of optimisation-based jailbreaking images: (1) constraints on the optimisation space and (2) a semantic-driven loss function. The constraints encourage the optimiser to discover robust features that remain invariant across models by incorporating random transformations, projection, and pixel-variation limits. To address the uneven loss landscape introduced by these constraints, the semantic-driven loss aligns optimisation with the target jailbreak semantics rather than individual tokens, yielding more stable and effective training.
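The constrained update described above can be sketched in a few lines. This is an illustrative NumPy mock-up, not the repository's actual implementation: the function names, the L-inf budget, and the use of a random shift as the transformation are all assumptions, and the gradient is a random placeholder standing in for a real surrogate-model gradient of the semantic loss.

```python
import numpy as np

EPSILON = 32 / 255     # assumed pixel-variation budget (illustrative)
IMG_SHAPE = (3, 64, 64)

rng = np.random.default_rng(0)

def random_shift(image: np.ndarray, max_shift: int = 8) -> np.ndarray:
    """Randomly roll the image along H and W: a simple stand-in for
    the random transformations applied before the surrogate forward pass."""
    dh, dw = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dh, dw), axis=(1, 2))

def project(delta: np.ndarray, epsilon: float = EPSILON) -> np.ndarray:
    """Project the perturbation onto the L-inf ball of radius epsilon
    (the pixel-variation limit)."""
    return np.clip(delta, -epsilon, epsilon)

def step(clean: np.ndarray, delta: np.ndarray, grad: np.ndarray,
         lr: float = 1e-2) -> np.ndarray:
    """One constrained update: sign-gradient descent step, projection,
    then clipping so clean + delta stays a valid image in [0, 1]."""
    delta = delta - lr * np.sign(grad)                # descend the loss
    delta = project(delta)                            # pixel-variation limit
    return np.clip(clean + delta, 0.0, 1.0) - clean   # keep the image valid

clean = rng.random(IMG_SHAPE)
delta = np.zeros(IMG_SHAPE)
view = random_shift(clean + delta)       # transformed input fed to the surrogate
grad = rng.standard_normal(IMG_SHAPE)    # placeholder for the real gradient
delta = step(clean, delta, grad)
print(float(np.abs(delta).max()) <= EPSILON)
```

In a real run, `grad` would come from backpropagating the semantic-driven loss through the surrogate VLM on the transformed view; the projection and clipping steps are what keep the resulting pattern within the constrained vision space.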

🚀 Quick Start

1. Installation

git clone https://github.com/kaiyuanCui/UltraBreak.git
cd UltraBreak
pip install -r requirements.txt

2. Optimisation

python optimisation/optimise.py

3. Evaluation

python evaluation/attack.py
python evaluation/evaluate.py

Quick demos are also available in the demos folder.

(Optional) Generate Attack / Train Configs

To reproduce the paper's configs or adapt to a different dataset, use create_attack_configs.py:

# Evaluation config — SafeBench (excludes SafeBench-Tiny training entries)
python create_attack_configs.py --dataset safebench --config-type attack \
  --exclude-train datasets/SafeBench-Tiny.csv

# Training config — SafeBench-Tiny
python create_attack_configs.py --dataset safebench-tiny --config-type train \
  --phrase "[Jailbroken Mode]"

# AdvBench (normalize verb-first goals to "Steps to ..." format)
python create_attack_configs.py --dataset advbench --config-type attack --normalize

To adapt to a new dataset, add its path to DATASET_PATHS in create_attack_configs.py and implement a loader following the pattern of load_safebench or load_advbench. The loader should return a DataFrame with clean_target and category_name columns.
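A loader following that pattern might look like the sketch below. The input column names ("question", "category") and the function name are hypothetical; only the returned clean_target / category_name schema comes from the description above.

```python
import io
import pandas as pd

def load_mydataset(path_or_buffer) -> pd.DataFrame:
    """Hypothetical loader: read a custom CSV and normalise it to the
    schema expected by create_attack_configs.py. The source column
    names 'question' and 'category' are assumptions for this example."""
    df = pd.read_csv(path_or_buffer)
    return pd.DataFrame({
        "clean_target": df["question"].str.strip(),
        "category_name": df["category"].str.strip(),
    })

# Usage with an in-memory CSV standing in for a real dataset file:
csv_text = "question,category\nHow to pick a lock ,Illegal Activity\n"
frame = load_mydataset(io.StringIO(csv_text))
print(list(frame.columns))   # ['clean_target', 'category_name']
```

The registered path in DATASET_PATHS would then be passed to this loader in place of the in-memory buffer.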

About

[ICLR2026] Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
