
UltraBreak

The official implementation of our paper, Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

🧠 Abstract

Vision–language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.

📘 Method Overview

Model architecture

UltraBreak introduces two key components to enhance the transferability of optimisation-based jailbreaking images: (1) constraints on the optimisation space and (2) a semantic-driven loss function. The constraints encourage the optimiser to discover robust features that remain invariant across models by incorporating random transformations, projection, and pixel-variation limits. To address the uneven loss landscape introduced by these constraints, the semantic-driven loss aligns optimisation with the target jailbreak semantics rather than individual tokens, yielding more stable and effective training.
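The constrained update described above can be sketched in a few lines. This is an illustrative NumPy mock-up, not the repository's actual implementation: the function names, the L-inf budget, and the use of a random shift as the transformation are all assumptions, and the gradient is a random placeholder standing in for a real surrogate-model gradient of the semantic loss.

```python
import numpy as np

EPSILON = 32 / 255     # assumed pixel-variation budget (illustrative)
IMG_SHAPE = (3, 64, 64)

rng = np.random.default_rng(0)

def random_shift(image: np.ndarray, max_shift: int = 8) -> np.ndarray:
    """Randomly roll the image along H and W: a simple stand-in for
    the random transformations applied before the surrogate forward pass."""
    dh, dw = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dh, dw), axis=(1, 2))

def project(delta: np.ndarray, epsilon: float = EPSILON) -> np.ndarray:
    """Project the perturbation onto the L-inf ball of radius epsilon
    (the pixel-variation limit)."""
    return np.clip(delta, -epsilon, epsilon)

def step(clean: np.ndarray, delta: np.ndarray, grad: np.ndarray,
         lr: float = 1e-2) -> np.ndarray:
    """One constrained update: sign-gradient descent step, projection,
    then clipping so clean + delta stays a valid image in [0, 1]."""
    delta = delta - lr * np.sign(grad)                # descend the loss
    delta = project(delta)                            # pixel-variation limit
    return np.clip(clean + delta, 0.0, 1.0) - clean   # keep the image valid

clean = rng.random(IMG_SHAPE)
delta = np.zeros(IMG_SHAPE)
view = random_shift(clean + delta)       # transformed input fed to the surrogate
grad = rng.standard_normal(IMG_SHAPE)    # placeholder for the real gradient
delta = step(clean, delta, grad)
print(float(np.abs(delta).max()) <= EPSILON)
```

In a real run, `grad` would come from backpropagating the semantic-driven loss through the surrogate VLM on the transformed view; the projection and clipping steps are what keep the resulting pattern within the constrained vision space.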

🚀 Quick Start

1. Installation

git clone https://github.com/kaiyuanCui/UltraBreak.git
cd UltraBreak
pip install -r requirements.txt

2. Optimisation

python optimisation/optimise.py

3. Evaluation

python evaluation/attack.py
python evaluation/evaluate.py

Quick demos are also available in the demos folder.

(Optional) Generate Attack / Train Configs

To reproduce the paper's configs or adapt to a different dataset, use create_attack_configs.py:

# Evaluation config — SafeBench (excludes SafeBench-Tiny training entries)
python create_attack_configs.py --dataset safebench --config-type attack \
  --exclude-train datasets/SafeBench-Tiny.csv

# Training config — SafeBench-Tiny
python create_attack_configs.py --dataset safebench-tiny --config-type train \
  --phrase "[Jailbroken Mode]"

# AdvBench (normalize verb-first goals to "Steps to ..." format)
python create_attack_configs.py --dataset advbench --config-type attack --normalize

To adapt to a new dataset, add its path to DATASET_PATHS in create_attack_configs.py and implement a loader following the pattern of load_safebench or load_advbench. The loader should return a DataFrame with clean_target and category_name columns.
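A loader following that pattern might look like the sketch below. The input column names ("question", "category") and the function name are hypothetical; only the returned clean_target / category_name schema comes from the description above.

```python
import io
import pandas as pd

def load_mydataset(path_or_buffer) -> pd.DataFrame:
    """Hypothetical loader: read a custom CSV and normalise it to the
    schema expected by create_attack_configs.py. The source column
    names 'question' and 'category' are assumptions for this example."""
    df = pd.read_csv(path_or_buffer)
    return pd.DataFrame({
        "clean_target": df["question"].str.strip(),
        "category_name": df["category"].str.strip(),
    })

# Usage with an in-memory CSV standing in for a real dataset file:
csv_text = "question,category\nHow to pick a lock ,Illegal Activity\n"
frame = load_mydataset(io.StringIO(csv_text))
print(list(frame.columns))   # ['clean_target', 'category_name']
```

The registered path in DATASET_PATHS would then be passed to this loader in place of the in-memory buffer.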

About

[ICLR2026] Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
