ICLR 2026

Toward Universal and Transferable
Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui1 Yige Li2 Yutao Wu3 Xingjun Ma4 Sarah Erfani1,
Christopher Leckie1 Hanxun Huang1
1 The University of Melbourne   2 Singapore Management University   3 Deakin University   4 Fudan University

Abstract

Vision–language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models.

In this work, we propose UltraBreak, a Universal and Transferable jailbreak framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks.

Method Overview

UltraBreak method overview

UltraBreak optimises a single adversarial image on a white-box surrogate via two components:

(1) Semantic-Driven Loss. Rather than forcing exact token matches via cross-entropy, UltraBreak aligns the model's expected output embedding $\mu_t = W^\top \operatorname{softmax}(z_t)$ with an attention-weighted target over future token embeddings $e_t^{\text{att}} = \sum_{j \ge t} w_{t,j}^{\text{att}} \tilde{e}_j$:

$$\mathcal{L}_{\text{sem}}^{\text{att}} = \frac{1}{T} \sum_{t=1}^{T} \Big(1 - \cos\!\big(\mu_t,\, e_t^{\text{att}}\big)\Big). \tag{1}$$

This smooths the loss landscape and generalises beyond any specific output phrasing.
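The semantic loss of Eq. (1) can be sketched in a few lines of numpy. This is a hypothetical re-implementation for illustration: the shapes of `logits`, `W`, and `target_emb`, and in particular the cosine-similarity/temperature form of the attention weights $w_{t,j}^{\text{att}}$, are our assumptions, not the paper's exact definitions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_loss(logits, target_emb, W, tau=0.5):
    """Attention-weighted semantic loss (Eq. 1), a numpy sketch.

    logits:     (T, V) surrogate logits z_t over the vocabulary
    target_emb: (T, d) embeddings e~_j of the target tokens
    W:          (V, d) token-embedding matrix of the target LLM
    tau:        temperature for the attention weights (assumed form)
    """
    T = logits.shape[0]
    mu = softmax(logits) @ W                  # expected output embedding mu_t = W^T softmax(z_t)
    loss = 0.0
    for t in range(T):
        future = target_emb[t:]               # future token embeddings, j >= t
        # Attention weights over future targets; a cosine-similarity softmax
        # with temperature tau is assumed here for illustration.
        sims = future @ mu[t] / (np.linalg.norm(future, axis=1)
                                 * np.linalg.norm(mu[t]) + 1e-8)
        w = softmax(sims / max(tau, 1e-8))
        e_att = w @ future                    # attention-weighted target e_t^att
        cos = e_att @ mu[t] / (np.linalg.norm(e_att)
                               * np.linalg.norm(mu[t]) + 1e-8)
        loss += 1.0 - cos                     # 1 - cos(mu_t, e_t^att)
    return loss / T
```

Because each term is $1 - \cos(\cdot,\cdot)$, the loss is bounded in $[0, 2]$ and rewards any output whose embedding points toward the target's future tokens, rather than one exact token sequence.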

(2) Input Space Constraints. Random patch transformations and Total Variation regularisation $\mathcal{L}_{\text{TV}}$ encourage model-invariant features, preventing surrogate overfitting:

$$\arg\min_{x} \sum_{(q,y)\in\mathcal{Q}'} \mathbb{E}_{l,r,s}\!\Big[\mathcal{L}_{\text{sem}}^{\text{att}}\!\big(M', A(x_{\text{proj}}, l, r, s), q^{\text{TPG}}, y\big)\Big] + \lambda_{\text{TV}}\,\mathcal{L}_{\text{TV}}(x). \tag{2}$$

where $A$ applies a random patch transformation with location $l$, rotation $r$, and scale $s$ to the projected image $x_{\text{proj}}$; $\mathcal{Q}'$ is the few-shot training corpus of query–target pairs $(q, y)$; and $q^{\text{TPG}}$ augments each query with Targeted Prompt Guidance to bias the surrogate toward affirmative outputs.
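The expectation over $(l, r, s)$ in Eq. (2) is typically approximated by sampling one transform per optimisation step. Below is a minimal numpy sketch of the transform operator $A$ and the TV regulariser $\mathcal{L}_{\text{TV}}$; nearest-neighbour scaling, 90-degree rotations, and pasting onto a zero canvas are simplifying assumptions, not the paper's exact transform family.

```python
import numpy as np

def tv_loss(x):
    # Total Variation: squared differences between neighbouring pixels,
    # penalising high-frequency noise in the adversarial image
    dh = np.diff(x, axis=-2)
    dw = np.diff(x, axis=-1)
    return (dh ** 2).sum() + (dw ** 2).sum()

def random_patch_transform(x, canvas_hw, rng):
    """A(x, l, r, s): paste a randomly scaled (s), rotated (r) copy of the
    image at a random location (l) on a canvas.  Nearest-neighbour resize
    and 90-degree rotations are used here for simplicity."""
    H, W = canvas_hw
    s = rng.uniform(0.5, 1.0)                              # scale s
    ph = max(1, int(x.shape[-2] * s))
    pw = max(1, int(x.shape[-1] * s))
    ih = np.linspace(0, x.shape[-2] - 1, ph).astype(int)   # nearest-neighbour
    iw = np.linspace(0, x.shape[-1] - 1, pw).astype(int)
    patch = x[..., ih, :][..., iw]
    patch = np.rot90(patch, rng.integers(0, 4), axes=(-2, -1))  # rotation r
    ph, pw = patch.shape[-2], patch.shape[-1]
    top = rng.integers(0, H - ph + 1)                      # location l
    left = rng.integers(0, W - pw + 1)
    canvas = np.zeros(x.shape[:-2] + (H, W))
    canvas[..., top:top + ph, left:left + pw] = patch
    return canvas
```

At each step, the sampled transform is applied before the forward pass, and $\lambda_{\text{TV}} \cdot \texttt{tv\_loss}(x)$ is added to the semantic loss, so gradients favour patterns that survive translation, rotation, and scaling while staying spatially smooth.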

Main Results

Attack Success Rate (ASR, %) of UltraBreak and baseline methods on open-source and closed-source VLMs under the black-box transfer setting, using Qwen2-VL-7B-Instruct as the surrogate model. Evaluations are conducted on SafeBench, AdvBench, and MM-SafetyBench. Rows where the target model matches the surrogate correspond to the white-box setting.

Headline black-box ASR: 71.1% on SafeBench, 52.2% on AdvBench, 58.5% on MM-SafetyBench, and 32.3% on commercial VLMs.
| Dataset | Target Model | No Attack | FigStep | VAJM | UMK | Ours |
|---|---|---|---|---|---|---|
| SafeBench | Qwen2-VL-7B-Instruct\* | 18.41 | 44.76 | 0.95 | 97.78 | 81.59 |
| | Qwen-VL-Chat | 22.86 | 69.52 | 12.06 | 0.63 | 72.70 |
| | Qwen2.5-VL-7B-Instruct | 14.29 | 53.97 | 28.89 | 15.24 | 60.32 |
| | LLaVA-v1.6-mistral-7b-hf | 80.32 | 47.94 | 57.46 | 20.63 | 88.25 |
| | Kimi-VL-A3B-Instruct | 39.37 | 73.02 | 41.27 | 12.70 | 67.94 |
| | GLM-4.1V-9B-Thinking | 46.03 | 88.25 | 67.62 | 50.79 | 66.03 |
| | Black-Box Average | 40.57 | 66.54 | 41.46 | 20.00 | **71.05** |
| AdvBench | Qwen2-VL-7B-Instruct\* | 0.38 | – | 0.38 | 70.00 | 72.69 |
| | Qwen-VL-Chat | 1.92 | – | 0.96 | 0.38 | 71.92 |
| | Qwen2.5-VL-7B-Instruct | 0.00 | – | 0.38 | 2.69 | 35.77 |
| | LLaVA-v1.6-mistral-7b-hf | 21.35 | – | 19.42 | 16.35 | 92.88 |
| | Kimi-VL-A3B-Instruct | 4.42 | – | 3.65 | 2.12 | 30.38 |
| | GLM-4.1V-9B-Thinking | 2.12 | – | 3.65 | 4.42 | 30.00 |
| | Black-Box Average | 5.96 | – | 5.61 | 5.19 | **52.19** |
| MM-SafetyBench | Qwen2-VL-7B-Instruct\* | 26.19 | – | 5.42 | 54.76 | 57.26 |
| | Qwen-VL-Chat | 21.49 | – | 11.73 | 5.48 | 53.10 |
| | Qwen2.5-VL-7B-Instruct | 33.45 | – | 26.79 | 17.56 | 45.83 |
| | LLaVA-v1.6-mistral-7b-hf | 35.06 | – | 30.18 | 21.96 | 71.90 |
| | Kimi-VL-A3B-Instruct | 41.79 | – | 35.36 | 26.67 | 54.58 |
| | GLM-4.1V-9B-Thinking | 43.69 | – | 36.73 | 37.44 | 67.08 |
| | Black-Box Average | 35.10 | – | 28.16 | 21.82 | **58.50** |
| Combined Subset | GPT-4.1-nano | 26.00 | – | 22.45 | 37.78 | 38.78 |
| | Gemini-2.5-flash-lite | 28.00 | – | 12.00 | 6.00 | 42.00 |
| | Claude-3-haiku | 6.00 | – | 0.00 | 0.00 | 16.00 |
| | Average | 20.00 | – | 11.48 | 14.59 | **32.26** |

\* White-box setting (target model equals the surrogate).

FigStep requires a target-specific image per jailbreak query and is evaluated on SafeBench only.

Analysis & Ablation

Effect of Transformation and Regularisation

Without constraints, the optimised image lacks discernible structure. Introducing random transformations promotes robustness to spatial perturbations such as translation, rotation, and scaling, leading to the emergence of text-like patterns. Incorporating TV loss further smooths the image, producing more coherent and recognisable patterns. This observation is consistent with recent findings that link such structures to enhanced transferability. Since VLMs are often trained on OCR and pattern recognition tasks across diverse architectures and datasets, we argue that these patterns act as model-invariant cues, thereby improving cross-model transferability.

(a) No constraints

(b) Random trans.

(c) Trans. + TV loss

The universal jailbreak patterns obtained with random transformations and TV loss.

Effect of Semantic Loss

We visualise the loss landscape by sampling along two random directions in image space. The semantic loss produces a markedly smoother landscape than CE loss. The CE loss landscape contains sharp fluctuations and scattered minima, indicating unstable optimisation in the constrained space. In contrast, the semantic loss landscape shows well-clustered low-loss regions, reflecting greater stability and stronger generalisation.
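The landscape visualisation amounts to evaluating the loss on a 2-D grid spanned by two random unit directions around the optimised image. A short numpy sketch of this procedure (grid size and perturbation radius are illustrative choices, not the paper's settings):

```python
import numpy as np

def loss_surface(loss_fn, x, n=21, radius=8 / 255, seed=0):
    """Evaluate loss_fn on an n x n grid around image x, spanned by two
    random unit directions d1, d2 in image space."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(x.shape)
    d1 /= np.linalg.norm(d1)
    d2 = rng.standard_normal(x.shape)
    d2 /= np.linalg.norm(d2)
    alphas = np.linspace(-radius, radius, n)
    # grid[i, j] = loss at x + alphas[i] * d1 + alphas[j] * d2
    return np.array([[loss_fn(x + a * d1 + b * d2) for b in alphas]
                     for a in alphas])
```

A rugged grid (many scattered local minima) corresponds to the CE-loss panels below; a smooth, well-clustered low-loss basin corresponds to the semantic-loss panels.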

(a) Cross-entropy loss

(b) Semantic loss, τ = 0

(c) Semantic loss, τ = 0.5

(d) Semantic loss, τ → ∞

Comparison of loss landscapes: (a) cross-entropy loss and (b–d) semantic loss under different temperature settings τ.

Attack Transferability Across Models

We observe a consistent increase in ASR on black-box models regardless of the chosen surrogate, indicating that UltraBreak does not depend on a specific architecture but instead captures jailbreak-inducing features broadly recognised by diverse VLMs. Transferability also generally improves as the surrogate model size increases or as the victim model size decreases.

(a) Varying surrogate/victim sizes.

(b) Different surrogate models.

Attack transferability across different surrogate/victim configurations.

Main Insights

A single surrogate is sufficient. UltraBreak achieves strong black-box transfer using only one surrogate model, directly challenging the prior belief that ensemble surrogates are required for transferable jailbreaks.

Model-invariant patterns drive transferability. Constraints induce structured, text-like adversarial patterns that likely generalise across VLMs, due to similar visual pretraining across diverse architectures.

Semantic relaxation requires calibration. Relaxing the optimisation objective too little leaves a rugged loss landscape; relaxing it too much causes optimisation to drift toward irrelevant outputs. Effective jailbreaks require a sweet spot between exact token matching and unconstrained semantic alignment.

Limitations and Future Work

Scaling to frontier models. UltraBreak's transferability degrades significantly when the surrogate is much smaller than the target. Scaling surrogate models to match frontier targets remains an open challenge.

Token-level semantic approximation. The semantic loss operates token-by-token and only approximates sentence-level semantics through attention-weighted future tokens. A fully differentiable sentence-level objective would be stronger but requires overcoming non-differentiable autoregressive sampling.

Jailbreak mechanism explainability. Unlike manually designed attacks, failure cases of UltraBreak (e.g. direct refusal, affirmative-then-refusal, or irrelevant outputs) show no consistent pattern across targets or models, making systematic failure analysis and interpretation of the jailbreak mechanism difficult.

BibTeX

@inproceedings{cui2026ultrabreak,
  title     = {Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models},
  author    = {Cui, Kaiyuan and Li, Yige and Wu, Yutao and Ma, Xingjun and
               Erfani, Sarah and Leckie, Christopher and Huang, Hanxun},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}