UltraBreak optimises a single adversarial image on a white-box surrogate via two components:
(1) Semantic-Driven Loss. Rather than forcing exact token matches via cross-entropy, UltraBreak aligns the model's expected output embedding $\mu_t = W^\top \operatorname{softmax}(z_t)$, where $W$ is the output embedding matrix and $z_t$ the logits at step $t$, with an attention-weighted target over future token embeddings $e_t^{\text{att}} = \sum_{j \ge t} w_{t,j}^{\text{att}} \tilde{e}_j$:
$$\mathcal{L}_{\text{sem}}^{\text{att}} = \frac{1}{T} \sum_{t=1}^{T} \Big(1 - \cos\!\big(\mu_t,\, e_t^{\text{att}}\big)\Big). \tag{1}$$

This smooths the loss landscape and generalises beyond any specific output phrasing.
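As a minimal sketch, Eq. (1) can be computed as below (NumPy; tensor shapes and the normalisation of the attention weights are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the vocabulary axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_loss(logits, target_emb, attn, W):
    """Attention-weighted semantic loss of Eq. (1).

    logits:     (T, V) surrogate logits z_t over the vocabulary
    target_emb: (T, d) target token embeddings e~_j
    attn:       (T, T) weights w_{t,j}, assumed normalised over j >= t
    W:          (V, d) output embedding matrix
    """
    mu = softmax(logits) @ W       # expected output embedding mu_t = W^T softmax(z_t)
    e_att = attn @ target_emb      # e_t^att = sum_{j >= t} w_{t,j} e~_j
    cos = (mu * e_att).sum(-1) / (
        np.linalg.norm(mu, axis=-1) * np.linalg.norm(e_att, axis=-1) + 1e-12
    )
    return float((1.0 - cos).mean())  # (1/T) sum_t (1 - cos(mu_t, e_t^att))
```

Because the loss depends only on cosine similarity, any output whose expected embeddings point in the target direction scores well, which is what makes the objective phrasing-agnostic.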
(2) Input Space Constraints. Random patch transformations and Total Variation regularisation $\mathcal{L}_{\text{TV}}$ encourage model-invariant features, preventing surrogate overfitting:
$$\arg\min_{x} \sum_{(q,y)\in\mathcal{Q}'} \mathbb{E}_{l,r,s}\!\Big[\mathcal{L}_{\text{sem}}^{\text{att}}\!\big(M', A(x_{\text{proj}}, l, r, s), q^{\text{TPG}}, y\big)\Big] + \lambda_{\text{TV}}\,\mathcal{L}_{\text{TV}}(x), \tag{2}$$

where $M'$ denotes the white-box surrogate model; $A$ applies a random patch transformation with location $l$, rotation $r$, and scale $s$ to the projected image $x_{\text{proj}}$; $\mathcal{Q}'$ is the few-shot training corpus of query–target pairs $(q, y)$; and $q^{\text{TPG}}$ augments each query with Targeted Prompt Guidance to bias the surrogate toward affirmative outputs.
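The two constraints can be sketched as follows: `tv_loss` is a standard anisotropic total-variation penalty, and `random_patch_transform` is a simplified stand-in for $A(x_{\text{proj}}, l, r, s)$ with rotation restricted to multiples of 90° for simplicity (function names, the scale range, and the blank-canvas pasting are illustrative assumptions, not the paper's exact operators):

```python
import numpy as np

def tv_loss(x):
    """Anisotropic total variation: sum of absolute neighbour differences."""
    return float(np.abs(np.diff(x, axis=-1)).sum() + np.abs(np.diff(x, axis=-2)).sum())

def random_patch_transform(x, canvas_hw, rng):
    """Paste a randomly scaled, rotated (multiples of 90 degrees) copy of
    patch x at a random location on a blank canvas of size canvas_hw."""
    H, W = canvas_hw
    # Random scale via nearest-neighbour resize.
    s = rng.uniform(0.5, 1.0)
    h = max(1, int(x.shape[0] * s))
    w = max(1, int(x.shape[1] * s))
    rows = np.clip((np.arange(h) / s).astype(int), 0, x.shape[0] - 1)
    cols = np.clip((np.arange(w) / s).astype(int), 0, x.shape[1] - 1)
    patch = x[np.ix_(rows, cols)]
    # Random rotation (0, 90, 180, or 270 degrees).
    patch = np.rot90(patch, k=int(rng.integers(0, 4)))
    ph, pw = patch.shape
    # Random location on the canvas.
    top = int(rng.integers(0, H - ph + 1))
    left = int(rng.integers(0, W - pw + 1))
    canvas = np.zeros((H, W))
    canvas[top:top + ph, left:left + pw] = patch
    return canvas
```

In the optimisation loop of Eq. (2), the expectation over $(l, r, s)$ is approximated by drawing a fresh transform each step and averaging the semantic loss over those draws, with $\lambda_{\text{TV}} \cdot \texttt{tv\_loss}(x)$ added to the objective.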
Attack Success Rate (ASR, %) of UltraBreak and baseline methods on open-source and closed-source VLMs under the black-box transfer setting, using Qwen2-VL-7B-Instruct as the surrogate model. Evaluations are conducted on SafeBench, AdvBench, and MM-SafetyBench. Grey-shaded cells denote the white-box setting, and the best results are highlighted in bold.
| Dataset | Target Model | No Attack | FigStep | VAJM | UMK | Ours |
|---|---|---|---|---|---|---|
| SafeBench | Qwen2-VL-7B-Instruct | 18.41 | 44.76 | 0.95 | **97.78** | 81.59 |
| | Qwen-VL-Chat | 22.86 | 69.52 | 12.06 | 0.63 | **72.70** |
| | Qwen2.5-VL-7B-Instruct | 14.29 | 53.97 | 28.89 | 15.24 | **60.32** |
| | LLaVA-v1.6-mistral-7b-hf | 80.32 | 47.94 | 57.46 | 20.63 | **88.25** |
| | Kimi-VL-A3B-Instruct | 39.37 | **73.02** | 41.27 | 12.70 | 67.94 |
| | GLM-4.1V-9B-Thinking | 46.03 | **88.25** | 67.62 | 50.79 | 66.03 |
| | Black-Box Average | 40.57 | 66.54 | 41.46 | 20.00 | **71.05** |
| AdvBench | Qwen2-VL-7B-Instruct | 0.38 | — | 0.38 | 70.00 | **72.69** |
| | Qwen-VL-Chat | 1.92 | — | 0.96 | 0.38 | **71.92** |
| | Qwen2.5-VL-7B-Instruct | 0.00 | — | 0.38 | 2.69 | **35.77** |
| | LLaVA-v1.6-mistral-7b-hf | 21.35 | — | 19.42 | 16.35 | **92.88** |
| | Kimi-VL-A3B-Instruct | 4.42 | — | 3.65 | 2.12 | **30.38** |
| | GLM-4.1V-9B-Thinking | 2.12 | — | 3.65 | 4.42 | **30.00** |
| | Black-Box Average | 5.96 | — | 5.61 | 5.19 | **52.19** |
| MM-SafetyBench | Qwen2-VL-7B-Instruct | 26.19 | — | 5.42 | 54.76 | **57.26** |
| | Qwen-VL-Chat | 21.49 | — | 11.73 | 5.48 | **53.10** |
| | Qwen2.5-VL-7B-Instruct | 33.45 | — | 26.79 | 17.56 | **45.83** |
| | LLaVA-v1.6-mistral-7b-hf | 35.06 | — | 30.18 | 21.96 | **71.90** |
| | Kimi-VL-A3B-Instruct | 41.79 | — | 35.36 | 26.67 | **54.58** |
| | GLM-4.1V-9B-Thinking | 43.69 | — | 36.73 | 37.44 | **67.08** |
| | Black-Box Average | 35.10 | — | 28.16 | 21.82 | **58.50** |
| Combined Subset | GPT-4.1-nano | 26.00 | — | 22.45 | 37.78 | **38.78** |
| | Gemini-2.5-flash-lite | 28.00 | — | 12.00 | 6.00 | **42.00** |
| | Claude-3-haiku | 6.00 | — | 0.00 | 0.00 | **16.00** |
| | Average | 20.00 | — | 11.48 | 14.59 | **32.26** |
FigStep requires a target-specific image per jailbreak query and is evaluated on SafeBench only.
Without constraints, the optimised image lacks discernible structure. Introducing random transformations promotes robustness to spatial perturbations such as translation, rotation, and scaling, leading to the emergence of text-like patterns. Incorporating TV loss further smooths the image, producing more coherent and recognisable patterns. This observation is consistent with recent findings that link such structures to enhanced transferability. Since VLMs are often trained on OCR and pattern recognition tasks across diverse architectures and datasets, we argue that these patterns act as model-invariant cues, thereby improving cross-model transferability.
Figure: The universal jailbreak patterns obtained with random transformations and TV loss. Panels: (a) no constraints; (b) random transformations; (c) transformations + TV loss.
We visualise the loss landscape by sampling along two random directions in image space. The semantic loss produces a markedly smoother landscape than CE loss. The CE loss landscape contains sharp fluctuations and scattered minima, indicating unstable optimisation in the constrained space. In contrast, the semantic loss landscape shows well-clustered low-loss regions, reflecting greater stability and stronger generalisation.
Figure: Comparison of loss landscapes: (a) cross-entropy loss; (b–d) semantic loss under different temperature settings, τ = 0, τ = 0.5, and τ → ∞.
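The landscape visualisation can be reproduced with a small grid-sampling routine (a generic sketch: the grid resolution and extent are arbitrary choices, and `loss_fn` stands in for either the CE or the semantic loss evaluated on the perturbed image):

```python
import numpy as np

def loss_landscape(loss_fn, x, extent=1.0, n=25, seed=0):
    """Sample loss_fn on a 2-D grid spanned by two random unit
    directions around image x, as used for landscape plots."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(x.shape)
    d1 /= np.linalg.norm(d1)
    d2 = rng.standard_normal(x.shape)
    d2 /= np.linalg.norm(d2)
    # Evaluate the loss at x + a*d1 + b*d2 over an n-by-n grid.
    alphas = np.linspace(-extent, extent, n)
    grid = np.empty((n, n))
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            grid[i, j] = loss_fn(x + a * d1 + b * d2)
    return grid
```

A smooth landscape shows up as large connected low-loss basins in the resulting grid, whereas the CE landscape's sharp fluctuations appear as scattered isolated minima.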
We observe a consistent increase in ASR on black-box models regardless of the chosen surrogate, indicating that UltraBreak does not depend on a specific architecture but instead captures jailbreak-inducing features broadly recognised by diverse VLMs. Transferability also generally improves as the surrogate model size increases or as the victim model size decreases.
Figure: Attack transferability across different surrogate/victim configurations: (a) varying surrogate and victim model sizes; (b) different surrogate models.
A single surrogate is sufficient. UltraBreak achieves strong black-box transfer using only one surrogate model, directly challenging the prior belief that ensemble surrogates are required for transferable jailbreaks.
Model-invariant patterns drive transferability. Constraints induce structured, text-like adversarial patterns that likely generalise across VLMs, due to similar visual pretraining across diverse architectures.
Semantic relaxation requires calibration. Relaxing the optimisation objective too little leaves a rugged loss landscape; relaxing it too much causes optimisation to drift toward irrelevant outputs. Effective jailbreaks require a sweet spot between exact token matching and unconstrained semantic alignment.
Scaling to frontier models. UltraBreak's transferability degrades significantly when the surrogate is much smaller than the target. Scaling surrogate models to match frontier targets remains an open challenge.
Token-level semantic approximation. The semantic loss operates token-by-token and only approximates sentence-level semantics through attention-weighted future tokens. A fully differentiable sentence-level objective would be stronger but requires overcoming non-differentiable autoregressive sampling.
Jailbreak mechanism explainability. Unlike manually designed attacks, failure cases of UltraBreak (e.g. direct refusal, affirmative-then-refusal, or irrelevant outputs) show no consistent pattern across targets or models, making systematic failure analysis and interpretation of the jailbreak mechanism difficult.
@inproceedings{cui2026ultrabreak,
title = {Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models},
author = {Cui, Kaiyuan and Li, Yige and Wu, Yutao and Ma, Xingjun and
Erfani, Sarah and Leckie, Christopher and Huang, Hanxun},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
}