MMSafe-PO

Towards Harmless Multimodal Assistants with Preference Optimization

Brief Introduction

This repository contains the code and data of MMSafe-PO.

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Because preference optimization is effective at aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach.

Dataset

We present the MMSafe-PO dataset, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. The dataset can be downloaded from the datasets folder.

Ckpts

All of our checkpoints are available in MMSafe_checkpoints.

BPO Training

Prepare training environment

First, download the LLaVA-v1.5 code. Then download our code and place our code folder inside the llava folder of LLaVA-v1.5.

# Download LLaVA-v1.5 and enter the repository
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# Creating conda environment
conda create -n mmsafe python=3.10 -y
conda activate mmsafe
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install chardet

# Installing dependencies
pip install -e . && pip install datasets tensorboard deepspeed

# Download the BPO training code into the llava folder
cd llava
git clone THIS_GIT
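
After installation, a quick look at the installed versions can catch a mismatched environment before training starts. The following is a minimal sketch and not part of this repository: the names `PINNED`, `installed_versions`, and `report` are hypothetical, and the expected versions are simply the ones pinned in the conda command above.

```python
from importlib import metadata

# Versions pinned by the conda command above; the dict name is hypothetical.
# Note: the installed distribution for PyTorch is named "torch".
PINNED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def installed_versions(packages):
    """Return {package: version or None} for each requested distribution."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


def report(pinned=PINNED):
    """Build one status line per pinned package."""
    lines = []
    for name, have in installed_versions(pinned).items():
        want = pinned[name]
        if have is None:
            status = "missing"
        elif have.split("+")[0] == want:  # ignore local build tags such as +cu118
            status = "ok"
        else:
            status = f"mismatch (found {have})"
        lines.append(f"{name}: expected {want}, {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running it inside the activated `mmsafe` environment prints one line per package.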

Download the LLaVA-v1.5 checkpoint

Before RLHF training, download the LLaVA-v1.5-7b checkpoint from Hugging Face, or the LLaVA-v1.5-13b checkpoint from https://huggingface.co/liuhaotian/llava-v1.5-13b.
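
The checkpoint can also be fetched programmatically. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The 13b repo id is taken from the URL above, while the 7b repo id is an assumption that follows the same naming pattern. The helper name `download_llava` and the `./checkpoints` target directory are hypothetical.

```python
# Hugging Face repo ids. The 13b id comes from the URL above; the 7b id is an
# assumption based on the same naming pattern.
LLAVA_REPOS = {
    "7b": "liuhaotian/llava-v1.5-7b",
    "13b": "liuhaotian/llava-v1.5-13b",
}


def download_llava(size="7b", target_dir="./checkpoints"):
    """Download one LLaVA-v1.5 checkpoint and return its local path."""
    repo_id = LLAVA_REPOS[size]  # raises KeyError for an unknown size

    # Imported lazily so this file can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    local_dir = f"{target_dir}/{repo_id.split('/')[-1]}"
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_llava("13b")` would place the files under `./checkpoints/llava-v1.5-13b`; that path can then be filled into the training script.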

Prepare the dataset

Download our dataset, generate corresponding responses using LLaVA, and use the generated responses as new rejected data. Then, concatenate them with the original data to form the training data for BPO.
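
The concatenation step described above could look roughly like the following. This is only a sketch under stated assumptions: the record keys `id`, `chosen`, and `rejected` and the function name `add_generated_rejections` are hypothetical, and the real schema of the files in the datasets folder may differ.

```python
import json


def add_generated_rejections(original, generated):
    """Pair each original chosen response with a model-generated rejected one.

    original:  list of dicts with hypothetical keys "id", "chosen", "rejected"
    generated: dict mapping sample id -> response generated by LLaVA
    """
    extra = []
    for sample in original:
        response = generated.get(sample["id"])
        if response is None:
            continue  # no generated response for this sample
        new_sample = dict(sample)          # shallow copy keeps the other fields
        new_sample["rejected"] = response  # generated response becomes the rejected side
        extra.append(new_sample)
    # Concatenate: the original pairs first, then the newly built pairs.
    return original + extra


if __name__ == "__main__":
    original = [{"id": "0", "chosen": "safe reply", "rejected": "unsafe reply"}]
    generated = {"0": "reply generated by LLaVA"}
    print(json.dumps(add_generated_rejections(original, generated), indent=2))
```

The original records are left unmodified, so the same source file can be reused to build several training sets.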
Also, prepare a JSON file that includes the paths to all the JSON files you want to use for training. The format should be as follows:
```json
{
    "keyword1": "/path/to/your/dataset1.json",
    "keyword2": "/path/to/your/dataset2.json",
    "keyword3": "/path/to/your/dataset3.json"
}
```
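
An index file in this format can be written and sanity-checked with a few lines of Python. The sketch below is illustrative only: the function names `write_dataset_index` and `missing_datasets` are hypothetical and are not part of the training code.

```python
import json
import os


def write_dataset_index(index, index_path):
    """Write a {keyword: json_path} mapping in the format shown above."""
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=4)


def missing_datasets(index_path):
    """Return the keywords whose JSON file does not exist on disk."""
    with open(index_path, encoding="utf-8") as f:
        index = json.load(f)
    return [key for key, path in index.items() if not os.path.isfile(path)]
```

Checking `missing_datasets` before launching training surfaces a mistyped path immediately, rather than partway through data loading.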

MMSafe-PO BPO Training

First fill in the required paths in the script, then run it:

bash ./scripts/bpo.sh

Infer

We provide a script for 8-GPU parallel inference to facilitate subsequent evaluation of the model. Run it as follows:

bash ./scripts/infer_json.sh
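
For readers unfamiliar with multi-GPU inference, one common way to split an evaluation set across 8 workers is a strided split by rank. The sketch below illustrates that general idea only; it is an assumption and not necessarily how `infer_json.sh` or `infer_json.py` divides the data. The function names are hypothetical.

```python
def shard_for_rank(samples, rank, world_size=8):
    """Return the samples handled by worker `rank` out of `world_size`."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    # Strided split: worker r takes items r, r + world_size, r + 2 * world_size, ...
    return samples[rank::world_size]


def merge_shards(shards):
    """Interleave per-worker outputs back into the original order."""
    total = sum(len(s) for s in shards)
    merged = [None] * total
    for rank, shard in enumerate(shards):
        merged[rank::len(shards)] = shard
    return merged
```

A strided split keeps the per-worker load within one sample of each other, and `merge_shards` restores the original ordering so that outputs line up with the input file.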

Citation

If you find this dataset/model/paper helpful, please cite the following:

@misc{li2025harmlessmultimodalassistantsblind,
      title={Towards Harmless Multimodal Assistants with Blind Preference Optimization}, 
      author={Yongqi Li and Lu Yang and Jian Wang and Runyang You and Wenjie Li and Liqiang Nie},
      year={2025},
      eprint={2503.14189},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.14189}, 
}
