Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantically critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
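The core idea can be pictured as an entropy-gated logit bias in the spirit of green-list watermarking: the bias is strong when generation is uncertain and backs off for tokens that are strongly grounded in the image. The sketch below is a conceptual illustration only; the function names, the `alignment`/`green_mask` inputs, and the `delta` strength are assumptions, not the released implementation in `vla.py`.

```python
import torch

def entropy_gate(logits: torch.Tensor) -> torch.Tensor:
    """Normalized Shannon entropy of the next-token distribution, in [0, 1]."""
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return ent / torch.log(torch.tensor(float(logits.shape[-1])))

def biased_logits(logits: torch.Tensor,
                  alignment: torch.Tensor,   # per-token visual alignment in [0, 1]
                  green_mask: torch.Tensor,  # keyed 0/1 vocabulary partition
                  delta: float = 2.0) -> torch.Tensor:
    """Boost keyed 'green' tokens, backing off when the step is low-entropy
    or when a candidate token is strongly image-aligned."""
    gate = entropy_gate(logits)        # ~0 = confident step, ~1 = uncertain step
    protect = 1.0 - alignment          # high visual grounding => little or no bias
    return logits + delta * gate * protect * green_mask
```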
Repository layout:

VLA/
├── main.py                      # Main watermarking script
├── detection.py                 # Watermark detection pipeline
├── vla.py                       # Core VLA watermarking implementation
├── mark_utils.py                # Utility functions for watermarking
├── run.sh                       # Main execution script
├── dataset/                     # Dataset directory
│   └── amber/                   # AMBER dataset
│       ├── image/               # Images for testing
│       └── query_generative.json  # Query dataset
└── result/                      # Results and outputs directory
Requirements:

- Python 3.11
- PyTorch
- Transformers
- MarkLLM
- PIL (Pillow)
- pandas, numpy, scikit-learn
- tqdm
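A quick way to confirm these packages are importable is sketched below. The `markllm` import name is an assumption; install MarkLLM per its own repository if pip cannot resolve it.

```python
# Sanity check that the listed dependencies are importable.
import importlib

for pkg in ["torch", "transformers", "markllm", "PIL", "pandas", "numpy", "sklearn", "tqdm"]:
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg:12s} {getattr(module, '__version__', 'ok')}")
    except ImportError:
        print(f"{pkg:12s} MISSING")
```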
# Clone the repository
git clone https://github.com/shiningwhite-cmd/VLA-mark.git
cd VLA-mark
# Install dependencies
pip install -r requirements.txt

This project uses the AMBER (An LLM-free Multi-dimensional Benchmark for MLLMs Evaluation) dataset for evaluation. AMBER is a comprehensive benchmark designed to evaluate multimodal large language models (MLLMs) across multiple dimensions without relying on LLMs for assessment.
Since the dataset folder is empty in the repository, you need to download the AMBER dataset manually:
# Create dataset directory structure
mkdir -p dataset/amber
# Download the query file
curl -o dataset/amber/query_generative.json https://raw.githubusercontent.com/junyangwang0410/AMBER/master/data/query/query_generative.json
# Download and setup images
# Clone AMBER repository to get download scripts
git clone https://github.com/junyangwang0410/AMBER.git temp_amber
# Follow AMBER's instructions to download images
cd temp_amber
# Download images according to AMBER's README instructions
# This typically involves downloading from cloud storage services
# Move images to the correct location
mv path/to/downloaded/images ../dataset/amber/image/
# Clean up temporary files
cd ..
rm -rf temp_amber

Alternative manual setup (a Python sketch of the query-file step follows this list):
- Query file: download query_generative.json from https://github.com/junyangwang0410/AMBER/blob/master/data/query/query_generative.json
- Images: follow the download instructions at https://github.com/junyangwang0410/AMBER
- Download the image files according to the AMBER repository instructions
- Extract and place all images in the dataset/amber/image/ directory
- Ensure images follow the naming convention: AMBER_{ID}.jpg
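If you prefer to script the query-file step in Python, a minimal helper along these lines works. The URL is the same raw GitHub path used above; images still have to be downloaded by following the AMBER README.

```python
# Optional Python helper for the query-file step; paths mirror the layout above.
import os
import re
import urllib.request

QUERY_URL = ("https://raw.githubusercontent.com/junyangwang0410/AMBER/"
             "master/data/query/query_generative.json")

os.makedirs("dataset/amber/image", exist_ok=True)
urllib.request.urlretrieve(QUERY_URL, "dataset/amber/query_generative.json")

# Flag any image that does not follow the AMBER_{ID}.jpg naming convention.
bad = [f for f in os.listdir("dataset/amber/image")
       if not re.fullmatch(r"AMBER_\d+\.jpg", f)]
print(f"{len(bad)} file(s) do not match AMBER_{{ID}}.jpg")
```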
# Check if dataset is properly set up
ls dataset/amber/
# Should show: image/ query_generative.json
# Check number of images (should be 1004 for full AMBER dataset)
ls dataset/amber/image/ | wc -l
# Verify query file format
head -5 dataset/amber/query_generative.json

To run the watermarking pipeline, set the model path and launch the provided script:

export MODEL_PATH=/path/to/your/model
bash run.sh

Or run directly with Python:
python3 main.py \
--json_path ./dataset/amber/query_generative.json \
--image_dir ./dataset/amber/image \
--model_path "${MODEL_PATH}" \
--range_num 1004 \
--model_name llava \
--task_name AMBER \
--data_suffix .jpg \
--similarity_scheme cosine \
--max_tokens 100 \
--min_tokens 90

The system provides comprehensive evaluation metrics for watermark detection (a short computation sketch follows the list):
- ROC AUC: Area Under the ROC Curve
- F1 Score: Harmonic mean of precision and recall
- Accuracy: Overall detection accuracy
- TPR/FPR: True/False Positive Rates
- TNR/FNR: True/False Negative Rates
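These metrics can all be derived from detection scores with scikit-learn, as sketched below. The `scores`/`labels` arrays and the 0.5 decision threshold are illustrative placeholders, not values produced by the repository.

```python
# Sketch: computing the listed detection metrics from scores with scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, confusion_matrix

labels = np.array([1, 1, 0, 0, 1, 0])                 # 1 = watermarked, 0 = clean
scores = np.array([0.97, 0.88, 0.12, 0.35, 0.91, 0.20])
preds = (scores >= 0.5).astype(int)                   # illustrative threshold

tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
print("ROC AUC :", roc_auc_score(labels, scores))
print("F1      :", f1_score(labels, preds))
print("Accuracy:", accuracy_score(labels, preds))
print("TPR/FPR :", tp / (tp + fn), fp / (fp + tn))
print("TNR/FNR :", tn / (tn + fp), fn / (fn + tp))
```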
Supported models (a loading sketch follows the list):

- LLaVA: llava-1.5-7b-hf
- Qwen2-VL: Qwen2-VL-7B-Instruct
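Either backbone can be loaded with Hugging Face transformers roughly as shown below; the hub ID, dtype, and `device_map` settings are assumptions (not mandated by VLA-Mark), and `device_map="auto"` assumes accelerate is installed.

```python
# Illustrative loading of a supported backbone via Hugging Face transformers.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "llava-hf/llava-1.5-7b-hf"   # or the local path passed as --model_path
processor = AutoProcessor.from_pretrained(model_path)
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# For Qwen2-VL, swap in Qwen2VLForConditionalGeneration with
# "Qwen/Qwen2-VL-7B-Instruct" (requires a recent transformers release).
```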
The query file (query_generative.json) is a JSON list of entries in the following format:

[
{
"id": 1,
"image": "AMBER_1.jpg",
"query": "Describe this image."
},
{
"id": 2,
"image": "AMBER_2.jpg",
"query": "Describe this image."
}
]

- Images should be placed in the specified image_dir
- Supported formats: JPG, PNG
- Naming convention: {TASK_NAME}_{ID}{DATA_SUFFIX} (a sketch for generating a matching query file follows this list)
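For custom data, a query file that follows this convention can be generated along these lines. The output filename `query_custom.json` and the default query string are illustrative choices, not names used by the repository.

```python
# Sketch: build a query file for images named {TASK_NAME}_{ID}{DATA_SUFFIX}.
import json
import os
import re

task_name, data_suffix = "AMBER", ".jpg"
image_dir = "dataset/amber/image"
pattern = re.compile(rf"{task_name}_(\d+){re.escape(data_suffix)}")

entries = []
for fname in sorted(os.listdir(image_dir)):
    match = pattern.fullmatch(fname)
    if match:
        entries.append({"id": int(match.group(1)),
                        "image": fname,
                        "query": "Describe this image."})

entries.sort(key=lambda e: e["id"])
with open("dataset/amber/query_custom.json", "w") as f:
    json.dump(entries, f, indent=2)
```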
This project is licensed under the MIT License - see the LICENSE file for details.
Related projects:

- LLaVA: Large Language and Vision Assistant
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Evaluation
For questions, issues, or collaboration opportunities, please open an issue on GitHub.
If you use VLA-Mark in your research, please cite our paper:
@misc{liu2025vlamarkcrossmodalwatermark,
title={VLA-Mark: A cross modal watermark for large vision-language alignment model},
author={Shuliang Liu and Qi Zheng and Jesse Jiaxi Xu and Yibo Yan and He Geng and Aiwei Liu and Peijie Jiang and Jia Liu and Yik-Cheung Tam and Xuming Hu},
year={2025},
eprint={2507.14067},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.14067}
}