VLA-Mark: A cross modal watermark for large vision-language alignment models

Shuliang Liu1, 2, Qi Zheng1, 2, Jesse Jiaxi Xu3, Yibo Yan1, 2, He Geng1, 2, Aiwei Liu1, 2
Peijie Jiang4, Jia Liu4, Yik-Cheung Tam5, Xuming Hu1, 2
1 The Hong Kong University of Science and Technology (Guangzhou)
2 The Hong Kong University of Science and Technology
3 University of Toronto 4 Ant Group, Alibaba 5 New York University Shanghai

Abstract

Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
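The released watermarking logic lives in vla.py. Purely as a hedged sketch of the mechanism the abstract describes, a keyed green-list bias whose strength is gated by per-step entropy and which exempts the most vision-aligned tokens, one could write something like the following; every name, default, and threshold here is an illustrative assumption, not the project's API.

import torch

def watermark_logits(logits, alignment_scores, key=12345,
                     gamma=0.5, delta=2.0, tau=2.0, n_protect=32):
    # Conceptual sketch only: entropy-gated green-list biasing that exempts
    # the most vision-aligned tokens. Names and defaults are illustrative.
    vocab = logits.shape[-1]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()

    # Keyed pseudo-random "green" partition covering a gamma fraction of the vocabulary.
    gen = torch.Generator().manual_seed(key)
    green = torch.randperm(vocab, generator=gen)[: int(gamma * vocab)]

    # Entropy-sensitive strength: confident (low-entropy) steps get a weaker bias,
    # so visually grounded, near-deterministic continuations stay largely untouched.
    strength = delta * torch.sigmoid(entropy - tau)

    biased = logits.clone()
    biased[green] += strength

    # Never let the watermark out-bias the tokens most aligned with the image.
    protected = alignment_scores.topk(min(n_protect, vocab)).indices
    biased[protected] = torch.maximum(biased[protected], logits[protected] + strength)
    return biased

In the actual framework the alignment scores come from the multiscale visual-textual metrics (patch affinity, global coherence, attention patterns) and the bias is applied at every decoding step; here they are simply an input tensor.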

VLA-mark

πŸ—οΈ Project Structure

VLA/
β”œβ”€β”€ main.py                 # Main watermarking script
β”œβ”€β”€ detection.py           # Watermark detection pipeline
β”œβ”€β”€ vla.py                 # Core VLA watermarking implementation
β”œβ”€β”€ mark_utils.py          # Utility functions for watermarking
β”œβ”€β”€ run.sh                 # Main execution script
β”œβ”€β”€ dataset/               # Dataset directory
β”‚   └── amber/            # AMBER dataset
β”‚       β”œβ”€β”€ image/        # Images for testing
β”‚       └── query_generative.json  # Query dataset
└── result/               # Results and outputs directory

🚀 Quick Start

Prerequisites

  • Python 3.11
  • PyTorch
  • Transformers
  • MarkLLM
  • PIL (Pillow)
  • pandas, numpy, scikit-learn
  • tqdm

Installation

# Clone the repository
git clone https://github.com/shiningwhite-cmd/VLA-mark.git
cd VLA-mark

# Install dependencies
pip install -r requirements.txt

Dataset Setup

This project uses the AMBER (An LLM-free Multi-dimensional Benchmark for MLLMs Evaluation) dataset for evaluation. AMBER is a comprehensive benchmark designed to evaluate multimodal large language models (MLLMs) across multiple dimensions without relying on LLMs for assessment.

Download AMBER Dataset

Since the dataset folder is empty in the repository, you need to download the AMBER dataset manually:

# Create dataset directory structure
mkdir -p dataset/amber

# Download the query file
curl -o dataset/amber/query_generative.json https://raw.githubusercontent.com/junyangwang0410/AMBER/master/data/query/query_generative.json

# Download and setup images
# Clone AMBER repository to get download scripts
git clone https://github.com/junyangwang0410/AMBER.git temp_amber

# Follow AMBER's instructions to download images
cd temp_amber
# Download images according to AMBER's README instructions
# This typically involves downloading from cloud storage services

# Move images to the correct location
mv path/to/downloaded/images ../dataset/amber/image/

# Clean up temporary files
cd ..
rm -rf temp_amber

Alternative manual setup:

  1. Query file: Download query_generative.json from: https://github.com/junyangwang0410/AMBER/blob/master/data/query/query_generative.json
  2. Images: Follow the download instructions at: https://github.com/junyangwang0410/AMBER
    • Download the image files according to the AMBER repository instructions
    • Extract and place all images in dataset/amber/image/ directory
    • Ensure images follow the naming convention: AMBER_{ID}.jpg

Verify Dataset Setup

# Check if dataset is properly set up
ls dataset/amber/
# Should show: image/ query_generative.json

# Check number of images (should be 1004 for full AMBER dataset)
ls dataset/amber/image/ | wc -l

# Verify query file format
head -5 dataset/amber/query_generative.json

Basic Usage

1. Set up environment variables

export MODEL_PATH=/path/to/your/model

2. Generate watermarked responses

bash run.sh

Or run directly with Python:

python3 main.py \
    --json_path ./dataset/amber/query_generative.json \
    --image_dir ./dataset/amber/image \
    --model_path "${MODEL_PATH}" \
    --range_num 1004 \
    --model_name llava \
    --task_name AMBER \
    --data_suffix .jpg \
    --similarity_scheme cosine \
    --max_tokens 100 \
    --min_tokens 90

📊 Evaluation Metrics

The detection pipeline reports the following evaluation metrics (a short scikit-learn sketch of how they can be computed follows the list):

  • ROC AUC: Area Under the ROC Curve
  • F1 Score: Harmonic mean of precision and recall
  • Accuracy: Overall detection accuracy
  • TPR/FPR: True/False Positive Rates
  • TNR/FNR: True/False Negative Rates
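For reference, the sketch below shows one way these numbers can be computed with scikit-learn from detection scores and labels; detection.py does its own reporting, and the score/label arrays here are placeholder values.

# Placeholder data: `scores` are detection statistics, `labels` mark
# watermarked (1) vs. clean (0) text; the 0.5 threshold is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, confusion_matrix

scores = np.array([0.91, 0.12, 0.85, 0.40, 0.77, 0.05])
labels = np.array([1, 0, 1, 0, 1, 0])
preds = (scores >= 0.5).astype(int)

auc = roc_auc_score(labels, scores)
f1 = f1_score(labels, preds)
acc = accuracy_score(labels, preds)

tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
tpr, fpr = tp / (tp + fn), fp / (fp + tn)
tnr, fnr = tn / (tn + fp), fn / (fn + tp)
print(f"AUC={auc:.3f} F1={f1:.3f} Acc={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")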

🎯 Supported Models

  • LLaVA: llava-1.5-7b-hf
  • Qwen2-VL: Qwen2-VL-7B-Instruct
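Both checkpoints are loaded through Hugging Face transformers via --model_path. As a minimal, self-contained illustration of loading the LLaVA model outside of main.py (the hub id, prompt template, and generation settings below are assumptions for demonstration, not the project's exact loading code):

# Illustrative loading sketch for llava-1.5-7b-hf; main.py wraps this itself.
import os
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = os.environ.get("MODEL_PATH", "llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained(model_path)
model = LlavaForConditionalGeneration.from_pretrained(model_path, device_map="auto")

image = Image.open("dataset/amber/image/AMBER_1.jpg")
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))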

πŸ“ Dataset Format

Query Dataset (query_generative.json)

[
    {
        "id": 1,
        "image": "AMBER_1.jpg",
        "query": "Describe this image."
    },
    {
        "id": 2,
        "image": "AMBER_2.jpg",
        "query": "Describe this image."
    }
]

Image Dataset

  • Images should be placed in the specified image_dir
  • Supported formats: JPG, PNG
  • Naming convention: {TASK_NAME}_{ID}{DATA_SUFFIX}
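To make the convention concrete, the sketch below pairs each query entry with the image path it should resolve to; main.py performs this mapping internally, so the snippet is purely illustrative.

# Map query entries to image files via {TASK_NAME}_{ID}{DATA_SUFFIX}.
import json
from pathlib import Path

task_name, data_suffix = "AMBER", ".jpg"
image_dir = Path("dataset/amber/image")

with open("dataset/amber/query_generative.json") as f:
    queries = json.load(f)

for item in queries[:3]:
    image_path = image_dir / f"{task_name}_{item['id']}{data_suffix}"
    assert image_path.name == item["image"], "entry does not match the naming convention"
    print(image_path, "->", item["query"])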

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 References

📧 Contact

For questions, issues, or collaboration opportunities, please open an issue on GitHub.

📖 Citation

If you use VLA-Mark in your research, please cite our paper:

@misc{liu2025vlamarkcrossmodalwatermark,
      title={VLA-Mark: A cross modal watermark for large vision-language alignment model},
      author={Shuliang Liu and Qi Zheng and Jesse Jiaxi Xu and Yibo Yan and He Geng and Aiwei Liu and Peijie Jiang and Jia Liu and Yik-Cheung Tam and Xuming Hu},
      year={2025},
      eprint={2507.14067},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.14067}
}
