GitHub - zli12321/MM-Zero: Self-evolving vision language models from zero data

MM-Zero: Multimodal Self-Play for Vision-Language Models

Installation • Training • Evaluation • Visualization • Checkpoints • Citation

MM-Zero is a self-evolving reinforcement learning framework that improves vision-language models (VLMs) without requiring any human-annotated image data. It co-evolves three specialized agents — Proposer, CodeGen, and Solver — in an iterative loop where each agent bootstraps training signal for the others.

How It Works

Each self-play iteration trains three models in sequence:

Proposer — generates diverse visual reasoning questions
CodeGen — writes SVG code that renders into images for those questions
Solver — learns to answer the generated visual questions via GRPO
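
The Solver entry above names GRPO (Group Relative Policy Optimization). As a rough illustration of the core idea, and not the repository's implementation, the sketch below computes group-relative advantages: several answers are sampled for the same question, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are assumptions made for this example.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    Hypothetical helper for illustration; not taken from the MM-Zero code base.
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one generated question: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and wrong ones with negative advantages. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.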

At iteration i, each model evolves from its version at iteration i−1, creating a curriculum that grows in difficulty and diversity over time. The SVG rendering pipeline (cairosvg → PNG) is fully deterministic and requires no external image sources.
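
To make the rendering step concrete, here is a minimal sketch of turning SVG source into PNG bytes with CairoSVG's `svg2png`. It is not the code in `code_render/`; the wrapper `render_svg_to_png_bytes`, the `png_digest` helper, the injectable `renderer` argument, and the sample SVG are all assumptions made for illustration. A failed render returns `None` here, which is one plausible way to feed a render success rate.

```python
import hashlib
from typing import Callable, Optional

# A tiny hand-written SVG standing in for CodeGen output (illustrative only).
SVG_EXAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
    '<circle cx="32" cy="32" r="20" fill="steelblue"/></svg>'
)


def render_svg_to_png_bytes(
    svg_text: str,
    renderer: Optional[Callable[..., bytes]] = None,
) -> Optional[bytes]:
    """Return PNG bytes for the SVG source, or None if rendering fails.

    Hypothetical wrapper for illustration; not the repository's renderer.
    """
    if renderer is None:
        # Imported lazily so the sketch loads even where CairoSVG is absent.
        import cairosvg

        renderer = cairosvg.svg2png
    try:
        # With no write_to target, svg2png returns the PNG as bytes.
        return renderer(bytestring=svg_text.encode("utf-8"))
    except Exception:
        # Model-written SVG can be malformed; treat that as a failed render.
        return None


def png_digest(png: bytes) -> str:
    """Hash the rendered bytes so two renders can be compared cheaply."""
    return hashlib.sha256(png).hexdigest()
```

With CairoSVG installed, calling `render_svg_to_png_bytes(SVG_EXAMPLE)` twice and comparing `png_digest` of the results is a quick way to check determinism on a given machine.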

Built on EasyR1 / veRL.

Installation

1. Create a conda environment

conda create -n mm-zero python=3.12
conda activate mm-zero

2. Clone and install

git clone https://github.com/zli12321/MM-Zero.git
cd MM-Zero
bash setup.sh

setup.sh installs PyTorch (CUDA 12.8), vLLM, flash-attention, and all dependencies. It will also prompt you to log in to Weights & Biases for experiment tracking.

Hardware Requirements

- 8× GPUs (80 GB each recommended, e.g., A100/H100)
  - 2 GPUs for GRPO training + 6 GPUs for vLLM inference (Proposer/CodeGen phases)
  - All 8 GPUs for Solver GRPO training
- 40 GB GPUs are supported by setting GPU_MEM=40
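
The GPU split described above can be written down as a small lookup. This is only a sketch of the allocation as stated in this section; the function `gpu_plan`, the phase names as strings, and the choice of which device indices go to training are assumptions, and the actual scripts may assign devices differently.

```python
from typing import Dict, List


def gpu_plan(phase: str, num_gpus: int = 8, train_gpus: int = 2) -> Dict[str, List[int]]:
    """Map a pipeline phase to training and inference GPU indices.

    Hypothetical helper; the index assignment is an assumption of this sketch.
    """
    gpus = list(range(num_gpus))
    if phase == "solver":
        # Solver GRPO training uses every GPU.
        return {"train": gpus, "inference": []}
    if phase in ("proposer", "codegen"):
        # A small training group plus a larger vLLM inference group.
        return {"train": gpus[:train_gpus], "inference": gpus[train_gpus:]}
    raise ValueError(f"unknown phase: {phase!r}")


proposer_plan = gpu_plan("proposer")
solver_plan = gpu_plan("solver")
```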

Training

Launch the full self-play pipeline with a single command:

## Qwen3-VL-8B-Instruct
bash ./scripts/main_svg.sh

## Qwen3-VL-4B-Instruct
bash ./scripts/main_qwen3vl_4b.sh

This runs the complete iterative loop (Proposer → CodeGen → Solver) for multiple iterations, starting from a base model. Each iteration builds on the previous one's checkpoints.
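
The order of work in that loop can be sketched as a schedule. The code below only models the lineage described here (each role trains in sequence, and each role at iteration i starts from its own checkpoint at iteration i−1, or from the base model in the first iteration). The names `ROLES`, `self_play_schedule`, and the `<role>_iter<i>` checkpoint labels are hypothetical and do not reflect the directory names the scripts actually write.

```python
from typing import Dict, List, Tuple

ROLES = ("proposer", "codegen", "solver")  # training order within one iteration


def self_play_schedule(base_model: str, num_iterations: int) -> List[Tuple[int, str, str, str]]:
    """Return (iteration, role, init_from, output_name) tuples in execution order.

    Hypothetical helper; checkpoint labels are invented for illustration.
    """
    latest: Dict[str, str] = {role: base_model for role in ROLES}
    schedule: List[Tuple[int, str, str, str]] = []
    for i in range(1, num_iterations + 1):
        for role in ROLES:
            output_name = f"{role}_iter{i}"
            # Each role resumes from its own most recent checkpoint.
            schedule.append((i, role, latest[role], output_name))
            latest[role] = output_name
    return schedule


schedule = self_play_schedule("Qwen/Qwen3-VL-8B-Instruct", num_iterations=2)
```

In this sketch the second-iteration Proposer starts from `proposer_iter1`, while all three first-iteration roles start from the base model.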

Configuration

Key environment variables (all have sensible defaults):

| Variable | Default | Description |
| --- | --- | --- |
| `STORAGE_PATH` | `/workspace/selfAgent_Storage_svg_long_round6_filter` | Output directory for all checkpoints, proposals, and images |
| `Base_model` | `Qwen/Qwen3-VL-8B-Instruct` | HuggingFace model ID or local path |
| `GPU_MEM` | `80` | GPU memory tier in GB (`40` or `80`) |
| `TRAIN_STEPS` | `20` | Training steps per model per iteration |

Example with custom settings:

STORAGE_PATH=/my/experiment \
Base_model=Qwen/Qwen3-VL-8B-Instruct \
GPU_MEM=80 \
TRAIN_STEPS=20 \
bash MM-zero_final/scripts/main_svg.sh
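
For readers who want to mirror these settings in their own Python tooling, the sketch below reads the same four variables with the defaults listed in the table. It is not part of the repository: `PipelineConfig` and `load_config` are hypothetical names, and the check that `GPU_MEM` is 40 or 80 is an assumption based on the table's description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    storage_path: str
    base_model: str
    gpu_mem: int
    train_steps: int


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Read the documented variables, falling back to the documented defaults.

    Hypothetical helper for illustration; the shell scripts do their own parsing.
    """
    env = os.environ if env is None else env
    gpu_mem = int(env.get("GPU_MEM", "80"))
    if gpu_mem not in (40, 80):
        # The README lists only two memory tiers.
        raise ValueError(f"GPU_MEM must be 40 or 80, got {gpu_mem}")
    return PipelineConfig(
        storage_path=env.get(
            "STORAGE_PATH", "/workspace/selfAgent_Storage_svg_long_round6_filter"
        ),
        base_model=env.get("Base_model", "Qwen/Qwen3-VL-8B-Instruct"),
        gpu_mem=gpu_mem,
        train_steps=int(env.get("TRAIN_STEPS", "20")),
    )


defaults = load_config({})
custom = load_config({"STORAGE_PATH": "/my/experiment", "GPU_MEM": "40"})
```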

The pipeline is resumable — it automatically detects existing checkpoints and skips completed phases.
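
One simple way such resume logic can work is to treat a phase as complete when its checkpoint directory exists and is non-empty. The sketch below illustrates that idea only; the `models/<phase>` layout, the phase labels, and the function `pending_phases` are assumptions, and the real scripts may use different paths or completion checks.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def pending_phases(storage: Path, phases: Iterable[str]) -> List[str]:
    """Return the phases that still lack a non-empty checkpoint directory.

    Hypothetical helper; directory names are invented for illustration.
    """
    models = storage / "models"
    todo: List[str] = []
    for name in phases:
        ckpt = models / name
        finished = ckpt.is_dir() and any(ckpt.iterdir())
        if not finished:
            todo.append(name)
    return todo


# Demo: pretend the Proposer phase of iteration 1 already finished.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)
    done_dir = storage / "models" / "proposer_iter1"
    done_dir.mkdir(parents=True)
    (done_dir / "weights.bin").write_bytes(b"")
    remaining = pending_phases(
        storage, ["proposer_iter1", "codegen_iter1", "solver_iter1"]
    )
```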

Other Training Scripts

| Script | Description |
| --- | --- |
| `MM-zero_final/scripts/main_svg.sh` | Full VLM self-evolving pipeline for Qwen3-VL-8B |
| `MM-zero_final/scripts/main_qwen3vl_4b.sh` | Full VLM self-evolving pipeline for Qwen3-VL-4B |
| `MM-zero_final/scripts/main_svg_mino.sh` | Full VLM self-evolving pipeline for MiMo-VL-7B-SFT |
| `MM-zero_final/scripts/proposer_train.sh` | Proposer-only training |
| `MM-zero_final/scripts/codegen_train.sh` | CodeGen-only training |
| `MM-zero_final/scripts/solver_train.sh` | Solver-only training |

Evaluation

After training, evaluate the solver checkpoints on 12 multimodal benchmarks:

STORAGE=/path/to/your/storage bash run_eval.sh

The STORAGE path should point to the same directory used during training (i.e., STORAGE_PATH). The script automatically discovers all solver checkpoints under STORAGE/models/ and evaluates them with 8-way data parallelism.
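
As an illustration of what 8-way data parallelism means for evaluation, the sketch below splits a list of benchmark samples into eight near-equal shards, one per GPU worker. The round-robin strategy and the name `shard` are assumptions made for this example; `run_eval.sh` may partition the data differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_shards: int = 8) -> List[List[T]]:
    """Round-robin split so that shard sizes differ by at most one.

    Hypothetical helper; not taken from the evaluation scripts.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Worker `rank` takes items rank, rank + num_shards, rank + 2 * num_shards, ...
    return [list(items[rank::num_shards]) for rank in range(num_shards)]


sample_ids = list(range(20))  # stand-in for benchmark sample indices
shards = shard(sample_ids)
```

Each worker evaluates its own shard independently, and the per-shard results are concatenated afterwards to compute accuracy over the full benchmark.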

Evaluated benchmarks: MMSI, MathVerse, MathVision, MathVista, MM-Vet, MMMU-Pro (4-option), VisNumBench, MMMU-Pro (10-option), MMMU-Pro-Vision, HallusionBench, MMMU, ChartQA.

Evaluation outputs are saved to STORAGE/eval_responses/, including per-model accuracy breakdowns and an LLM judge pass using Qwen2.5-14B-Instruct.

Visualization

Compare evaluation accuracy across training iterations vs. the base model:

python eval_accuracy_comparison.py STORAGE_PATH/eval_responses/llm_accuracy_summary.jsonl

To plot co-evolution metrics (difficulty, solvability, diversity, render success rate, etc.):

python plot_coevolution.py \
    --storage_dirs /path/to/your/storage \
    --model_name "Qwen3-VL-8B-Instruct"

Pre-trained Checkpoints

Pre-trained checkpoints and full training logs for Qwen3-VL-8B-Instruct are available on Hugging Face:

| Resource | Link |
| --- | --- |
| Training logs & eval results | IntelligenceLab/MM-Zero-Logs |

The logs include all model checkpoints across iterations and evaluation results on all 12 benchmarks.

To download to a specific folder:

huggingface-cli download IntelligenceLab/MM-Zero-Logs --local-dir /path/to/your/storage

Then point STORAGE to that folder when running evaluation or visualization.

Project Structure

Self-Agent/
├── MM-zero_final/
│   ├── scripts/              # Training orchestration scripts
│   ├── proposal_generate/    # Proposer inference & data generation
│   ├── code_generate/        # CodeGen inference
│   ├── code_render/          # SVG → PNG rendering pipeline
│   ├── question_evaluate/    # Question quality evaluation
│   ├── reward_function/      # GRPO reward functions
│   ├── configs/              # Training configurations
│   └── data/                 # Data utilities
├── verl/                     # Modified veRL/EasyR1 training engine
├── eval_generate.py          # Benchmark evaluation inference
├── llm_judge_eval.py         # LLM-based judge evaluation
├── eval_accuracy_comparison.py  # Accuracy comparison & plotting
├── plot_coevolution.py       # Co-evolution metric visualization
├── run_eval.sh               # Full evaluation pipeline
└── setup.sh                  # Environment setup

License

This project is released under the Apache 2.0 License.

Citation

@misc{li2026mmzeroselfevolvingmultimodelvision,
      title={MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data}, 
      author={Zongxia Li and Hongyang Du and Chengsong Huang and Xiyang Wu and Lantao Yu and Yicheng He and Jing Xie and Xiaomin Wu and Zhichao Liu and Jiarui Zhang and Fuxiao Liu},
      year={2026},
      eprint={2603.09206},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.09206}, 
}

Acknowledgements

EasyR1 and veRL for the RL training framework
vLLM for efficient inference
