Zhekai Chen1, Yuqing Wang1, Manyuan Zhang†2, Xihui Liu†1
1HKU MMLab 2Meituan
†Corresponding authors
Macro is a multi-reference image generation dataset and benchmark. It covers four task categories — Customization, Illustration, Spatial, and Temporal — across four image-count brackets (1–3, 4–5, 6–7, ≥8 reference images). Alongside the dataset we provide fine-tuned checkpoints for three open-source models: Bagel, OmniGen2, and Qwen-Image-Edit.
Three ready-to-use test scripts are under `scripts/`. Each script falls back to a built-in sample (`assets/test_example/4-5_sample1.json`) when no prompt or images are given.
| Checkpoint | HuggingFace |
|---|---|
| Macro-Bagel | [Azily/Macro-Bagel](https://huggingface.co/Azily/Macro-Bagel) |
| Macro-OmniGen2 | [Azily/Macro-OmniGen2](https://huggingface.co/Azily/Macro-OmniGen2) |
| Macro-Qwen-Image-Edit | [Azily/Macro-Qwen-Image-Edit](https://huggingface.co/Azily/Macro-Qwen-Image-Edit) |
```bash
huggingface-cli download Azily/Macro-Bagel --local-dir ckpts/Macro-Bagel
huggingface-cli download Azily/Macro-OmniGen2 --local-dir ckpts/Macro-OmniGen2
huggingface-cli download Azily/Macro-Qwen-Image-Edit --local-dir ckpts/Macro-Qwen-Image-Edit
```

All three test scripts share these arguments:
| Argument | Default | Description |
|---|---|---|
| `--prompt` | (from sample) | Text instruction |
| `--input_images` | (from sample) | One or more reference image paths |
| `--output` | `outputs/test_<model>.jpg` | Output image path |
| `--height` / `--width` | 768 | Output resolution |
| `--seed` | 42 | Random seed |
```bash
# Default sample
python scripts/test_bagel.py

# Custom inputs
python scripts/test_bagel.py \
    --model_path ckpts/Macro-Bagel \
    --prompt "Generate an image of …" \
    --input_images img1.jpg img2.jpg
```

| Model Argument | Default | Description |
|---|---|---|
| `--model_path` | `ckpts/Macro-Bagel` | Fine-tuned Bagel checkpoint (also used as base model) |
| `--base_model_path` | (same as `--model_path`) | Override base model path (or env `BAGEL_BASE_MODEL_PATH`) |
```bash
# Default sample
python scripts/test_omnigen.py

# Custom inputs
python scripts/test_omnigen.py \
    --model_path ckpts/Macro-OmniGen2 \
    --transformer_path ckpts/Macro-OmniGen2/transformer \
    --prompt "Generate an image of …" \
    --input_images img1.jpg img2.jpg
```

| Model Argument | Default | Description |
|---|---|---|
| `--model_path` | `ckpts/Macro-OmniGen2` | OmniGen2 base model dir (or env `OMNIGEN_MODEL_PATH`) |
| `--transformer_path` | `ckpts/Macro-OmniGen2/transformer` | Fine-tuned transformer; omit to use the base model |
| `--enable_model_cpu_offload` | `False` | CPU offload to reduce GPU memory |
```bash
# Default sample
python scripts/test_qwen.py

# Custom inputs
python scripts/test_qwen.py \
    --model_base_path ckpts \
    --model_id Macro-Qwen-Image-Edit \
    --prompt "Generate an image of …" \
    --input_images img1.jpg img2.jpg
```

| Model Argument | Default | Description |
|---|---|---|
| `--model_base_path` | `ckpts` | Root dir containing model sub-folders (or env `DIFFSYNTH_MODEL_BASE_PATH`) |
| `--model_id` | `Macro-Qwen-Image-Edit` | Sub-directory name under `model_base_path` |
Each model has a dedicated inference/config.yaml. Edit it to set checkpoint paths and tasks, then run:
```bash
# Run all checkpoints in the config
python bagel/inference/run.py --config bagel/inference/config.yaml
python omnigen/inference/run.py --config omnigen/inference/config.yaml
python qwen/inference/run.py --config qwen/inference/config.yaml

# Run a single checkpoint / task / category
python bagel/inference/run.py --config bagel/inference/config.yaml \
    --ckpt bagel_macro --task customization --category 4-5
```

The config files share a common structure:
```yaml
global_config:
  output_root: ./outputs/<model>   # where generated images are saved
  data_root: ./data/filter         # evaluation data root

# YAML anchors for reusable task/category selections
common_tasks:
  all_tasks: &all_tasks
    customization: ["1-3", "4-5", "6-7", ">=8"]
    illustration: ["1-3", "4-5", "6-7", ">=8"]
    spatial: ["1-3", "4-5", "6-7", ">=8"]
    temporal: ["1-3", "4-5", "6-7", ">=8"]

checkpoints:
  my_checkpoint:
    path: ./ckpts/...              # checkpoint path
    # transformer_path: ...        # optional: fine-tuned transformer (OmniGen2 / Qwen)
    # base_model_path: ...         # optional: override base model path (Bagel only)
    tasks: *all_tasks
```

Results are written to `outputs/<model>/<checkpoint_name>/<task>/<category>/`.
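Under this config, each checkpoint's task mapping expands to one output directory per (task, category) pair. A minimal sketch of that expansion, assuming the directory layout described here (the function name is ours, not part of the repo):

```python
from pathlib import Path

# Mirrors the *all_tasks anchor in the config above.
all_tasks = {
    "customization": ["1-3", "4-5", "6-7", ">=8"],
    "illustration": ["1-3", "4-5", "6-7", ">=8"],
    "spatial": ["1-3", "4-5", "6-7", ">=8"],
    "temporal": ["1-3", "4-5", "6-7", ">=8"],
}

def expected_output_dirs(output_root: str, checkpoint: str, tasks: dict) -> list:
    """List the outputs/<model>/<checkpoint>/<task>/<category>/ paths
    implied by a task -> categories mapping (hypothetical helper)."""
    return [
        Path(output_root) / checkpoint / task / category
        for task, categories in tasks.items()
        for category in categories
    ]

dirs = expected_output_dirs("./outputs/bagel", "bagel_macro", all_tasks)
print(len(dirs))  # 4 tasks x 4 categories = 16 directories
```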
Dynamic resolution — input images are automatically resized based on the reference-image count: `[1M, 1M, 590K, 590K, 590K, 262K, 262K, 262K, 262K, 262K]` pixels for 1, 2, 3, …, 10+ images.
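The bucket rule above can be written as a simple lookup. This is a sketch under the assumption that `1M` ≈ 1,000,000 pixels and that counts beyond ten reuse the last entry (the function name is ours):

```python
# Per-image pixel budgets for 1, 2, 3, ... reference images,
# taken from the dynamic-resolution note; 10+ images reuse the last entry.
PIXEL_BUDGETS = [
    1_000_000, 1_000_000,                          # 1-2 images
    590_000, 590_000, 590_000,                     # 3-5 images
    262_000, 262_000, 262_000, 262_000, 262_000,   # 6-10+ images
]

def pixel_budget(num_reference_images: int) -> int:
    """Return the per-image pixel budget for a given reference-image count."""
    index = min(num_reference_images, len(PIXEL_BUDGETS)) - 1
    return PIXEL_BUDGETS[index]

print(pixel_budget(1))   # 1000000
print(pixel_budget(4))   # 590000
print(pixel_budget(12))  # 262000
```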
Option A — environment variables:
```bash
# GPT / OpenAI-compatible endpoint
export OPENAI_URL=https://api.openai.com/v1/chat/completions
export OPENAI_KEY=your_openai_api_key

# Gemini (default model: gemini-3.0-flash-preview)
export GEMINI_API_KEY=your_gemini_api_key
export GEMINI_MODEL_NAME=gemini-3.0-flash-preview  # optional
```

Option B — eval/config.yaml (overrides environment variables):
```yaml
global_config:
  api_config:
    openai:
      url: "https://api.openai.com/v1/chat/completions"
      key: "your_openai_api_key"
    gemini:
      api_key: "your_gemini_api_key"
      model_name: "gemini-3.0-flash-preview"
```

Edit `eval/config.yaml`:
```yaml
global_config:
  output_root: ./outputs   # must match inference output_root
  use_gpt: false           # set true to enable GPT scoring
  use_gemini: true         # set true to enable Gemini scoring
  parallel_workers: 16

evaluations:
  bagel:
    bagel_macro:
      tasks: *all_tasks
  omnigen:
    omnigen2_macro:
      tasks: *all_tasks
  qwen:
    qwen_macro:
      tasks: *all_tasks
```

```bash
# Run all configured evaluations
python eval/run_eval.py

# Run a specific model / experiment
python eval/run_eval.py --baseline bagel --exp bagel_macro
python eval/run_eval.py --baseline omnigen --exp omnigen2_macro
python eval/run_eval.py --baseline qwen --exp qwen_macro
```

Scores are saved as JSON files alongside the generated images in `outputs/<model>/<exp>/<task>/<category>/`.
The Macro dataset is available on Hugging Face as a collection of .tar.gz archives:
```bash
# Download all archives into data_tar/
huggingface-cli download Azily/Macro-Dataset --repo-type dataset --local-dir data_tar/
```

Selective download: if you only need the evaluation benchmark (JSON index files, no images), download just `filter.tar.gz` (~510 MB):

```bash
huggingface-cli download Azily/Macro-Dataset \
    --repo-type dataset \
    --include "filter.tar.gz" \
    --local-dir data_tar/
```

To download a specific task/split/category (e.g., customization train 1–3 images):

```bash
huggingface-cli download Azily/Macro-Dataset \
    --repo-type dataset \
    --include "final_customization_train_1-3.tar.gz" \
    --local-dir data_tar/
```
An extraction script `extract_data.sh` is included in the downloaded `data_tar/` folder. Run it from the project root to restore the original `data/` tree:

```bash
bash data_tar/extract_data.sh ./data_tar .
# Restores: ./data/filter/, ./data/final/, ./data/raw/
```

Or extract all archives manually:

```bash
for f in data_tar/*.tar.gz; do tar -xzf "$f" -C .; done
```

After extraction, the `data/` directory has the following layout:
```text
data/
├── filter/                  # JSON index files for training & evaluation
│   ├── customization/
│   │   ├── train/
│   │   │   ├── 1-3/  *.json   # 20,000 training samples (1–3 ref images)
│   │   │   ├── 4-5/  *.json   # 20,000 training samples
│   │   │   ├── 6-7/  *.json   # 30,000 training samples
│   │   │   └── >=8/  *.json   # 30,000 training samples
│   │   └── eval/
│   │       ├── 1-3/  *.json   # 250 evaluation samples
│   │       ├── 4-5/  *.json   # 250 evaluation samples
│   │       ├── 6-7/  *.json   # 250 evaluation samples
│   │       └── >=8/  *.json   # 250 evaluation samples
│   ├── illustration/        # (same layout; 25,000 train / 250 eval per category)
│   ├── spatial/             # (same layout)
│   └── temporal/            # (same layout)
├── final/                   # Actual image data, referenced by filter/ JSONs
│   ├── customization/
│   │   ├── train/
│   │   │   ├── 1-3/
│   │   │   │   ├── data/
│   │   │   │   │   ├── 00000000/
│   │   │   │   │   │   ├── image_1.jpg
│   │   │   │   │   │   ├── image_2.jpg (etc.)
│   │   │   │   │   │   └── image_output.jpg
│   │   │   │   │   └── ...
│   │   │   │   └── json/  *.json   # raw generation metadata
│   │   │   ├── 4-5/ ...
│   │   │   ├── 6-7/ ...
│   │   │   └── >=8/ ...
│   │   └── eval/ ...
│   ├── illustration/ ...
│   ├── spatial/ ...
│   └── temporal/ ...
└── raw/
    ├── t2i_example/
    │   ├── t2i_example.jsonl   # Placeholder T2I prompts (training format reference)
    │   └── images/             # Placeholder images
    └── customization/          # Original source images used in data construction
        ├── cloth/  *.jpg
        ├── human/  *.jpg
        ├── object/ *.jpg
        ├── scene/  *.jpg
        ├── style/  *.jpg
        └── *_train.jsonl / *_eval.jsonl
```
Each JSON file in `data/filter/` contains:

```json
{
  "task": "customization",
  "idx": 1,
  "prompt": "Create an image of the modern glass and metal interior from <image 2>, applying the classical oil painting style from <image 1> globally across the entire scene.",
  "input_images": [
    "data/final/customization/train/1-3/data/00022018/image_1.jpg",
    "data/final/customization/train/1-3/data/00022018/image_2.jpg"
  ],
  "output_image": "data/final/customization/train/1-3/data/00022018/image_output.jpg"
}
```

Note: all `input_images` and `output_image` paths are relative to the parent directory of `data/` (i.e., the project root). Place the extracted `data/` folder at the project root before running training or inference.
⚠️ This section is for reference only. The pipeline depends on internal cluster infrastructure and proprietary APIs that are not publicly available. It cannot be run directly as-is.
```text
data/raw/  →  data/split/  →  data/final/   →  data/filter/
  (raw)        (split)        (generated)      (filtered, for training/eval)
```
Stage 1 — Split: split raw source data by task and image-count category.

```bash
python data_preprocess/split/customization.py
python data_preprocess/split/illustration.py
python data_preprocess/split/spatial.py
python data_preprocess/split/temporal.py
```

Stage 2 — Generate: produce instructions and synthetic target images with Gemini (text) and Nano (image) APIs.

```bash
python data_preprocess/gen/customization.py
python data_preprocess/gen/illustration.py
python data_preprocess/gen/spatial.py
python data_preprocess/gen/temporal.py
```

Stage 3 — Filter: score generated samples and keep those above a quality threshold.

```bash
python data_preprocess/filter/customization.py
python data_preprocess/filter/illustration.py
python data_preprocess/filter/spatial.py
python data_preprocess/filter/temporal.py
```

Each base model has its own training framework. Please follow the respective official installation guide before proceeding.
| Model | Framework | Installation |
|---|---|---|
| Bagel | ByteDance-Seed/Bagel | Follow the official repo's environment setup |
| OmniGen2 | VectorSpaceLab/OmniGen2 | Follow the official repo's environment setup |
| Qwen-Image-Edit | modelscope/DiffSynth-Studio | Install DiffSynth-Studio, then additionally install DeepSpeed: |
```bash
# Qwen-Image-Edit: after installing DiffSynth-Studio
pip install deepspeed
```

Download the base model checkpoints:

```bash
huggingface-cli download ByteDance-Seed/BAGEL-7B-MoT --local-dir ckpts/BAGEL-7B-MoT
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ckpts/Qwen2.5-7B-Instruct
huggingface-cli download OmniGen2/OmniGen2 --local-dir ckpts/OmniGen2
huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir ckpts/FLUX.1-dev
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ckpts/Qwen2.5-VL-3B-Instruct
huggingface-cli download Qwen/Qwen-Image-Edit-2511 --local-dir ckpts/Qwen-Image-Edit-2511
```

The `data/filter/` directory from Section 2 is used directly. Each `{task}/train/{category}/` folder is picked up automatically by the training configuration scripts.
We do not provide T2I (text-to-image) pretraining data. Supply your own JSONL file:
```json
{"task_type": "t2i", "instruction": "A photo of …", "output_image": "path/to/image.jpg"}
```

Point to it in the model's `config.yaml`:

```yaml
global:
  t2i_data_path: data/raw/t2i_example/t2i_example.jsonl  # ← replace with your path
  default_t2i_ratio: 0.2   # fraction of T2I samples in a mixed batch
```

An illustrative placeholder is provided at `data/raw/t2i_example/t2i_example.jsonl`.
Set `use_t2i: false` in an experiment to skip T2I mixing entirely.
Each model uses a two-file workflow:
- `{model}/config.yaml` — high-level experiment settings
- `{model}/process_config.py` — reads the config, preprocesses data, and writes a self-contained `run.sh` plus model-specific YAML files into `{model}/exps/{exp_name}/`
Key fields in config.yaml:
```yaml
global:
  multiref_data_root: data/filter   # downloaded filter data
  t2i_data_path: data/raw/t2i_example/t2i_example.jsonl
  default_learning_rate: 2e-5       # default; overridable per experiment
  default_num_epochs: 10

experiments:
  my_experiment:
    use_t2i: true        # mix T2I data into training
    t2i_ratio: 0.1       # 10% of batches are T2I
    learning_rate: 2e-5  # optional per-experiment override
    data_config:
      customization:
        "1-3": {data_num: 1000}
        "4-5": {data_num: 1000}
      spatial: ...
      illustration: ...
      temporal: ...
```

Run `process_config.py`:

```bash
python bagel/process_config.py --exp_name my_experiment
python omnigen/process_config.py --exp_name my_experiment
python qwen/process_config.py --exp_name my_experiment
```

Generated files for each model:
```text
bagel/exps/my_experiment/
├── run.sh
└── dataset_config.yaml

omnigen/exps/my_experiment/
├── run.sh
├── X2I2.yml   # OmniGen2 training config
├── ic.yml     # multi-reference data list
├── t2i.yml    # T2I data list
└── mix.yml    # mixed dataset config

qwen/exps/my_experiment/
├── run.sh
└── metadata/  # per-sample metadata
```

```bash
bash bagel/exps/my_experiment/run.sh
bash omnigen/exps/my_experiment/run.sh
bash qwen/exps/my_experiment/run.sh
```

The scripts use `torchrun` and support multi-node setups via the `AFO_ENV_CLUSTER_SPEC` environment variable. Single-node training works with defaults.
Checkpoints are saved to `{model}/exps/{exp_name}/results/checkpoints/`.
To evaluate a trained checkpoint, add it to `{model}/inference/config.yaml` and follow Section 1.2.
We thank the authors and contributors of the following open-source projects that this work builds upon:
- Bagel — ByteDance Seed's unified multimodal generation framework, which serves as one of our base models.
- OmniGen2 — VectorSpaceLab's versatile image generation model, which serves as one of our base models.
- Qwen-Image-Edit / DiffSynth-Studio — ModelScope's DiffSynth-Studio framework and the Qwen-Image-Edit model, which together serve as our third base model and its training framework.
If you find this work useful, please cite:
```bibtex
@article{chen2026macroadvancingmultireferenceimage,
  title   = {MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data},
  author  = {Zhekai Chen and Yuqing Wang and Manyuan Zhang and Xihui Liu},
  journal = {arXiv preprint arXiv:2603.25319},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.25319},
}
```
