How to Reach Us:
- Code Issues: Please open an issue in our GitHub repository for any problems or bugs.
- General Inquiries: Contact Dian Zheng at zhengd35 [at] mail2 [dot] sysu [dot] edu [dot] cn.
This repository contains the implementation of the following paper.
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li+
- [11/2025] 🔥 We released the training, inference, and evaluation code, along with the AIA checkpoint! 🔥
Overview of AIA. While the community has primarily focused on data quality, data mixture ratios, and architectural decoupling strategies for unified multimodal models, we are the first to analyze the underlying mechanisms and explore how to narrow the gap between purely unified architectures and BAGEL-like ones. We find that architectural decoupling does not fundamentally resolve the conflicts between generation and understanding tasks, but rather drives the multimodal interaction patterns closer to those of single-task models. Based on this insight, we propose Attention Interaction Alignment (AIA), a method that explicitly constrains interaction patterns during training without requiring architectural decoupling. Our approach improves performance on both Emu3 and Janus-Pro, demonstrating its effectiveness in alleviating task conflicts.
0. Environment Preparation
First, clone our repo and prepare the Python environment. We recommend Python>=3.10.
git clone https://github.com/zhengdian1/AIA.git
cd AIA
conda create -n janus-pro-aia python=3.11
conda activate janus-pro-aia
pip install -r requirements.txt
1. Training Configuration
Before training, prepare a configuration file. We provide an example for reference: configs/t2i_generation.yml. This YAML configuration file defines the training settings for SFT. It includes sections for general training setup, optimization strategies, model paths, and data loading.
To run the training code, you need to specify the following parameters:
- output_path: Path to save model checkpoints and outputs.
- pre_path: Path to the checkpoint from which to resume training.
- log_path: Path to store training logs.
- model_path: Path to the pretrained model.
- processor_path: Path to the processor.
- und_data_path: Path to understanding training data.
- gen_data_path: Path to generation training data.
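As a rough pre-flight sketch, the parameters listed above can be checked before a run is launched. The helper names below (`collect_keys`, `find_missing_keys`) and the section names in the example are hypothetical and not part of this repository. Because the real layout of configs/t2i_generation.yml is not shown here, the check searches for each key at any nesting depth. It works on a plain dictionary, such as one returned by a YAML loader.

```python
# Hypothetical pre-flight check; not part of the AIA codebase.
REQUIRED_KEYS = (
    "output_path", "pre_path", "log_path", "model_path",
    "processor_path", "und_data_path", "gen_data_path",
)


def collect_keys(node) -> set:
    """Gather every dictionary key found at any depth of a nested config."""
    keys = set()
    if isinstance(node, dict):
        for key, value in node.items():
            keys.add(key)
            keys |= collect_keys(value)
    return keys


def find_missing_keys(config: dict) -> list:
    """Return the required keys that do not appear anywhere in `config`."""
    present = collect_keys(config)
    return [key for key in REQUIRED_KEYS if key not in present]


# Section names ("train", "model", "data") and all paths are placeholders.
example_config = {
    "train": {
        "output_path": "outputs/run1",
        "pre_path": "outputs/run0",
        "log_path": "logs/run1",
    },
    "model": {
        "model_path": "checkpoints/base",
        "processor_path": "checkpoints/base",
    },
    "data": {
        "und_data_path": "data/understanding",
        "gen_data_path": "data/generation",
    },
}

missing = find_missing_keys(example_config)
print("missing keys:", missing)
```

An empty list means every listed parameter appears somewhere in the configuration. The check does not verify that the paths exist on disk.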
2. Prepare Training Data
We provide an example data sample to clarify the required format for training data.
Specifically, for text-to-image, each data sample should follow the format below:
{
"conversations": [{"from": "human", "value": "a photo of a cat"}, {"from": "gpt", "value": "<image>"}],
"image": "path"
}
For image understanding, each data sample should follow the format below:
{
"image": "path",
"conversations": [{"from": "human", "value": "Is there a cat in the image? Please answer yes or no."}, {"from": "gpt", "value": "yes"}]
}
3. Training
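Before launching a run, it can help to check a few samples against the two formats from Section 2. The sketch below is a hypothetical validator, not code from this repository. The function name `check_sample` and the task labels "generation" and "understanding" are assumptions made for illustration. It checks single samples only and assumes nothing about how samples are stored on disk.

```python
# Hypothetical sample validator; not part of the AIA codebase.
def check_sample(sample: dict, task: str) -> list:
    """Return a list of problems found in one sample; empty means it looks fine."""
    problems = []
    if not isinstance(sample.get("image"), str):
        problems.append("'image' must be a path string")

    turns = sample.get("conversations")
    well_formed = (
        isinstance(turns, list)
        and len(turns) >= 2
        and all(isinstance(turn, dict) for turn in turns)
    )
    if not well_formed:
        problems.append("'conversations' must hold a human turn and a gpt turn")
        return problems

    if turns[0].get("from") != "human" or turns[1].get("from") != "gpt":
        problems.append("turns must be ordered human, then gpt")
    # In the text-to-image format, the gpt turn is the "<image>" placeholder.
    if task == "generation" and turns[1].get("value") != "<image>":
        problems.append("for text-to-image, the gpt value must be '<image>'")
    return problems


# The two samples from Section 2.
gen_sample = {
    "conversations": [
        {"from": "human", "value": "a photo of a cat"},
        {"from": "gpt", "value": "<image>"},
    ],
    "image": "path",
}
und_sample = {
    "image": "path",
    "conversations": [
        {"from": "human",
         "value": "Is there a cat in the image? Please answer yes or no."},
        {"from": "gpt", "value": "yes"},
    ],
}

print(check_sample(gen_sample, "generation"))
print(check_sample(und_sample, "understanding"))
```

Both calls return an empty list for the samples shown in Section 2.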
Next, we provide two types of training scripts. Choose the one that suits your setup.
If you train on a single node, use the script below:
python launch.py --args_yml_fn configs/t2i_generation.yml
If you train on multiple nodes, use the script below:
bash run.sh
Inference
First, download our checkpoint.
Then, use our inference code either to produce results or to plot our proposed cross-modal interaction patterns.
If you want to output the result, use the scripts below:
python generation_inference.py --ckpt_path your_path --prompt 'A cute cat.'
python interactivechat.py --ckpt_path your_path --prompt "Describe the image in detail" --image_path /path/to/image.jpg
If you want to see the cross-modal interaction pattern plot, use the scripts below:
python gen_plot.py --ckpt_path your_path --prompt 'A cute cat.'
python und_plot.py --ckpt_path your_path --prompt "Describe the image in detail" --image_path /path/to/image.jpg
Evaluation
We provide the evaluation code on widely used visual understanding and generation benchmarks below.
For visual understanding evaluation (MMMU, MME, MMVP, MMBench, POPE, MMVet), first download the corresponding datasets from data, then run the scripts below:
cd evaluation
bash scripts/eval/evaluate.sh
Note that for MMBench and MMVet, you need to submit the results to the official websites: MMBench, MMVet.
For visual generation evaluation (DPG, GenEval), please refer to UlmEvalKit, which is more efficient. Additionally, we use the long text prompts for GenEval from BAGEL.
If you find our repo useful for your research, please consider citing our paper:
@article{zheng2025architecture,
title={Architecture Decoupling Is Not All You Need For Unified Multimodal Model},
author={Zheng, Dian and Zhang, Manyuan and Li, Hongyu and Zou, Kai and Liu, Hongbo and Guo, Ziyu and Feng, Kaituo and Liu, Yexin and Luo, Ying and Feng, Yan and Pei, Peng and Cai, Xunliang and Li, Hongsheng},
journal={arXiv preprint arXiv:2511.22663},
year={2025}
}
This project wouldn't be possible without the following open-sourced repositories: Janus-Pro, Janus-Pro-R1, BAGEL, Emu3.
Our related projects: Uni-MMMU, VBench-2.0
@article{zou2025uni,
title={Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark},
author={Zou, Kai and Huang, Ziqi and Dong, Yuhao and Tian, Shulin and Zheng, Dian and Liu, Hongbo and He, Jingwen and Liu, Bin and Qiao, Yu and Liu, Ziwei},
journal={arXiv preprint arXiv:2510.13759},
year={2025}
}
@article{zheng2025vbench2,
title={{VBench-2.0}: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness},
author={Zheng, Dian and Huang, Ziqi and Liu, Hongbo and Zou, Kai and He, Yinan and Zhang, Fan and Zhang, Yuanhan and He, Jingwen and Zheng, Wei-Shi and Qiao, Yu and Liu, Ziwei},
journal={arXiv preprint arXiv:2503.21755},
year={2025}
}