zhengdian1/AIA

How to Reach Us:

  • Code Issues: Please open an issue in our GitHub repository for any problems or bugs.
  • General Inquiries: Contact Dian Zheng at zhengd35 [at] mail2 [dot] sysu [dot] edu [dot] cn.

AIA Report (arXiv) | Project Page

This repository contains the implementation of the following paper.

Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li+

🔥 Updates

  • [11/2025] 🔥 We released the training, inference, and evaluation code, along with the AIA checkpoint! 🔥

📣 Overview

Overview of AIA. While the community has primarily focused on data quality, data mixture ratios, and architectural decoupling strategies for unified multimodal models, we are the first to analyze the underlying mechanisms and to explore how to narrow the gap between purely unified architectures and BAGEL-like ones. We find that architectural decoupling does not fundamentally resolve the conflicts between generation and understanding tasks; instead, it drives the multimodal interaction patterns closer to those of single-task models. Based on this insight, we propose Attention Interaction Alignment (AIA), a method that explicitly constrains interaction patterns during training without requiring architectural decoupling. Our approach improves performance on both Emu3 and Janus-Pro, demonstrating its effectiveness in alleviating task conflicts.

🔨 Supervised Fine-Tuning

0. Environment Preparation

First, please clone our repo and prepare the Python environment. We recommend using Python>=3.10.

git clone https://github.com/zhengdian1/AIA.git
cd AIA

conda create -n janus-pro-aia python=3.11
conda activate janus-pro-aia
pip install -r requirements.txt

1. Training Configuration

Before starting the training, you need to prepare a configuration file in advance. We provide an example for reference: configs/t2i_generation.yml. This YAML configuration file defines the training settings for SFT. It includes sections for general training setup, optimization strategies, model paths, and data loading.

To run the training code, you need to specify the following parameters:

  • output_path: Path to save model checkpoints and outputs.
  • pre_path: Path to the checkpoint from which training resumes.
  • log_path: Path to store training logs.
  • model_path: Path to the pretrained model.
  • processor_path: Path to the processor.
  • und_data_path: Path to understanding training data.
  • gen_data_path: Path to generation training data.
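Before launching a run, it can help to confirm that all seven parameters above are actually set. The following is a minimal sketch, not part of the repository: the helper names `collect_keys` and `missing_keys` are hypothetical, and because the layout of configs/t2i_generation.yml is not described here, the check assumes only that each parameter appears as a key somewhere in the (possibly nested) config.

```python
from typing import Any

# The seven parameters listed above. Where they sit inside the YAML file
# (top level or nested in sections) is an assumption, so we search everywhere.
REQUIRED_KEYS = [
    "output_path", "pre_path", "log_path", "model_path",
    "processor_path", "und_data_path", "gen_data_path",
]


def collect_keys(node: Any) -> set[str]:
    """Return every mapping key found anywhere in a nested config."""
    found: set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(str(key))
            found |= collect_keys(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_keys(item)
    return found


def missing_keys(config: dict) -> list[str]:
    """List the required parameters that do not appear in the config."""
    present = collect_keys(config)
    return [key for key in REQUIRED_KEYS if key not in present]
```

With PyYAML installed, the config can be loaded with `yaml.safe_load` and passed to `missing_keys`; an empty list means every listed parameter is present. The check does not verify that the paths exist.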

2. Prepare Training Data

We provide an example data sample to clarify the required format for training data.

Specifically, for text-to-image, each data sample should follow the format below:

{
  "conversations": [{"from": "human", "value": "a photo of a cat"}, {"from": "gpt", "value": "<image>"}], 
  "image": "path"
}

For image understanding, each data sample should follow the format below:

{
  "image": "path", 
  "conversations": [{"from": "human", "value": "Is there a cat in the image? Please answer yes or no."}, {"from": "gpt", "value": "yes"}]
}
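The two formats above can be checked programmatically before training. This is a minimal sketch under stated assumptions, not the repository's own loader: `check_sample` and `sample_kind` are hypothetical helpers, the two-turn minimum is inferred only from the examples, and treating a gpt reply of "<image>" as the marker of a text-to-image sample is likewise an inference from the examples.

```python
def check_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one training sample (empty if none)."""
    problems = []
    if not isinstance(sample.get("image"), str):
        problems.append("'image' must be a path string")
    turns = sample.get("conversations")
    # Assumption: every sample has at least one human turn and one gpt turn.
    if not isinstance(turns, list) or len(turns) < 2:
        problems.append("'conversations' must be a list with at least two turns")
        return problems
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict) or not {"from", "value"} <= set(turn):
            problems.append(f"turn {i} must contain 'from' and 'value'")
        elif turn["from"] not in ("human", "gpt"):
            problems.append(f"turn {i} has unexpected speaker {turn['from']!r}")
    return problems


def sample_kind(sample: dict) -> str:
    """Guess the task from the last turn (assumption based on the examples)."""
    last_value = sample["conversations"][-1]["value"]
    return "generation" if last_value == "<image>" else "understanding"
```

Running `check_sample` over every entry and printing the non-empty results gives a quick report of malformed samples. Whether the data file is a single JSON array or one JSON object per line is not specified above, so the loading step is left out.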

3. Training

We provide two training scripts; choose the one that suits your setup.

If you train on a single node, use the script below:

python launch.py --args_yml_fn configs/t2i_generation.yml

If you train on multiple nodes, use the script below:

bash run.sh

🏄 Inference && Evaluation

Inference

First, download our checkpoint.

We provide inference scripts both for producing results and for plotting our proposed cross-modal interaction patterns.

To produce results, use the scripts below:

python generation_inference.py --ckpt_path your_path --prompt 'A cute cat.'

python interactivechat.py --ckpt_path your_path --prompt "Describe the image in detail" --image_path /path/to/image.jpg

To plot the cross-modal interaction patterns, use the scripts below:

python gen_plot.py --ckpt_path your_path --prompt 'A cute cat.'

python und_plot.py --ckpt_path your_path --prompt "Describe the image in detail" --image_path /path/to/image.jpg

Evaluation

We provide the evaluation code on widely used visual understanding and generation benchmarks below.

For visual understanding evaluation (MMMU, MME, MMVP, MMBench, POPE, MMVet), please first download the corresponding datasets from data, then run the scripts below:

cd evaluation
bash scripts/eval/evaluate.sh

Note that for MMBench and MMVet, you need to submit the results to the official websites: MMBench, MMVet.

For visual generation evaluation (DPG, GenEval), please refer to UlmEvalKit, which is more efficient. Additionally, we use the long text prompts for GenEval from BAGEL.

✒️ Citation

If you find our repo useful for your research, please consider citing our paper:

 @article{zheng2025architecture,
   title={Architecture Decoupling Is Not All You Need For Unified Multimodal Model},
   author={Zheng, Dian and Zhang, Manyuan and Li, Hongyu and Zou, Kai and Liu, Hongbo and Guo, Ziyu and Feng, Kaituo and Liu, Yexin and Luo, Ying and Feng, Yan and Pei, Peng and Cai, Xunliang and Li, Hongsheng},
   journal={arXiv preprint arXiv:2511.22663},
   year={2025}
 }

♥️ Acknowledgement

🤗 Open-Sourced Repositories

This project wouldn't be possible without the following open-sourced repositories: Janus-Pro, Janus-Pro-R1, BAGEL, Emu3.

Related Links

Our related projects: Uni-MMMU, VBench-2.0

@article{zou2025uni,
    title={Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark},
    author={Zou, Kai and Huang, Ziqi and Dong, Yuhao and Tian, Shulin and Zheng, Dian and Liu, Hongbo and He, Jingwen and Liu, Bin and Qiao, Yu and Liu, Ziwei},
    journal={arXiv preprint arXiv:2510.13759},
    year={2025}
}

@article{zheng2025vbench2,
    title={{VBench-2.0}: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness},
    author={Zheng, Dian and Huang, Ziqi and Liu, Hongbo and Zou, Kai and He, Yinan and Zhang, Fan and Zhang, Yuanhan and He, Jingwen and Zheng, Wei-Shi and Qiao, Yu and Liu, Ziwei},
    journal={arXiv preprint arXiv:2503.21755},
    year={2025}
}
