Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

🤗 Hugging Face | 📄 Paper
Yichen Zhang1*, Da Peng2*, Zonghao Guo1†, Zijian Zhang3, Xuesong Yang3,
Tong Sun3, Shichu Sun3, Yidan Zhang3, Yanghao Li1, Haiyan Zhao1, Wang Xu1,
Qi Shi1, Yangang Sun1, Chi Chen1, Shuo Wang1, Yukun Yan1, Xu Han1,
Qiang Ma1, Wei Ke2, Liang Wang3, Zhiyuan Liu1, Maosong Sun1

1Tsinghua University, 2Xi'an Jiaotong University, 3University of Chinese Academy of Sciences

\* Equal contribution   † Corresponding author
- [2026/03/25] 🔥 Our training framework now supports image-editing training. Please refer to format.jsonl for the data organization.
- [2026/03/19] 🚀 A demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
- [2026/03/16] 📢 The Cheers paper is officially released.
- [2026/03/16] 🚀 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
- [2026/03/16] 📦 The model checkpoints of Cheers are now available.
A recent cutting-edge direction in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them difficult to optimize jointly in a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, stabilizing semantics for multimodal understanding while improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation, and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4× token compression) unified multimodal modeling.
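To make the decoupling idea concrete, here is a toy sketch of a semantically gated detail residual combined with 4× token compression. All module names, dimensions, and the pooling scheme are hypothetical illustrations, not the released implementation: a gate computed from the decoded semantics decides, per channel, how much patch-level detail is re-injected.

```python
import torch
import torch.nn as nn

class GatedDetailResidual(nn.Module):
    """Toy sketch (hypothetical design): inject detail features into
    decoded semantics through a gate conditioned on the semantics."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, semantics: torch.Tensor, details: torch.Tensor) -> torch.Tensor:
        g = self.gate(semantics)  # gate values in (0, 1), per channel
        return semantics + g * self.proj(details)

# 4x token compression: pool every 4 patch tokens into one semantic token
patches = torch.randn(1, 256, 64)                  # (batch, patch tokens, dim)
semantic = patches.view(1, 64, 4, 64).mean(dim=2)  # (batch, 64 tokens, dim)

details = torch.randn(1, 64, 64)  # stand-in for tokenizer detail features
refined = GatedDetailResidual(64)(semantic, details)
print(refined.shape)  # torch.Size([1, 64, 64])
```

The gate lets the model suppress detail injection where it would disturb the semantics and pass it through where high-frequency content is needed.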
```shell
conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt
```

```python
import os
import torch
from PIL import Image
from torchvision.utils import save_image
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model.to(device)
model = model.to(torch.bfloat16)
model.eval()
```

1️⃣ Text-to-image generation

```python
content = "Your instruction."
images_batch = [None]
```

2️⃣ Image understanding

```python
content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img]
```

3️⃣ Text-only question answering

```python
content = "Your instruction."
images_batch = [None]
```

Then run the following code:

```python
messages_batch = [[{"role": "user", "content": content}]]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

gen_config = {
    "max_length": 300,
    "cfg_scale": 9.5,           # if generation
    "temperature": 0.0,
    "num_inference_steps": 80,  # if generation
    "alpha": 0.5,               # if generation
    "edit_image": False,        # if generation
}
inputs.update(gen_config)

generated = model.generate(**inputs)
input_ids = generated["input_ids"]

# If generation:
images = generated["images"][0]
current_img = images[0].clamp(0.0, 1.0)
save_image(current_img, "outputs/case_.png")

# If understanding or text-only QA:
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True))
```

Alternatively, you can directly run the code in Inference/ for a quick demo.
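For context, the cfg_scale option above corresponds to standard classifier-free guidance, where an unconditional prediction is blended with the conditional one to strengthen prompt adherence. A minimal numeric sketch of that blend (apply_cfg is an illustrative helper, not part of the Cheers API):

```python
import torch

def apply_cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by `scale`.
    return pred_uncond + scale * (pred_cond - pred_uncond)

pred_cond = torch.tensor([1.0, 2.0])
pred_uncond = torch.tensor([0.5, 1.0])
out = apply_cfg(pred_cond, pred_uncond, 9.5)
print(out)
```

A scale of 1.0 recovers the conditional prediction unchanged; larger values (such as the 9.5 default above) push the output further toward the prompt-conditioned direction.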
Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to set up the environment quickly. Then you can run the following scripts:
```shell
bash train_align.sh   # for alignment
```

or

```shell
bash train_sft.sh     # for training all parameters except the VAE
```

Notably, the training data format can follow the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
After training, please replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation.
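The replacement above is a single copy. A sketch of the step, using placeholder files so it is self-contained (demo_out and demo_repo are hypothetical stand-ins for your actual training output directory and the repo's Training/cheers_config/ directory):

```shell
set -e
# Demo setup with placeholder configs (replace with your real paths)
mkdir -p demo_out demo_repo
echo '{"from":"training"}' > demo_out/config.json
echo '{"from":"eval"}' > demo_repo/config.json

cp demo_out/config.json demo_out/config.json.bak   # keep a backup of the generated config
cp demo_repo/config.json demo_out/config.json      # overwrite with the evaluation config
cat demo_out/config.json
```

Keeping a backup of the training-generated config.json makes it easy to revert if you later resume or re-export the checkpoint.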
Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:
```shell
torchrun --master_port=19507 --nproc-per-node=8 run.py --data \
    MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
    MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST \
    AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
    --model Cheers --verbose
```

Similarly, you can directly run the following script to perform the evaluation:

```shell
bash eval.sh
```

Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.
Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in models. Then you can run the following scripts:
```shell
bash generation/run_short.sh
```

or

```shell
bash generation/run_long.sh   # for rewritten prompts
```

Use the following script to get the final score:

```shell
bash calculate.sh
```

Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mplug model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:
```shell
bash Evaluation_DPGBench/scripts/dpg_gen.sh
```

Then:

```shell
bash Evaluation_DPGBench/scripts/dpg_eval.sh   # remember to replace the image folder
```

- Release the Inference Scripts and Checkpoints
- Release the Training Scripts using the VeOmni framework
- Release the Evaluation Scripts
- Release the Training Data
- Release Cheers v1.1 β maintaining strong understanding performance while further improving generation quality
This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.
For any questions or collaborations, feel free to contact us : )
📧 guozonghao96@outlook.com | 📧 yichen0zhang@gmail.com | 📧 MetaPDa@gmail.com
If you find Cheers useful, please cite the Cheers technical report using this BibTeX:
```bibtex
@article{zhang2026cheers,
  title={Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation},
  author={Zhang, Yichen and Peng, Da and Guo, Zonghao and Zhang, Zijian and Yang, Xuesong and Sun, Tong and Sun, Shichu and Zhang, Yidan and Li, Yanghao and Zhao, Haiyan and others},
  journal={arXiv preprint arXiv:2603.12793},
  year={2026}
}
```
