Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

🤗 Hugging Face | 📄 Paper
Yichen Zhang1*, Da Peng2*, Zonghao Guo1†, Zijian Zhang3, Xuesong Yang3,
Tong Sun3, Shichu Sun3, Yidan Zhang3, Yanghao Li1, Haiyan Zhao1, Wang Xu1,
Qi Shi1, Yangang Sun1, Chi Chen1, Shuo Wang1, Yukun Yan1, Xu Han1,
Qiang Ma1, Wei Ke2, Liang Wang3, Zhiyuan Liu1, Maosong Sun1

1Tsinghua University, 2Xi'an Jiaotong University, 3University of Chinese Academy of Sciences

\* Equal contribution   † Corresponding author
- [2026/03/25] 🔥 Our training framework now supports image-editing training. Please refer to format.jsonl for the data organization.
- [2026/03/19] 🚀 A demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
- [2026/03/16] 📢 The Cheers paper is officially released.
- [2026/03/16] 🚀 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
- [2026/03/16] 📦 The model checkpoints of Cheers are now available.
A recent cutting-edge direction in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them difficult to optimize jointly in a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, stabilizing semantics for multimodal understanding while improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation, and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4× token compression) unified multimodal modeling.
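To make the decoupling idea concrete, here is a toy sketch of a semantically gated detail residual combined with 4× token compression. All module names, dimensions, and the pooling scheme are hypothetical illustrations, not the released implementation: a gate computed from the decoded semantics decides, per channel, how much patch-level detail is re-injected.

```python
import torch
import torch.nn as nn

class GatedDetailResidual(nn.Module):
    """Toy sketch (hypothetical design): inject detail features into
    decoded semantics through a gate conditioned on the semantics."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, semantics: torch.Tensor, details: torch.Tensor) -> torch.Tensor:
        g = self.gate(semantics)  # gate values in (0, 1), per channel
        return semantics + g * self.proj(details)

# 4x token compression: pool every 4 patch tokens into one semantic token
patches = torch.randn(1, 256, 64)                  # (batch, patch tokens, dim)
semantic = patches.view(1, 64, 4, 64).mean(dim=2)  # (batch, 64 tokens, dim)

details = torch.randn(1, 64, 64)  # stand-in for tokenizer detail features
refined = GatedDetailResidual(64)(semantic, details)
print(refined.shape)  # torch.Size([1, 64, 64])
```

The gate lets the model suppress detail injection where it would disturb the semantics and pass it through where high-frequency content is needed.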
```shell
conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt
```

```python
import os
import torch
from PIL import Image
from torchvision.utils import save_image
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model.to(device)
model = model.to(torch.bfloat16)
model.eval()
```

1️⃣ Text-to-image generation

```python
content = "Your instruction."
images_batch = [None]
```

2️⃣ Image understanding

```python
content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img]
```

3️⃣ Text-only question answering

```python
content = "Your instruction."
images_batch = [None]
```

Then run the following code:

```python
messages_batch = [[{"role": "user", "content": content}]]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

gen_config = {
    "max_length": 300,
    "cfg_scale": 9.5,           # if generation
    "temperature": 0.0,
    "num_inference_steps": 80,  # if generation
    "alpha": 0.5,               # if generation
    "edit_image": False,        # if generation
}
inputs.update(gen_config)

generated = model.generate(**inputs)
input_ids = generated["input_ids"]

# If generation:
images = generated["images"][0]
current_img = images[0].clamp(0.0, 1.0)
save_image(current_img, "outputs/case_.png")

# If understanding or text-only QA:
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True))
```

Alternatively, you can directly run the code in Inference/ for a quick demo.
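For context, the cfg_scale option above corresponds to standard classifier-free guidance, where an unconditional prediction is blended with the conditional one to strengthen prompt adherence. A minimal numeric sketch of that blend (apply_cfg is an illustrative helper, not part of the Cheers API):

```python
import torch

def apply_cfg(pred_cond: torch.Tensor, pred_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by `scale`.
    return pred_uncond + scale * (pred_cond - pred_uncond)

pred_cond = torch.tensor([1.0, 2.0])
pred_uncond = torch.tensor([0.5, 1.0])
out = apply_cfg(pred_cond, pred_uncond, 9.5)
print(out)
```

A scale of 1.0 recovers the conditional prediction unchanged; larger values (such as the 9.5 default above) push the output further toward the prompt-conditioned direction.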
Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to set up the environment quickly. Then you can run the following scripts:
```shell
bash train_align.sh   # for alignment
```

or

```shell
bash train_sft.sh     # for training all parameters except the VAE
```

Notably, the training data format can follow the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
After training, please replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation.
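The replacement above is a single copy. A sketch of the step, using placeholder files so it is self-contained (demo_out and demo_repo are hypothetical stand-ins for your actual training output directory and the repo's Training/cheers_config/ directory):

```shell
set -e
# Demo setup with placeholder configs (replace with your real paths)
mkdir -p demo_out demo_repo
echo '{"from":"training"}' > demo_out/config.json
echo '{"from":"eval"}' > demo_repo/config.json

cp demo_out/config.json demo_out/config.json.bak   # keep a backup of the generated config
cp demo_repo/config.json demo_out/config.json      # overwrite with the evaluation config
cat demo_out/config.json
```

Keeping a backup of the training-generated config.json makes it easy to revert if you later resume or re-export the checkpoint.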
Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:
```shell
torchrun --master_port=19507 --nproc-per-node=8 run.py --data \
    MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
    MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST \
    AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
    --model Cheers --verbose
```

Similarly, you can directly run the following script to perform the evaluation:

```shell
bash eval.sh
```

Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.
Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in models. Then you can run the following scripts:
```shell
bash generation/run_short.sh
```

or

```shell
bash generation/run_long.sh   # for rewritten prompts
```

Use the following script to get the final score:

```shell
bash calculate.sh
```

Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mplug model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:
```shell
bash Evaluation_DPGBench/scripts/dpg_gen.sh
```

Then:

```shell
bash Evaluation_DPGBench/scripts/dpg_eval.sh   # remember to replace the image folder
```

- Release the Inference Scripts and Checkpoints
- Release the Training Scripts using the VeOmni framework
- Release the Evaluation Scripts
- Release the Training Data
- Release Cheers v1.1 β maintaining strong understanding performance while further improving generation quality
This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.
For any questions or collaborations, feel free to contact us : )
📧 guozonghao96@outlook.com | 📧 yichen0zhang@gmail.com | 📧 MetaPDa@gmail.com
If you find Cheers useful, please cite the Cheers technical report using this BibTeX:
```bibtex
@article{zhang2026cheers,
  title={Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation},
  author={Zhang, Yichen and Peng, Da and Guo, Zonghao and Zhang, Zijian and Yang, Xuesong and Sun, Tong and Sun, Shichu and Zhang, Yidan and Li, Yanghao and Zhao, Haiyan and others},
  journal={arXiv preprint arXiv:2603.12793},
  year={2026}
}
```
