AI9Stars/Cheers

Cheers : Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

🤗 Hugging Face | 📑 Paper

Yichen Zhang1*, Da Peng2*, Zonghao Guo1†, Zijian Zhang3, Xuesong Yang3,

Tong Sun3, Shichu Sun3, Yidan Zhang3, Yanghao Li1, Haiyan Zhao1, Wang Xu1,

Qi Shi1, Yangang Sun1, Chi Chen1, Shuo Wang1, Yukun Yan1, Xu Han1,

Qiang Ma1, Wei Ke2, Liang Wang3, Zhiyuan Liu1, Maosong Sun1

1Tsinghua University, 2Xi'an Jiaotong University, 3University of Chinese Academy of Sciences

* Equal contribution † Corresponding author

🔥 News

  • [2026/03/25] 🔥 Our training framework supports image editing training. Please refer to format.jsonl for data organization.
  • [2026/03/19] 🎉 Demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
  • [2026/03/16] 📢 The Cheers paper is officially released.
  • [2026/03/16] 🛠 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
  • [2026/03/16] 📦 The model checkpoints of Cheers are now available.

🌟 What is Cheers?

A central goal in recent multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making them non-trivial to optimize jointly in a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning; (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation; and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench while requiring only 20% of the training cost, indicating effective and efficient unified multimodal modeling (with 4× token compression).
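The gated detail-residual idea can be illustrated with a minimal sketch. Note that the module name and the exact gating form below are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class GatedDetailResidual(nn.Module):
    """Hypothetical sketch: add patch-detail features on top of decoded
    semantics through a gate conditioned on the semantics themselves."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, semantics: torch.Tensor, details: torch.Tensor) -> torch.Tensor:
        # The sigmoid gate (values in [0, 1]) decides, per channel, how much
        # high-frequency detail is injected on top of the stable semantics.
        g = self.gate(torch.cat([semantics, details], dim=-1))
        return semantics + g * self.proj(details)

block = GatedDetailResidual(dim=64)
out = block(torch.randn(2, 16, 64), torch.randn(2, 16, 64))
```

The point of the gate is that when it saturates toward zero, the output falls back to the pure semantic features, so detail injection cannot destabilize understanding.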

🧱 Model Architecture
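As background for the cascaded flow matching head, flow-matching sampling integrates a learned velocity field from noise (t=0) toward data (t=1). The following toy sketch uses a placeholder velocity field, not the actual head:

```python
import torch

def euler_flow_sample(velocity_fn, x0: torch.Tensor, num_steps: int = 80) -> torch.Tensor:
    """Toy flow-matching sampler: Euler-integrate a velocity field from
    t=0 (noise) to t=1 (data). `velocity_fn` is a stand-in for a learned
    model; the signature is an assumption for illustration."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x

# Illustration only: a velocity field that contracts samples toward zero.
out = euler_flow_sample(lambda x, t: -x, torch.ones(2, 4), num_steps=10)
```

In Cheers, per the abstract, such a head first decodes visual semantics and then refines them with gated detail residuals; the inference config's `num_inference_steps` controls the number of integration steps.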

🚀 Quick Start

Set up a new virtual environment

conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt

Inference

import torch
from PIL import Image
from torchvision.utils import save_image
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model = model.to(device).to(torch.bfloat16)
model.eval()

1️⃣ Text-to-image generation

content = "Your instruction."
images_batch = [None]

2️⃣ Image Understanding

content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img,]

3️⃣ Text-only Question Answering

content = "Your instruction."
images_batch = [None]

Then run the following code:

messages_batch = [[{"role": "user", "content": content} ],]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

gen_config = {
    "max_length": 300,
    "cfg_scale": 9.5,           # if generation
    "temperature": 0.0,
    "num_inference_steps": 80,  # if generation
    "alpha": 0.5,               # if generation
    "edit_image": False,        # if generation
    }

inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]

images = generated["images"][0] # if generation
current_img = images[0] # if generation
current_img = current_img.clamp(0.0, 1.0) # if generation
save_image(current_img, "outputs/case_.png") # if generation
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)) # if understanding or text-only qa

Alternatively, you can directly run the code in Inference/ for a quick demo.

🔧 Training

Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to set up the training environment quickly. Then run the following scripts:

bash train_align.sh # for alignment

or

bash train_sft.sh # for training all parameters except the VAE.

The training data format should follow the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
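For orientation only, a single JSONL training record might look like the sketch below. The field names here are assumptions; Training/data/format.jsonl remains the authoritative schema:

```python
import json

# Hypothetical record layout -- check Training/data/format.jsonl for the
# real field names before preparing your data.
record = {
    "conversations": [
        {"role": "user", "content": "<image>\nDescribe this picture."},
        {"role": "assistant", "content": "A dog running on a beach at sunset."},
    ],
    "images": ["path/to/input.jpg"],
}
line = json.dumps(record, ensure_ascii=False)  # one record per line in the .jsonl file
```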

After training, please replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation.

📊 Evaluation

Understanding

Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:

torchrun --master_port=19507 --nproc-per-node=8  run.py --data \
    MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
    MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST  \
    AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
    --model Cheers --verbose 

Similarly, you can directly run the following script to perform the evaluation:

bash eval.sh

Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.

GenEval

Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in models. Then you can run the following scripts:

bash generation/run_short.sh

or

bash generation/run_long.sh # for rewritten prompts

Use the following script to compute the final score:

bash calculate.sh

DPGBench

Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mPLUG VQA model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:

bash Evaluation_DPGBench/scripts/dpg_gen.sh

Then:

bash Evaluation_DPGBench/scripts/dpg_eval.sh # Remember to replace the image folder

🧩 To-Do List

  • Release the Inference Scripts and Checkpoints
  • Release the Training Scripts using the VeOmni framework
  • Release the Evaluation Scripts
  • Release the Training Data
  • Release Cheers v1.1 β€” maintaining strong understanding performance while further improving generation quality

πŸ™ Acknowledgement

This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.

📬 Contact

For any questions or collaborations, feel free to contact us : )

📧 guozonghao96@outlook.com | 📧 yichen0zhang@gmail.com | 📧 MetaPDa@gmail.com

📖 Citation

If you find Cheers useful, please cite the Cheers technical report using the following BibTeX entry:

@article{zhang2026cheers,
  title={Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation},
  author={Zhang, Yichen and Peng, Da and Guo, Zonghao and Zhang, Zijian and Yang, Xuesong and Sun, Tong and Sun, Shichu and Zhang, Yidan and Li, Yanghao and Zhao, Haiyan and others},
  journal={arXiv preprint arXiv:2603.12793},
  year={2026}
}
