VODiff: Controlling Object Visibility Order in Text-to-Image Generation (CVPR 25)

Code and project page for our paper "VODiff: Controlling Object Visibility Order in Text-to-Image Generation", accepted to CVPR 2025.

✨ Overview

VODiff is a training-free framework that introduces object visibility order as a new controllable dimension in layout-conditioned text-to-image (T2I) generation.

Compared with previous methods, which cannot explicitly control object occlusion, VODiff enables accurate generation of complex scenes with user-defined spatial and occlusion relationships via two core designs:

Sequential Denoising Process (SDP): Synthesizes objects in layers, bottom to top, according to visibility order.
Visibility-Order-Aware (VOA) Loss: Optimizes cross-attention maps to enforce correct spatial and occlusion constraints.

🛠️ Environment Setup

# Create a new conda environment
conda create --name vodiff python=3.9
conda activate vodiff

# Install dependencies
pip install -r requirements.txt

📦 Pretrained Models

Download the pretrained model (e.g., GLIGEN) and place it in the checkpoints directory.

🚀 Inference

To run inference, open the provided Jupyter notebook (inference.ipynb) and modify the following variables to define your own inputs:

caption = 'A car and a bike in front of a house.'
names_list = ['house', 'car', 'bike']  # Ordered by visibility (back to front)
layout = [(66, 197, 452, 390), (326, 358, 402, 432), (111, 347, 216, 431)]  # Corresponding bounding boxes

Order names_list by visibility, from background to foreground, and list the bounding boxes in layout in the same order.
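Before running the notebook, it can help to check which occlusion relationships a given ordering implies. The sketch below is not part of the VODiff codebase: the helper names `boxes_overlap` and `occlusion_pairs` are hypothetical, and it assumes each box is given as (x_min, y_min, x_max, y_max) in pixel coordinates, which the README does not state explicitly.

```python
from typing import List, Tuple

# Assumption: boxes are (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def occlusion_pairs(names: List[str], layout: List[Box]) -> List[Tuple[str, str]]:
    """List (front, back) pairs implied by a back-to-front ordering.

    An object later in the list is treated as being in front of every
    earlier object whose box it overlaps.
    """
    if len(names) != len(layout):
        raise ValueError("names_list and layout must have the same length")
    pairs = []
    for back in range(len(names)):
        for front in range(back + 1, len(names)):
            if boxes_overlap(layout[back], layout[front]):
                pairs.append((names[front], names[back]))
    return pairs


names_list = ['house', 'car', 'bike']  # back to front
layout = [(66, 197, 452, 390), (326, 358, 402, 432), (111, 347, 216, 431)]

for front, back in occlusion_pairs(names_list, layout):
    print(f"{front} occludes {back}")
```

For the example inputs, the car and bike boxes each overlap the house box but not each other, so only the house is partially occluded. Swapping two entries in both lists changes which object ends up in front.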

🙏 Acknowledgments

This project is built upon the following resources:

Attention Refocusing: our codebase builds on this work.

If you have any questions or issues, please feel free to open an issue or contact us.

🪪 License

This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

📚 Citation

If you find VODiff useful in your research, please consider citing us:

@inproceedings{liang2025vodiff,
  title={VODiff: Controlling Object Visibility Order in Text-to-Image Generation},
  author={Liang, Dong and Jia, Jinyuan and Liu, Yuhao and Ke, Zhanghan and Fu, Hongbo and Lau, Rynson WH},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18379--18389},
  year={2025}
}
