HuiZhang0812/CreatiLayout

CreatiLayout


CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
Hui Zhang, Dexiang Hong, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, and Yu-Gang Jiang
Fudan University & ByteDance Inc.

Introduction

CreatiLayout is a layout-to-image framework for Diffusion Transformer models, offering high-quality and fine-grained controllable generation.

LayoutSAM Dataset 📚: A large-scale layout dataset with 2.7 million image-text pairs and 10.7 million entities, featuring fine-grained annotations for open-set entities.

SiamLayout 🌟: A novel layout integration network for MM-DiT that treats the layout as an independent modality with its own set of transformer parameters, so that the layout plays as important a role as the global description in guiding image generation.

Layout Designer 🎨: A layout planner leveraging the power of large language models to convert various user inputs (e.g., center points, masks, scribbles) into standardized layouts.

🔥 News

  • 2025-6-26: CreatiLayout was accepted by ICCV 2025 🎉🎉.
  • 2025-3-10: We release CreatiLayout-FLUX, which empowers FLUX.1-dev for layout-to-image generation and achieves more precise rendering of spatial relationships and attributes.
  • 2025-1-30: We propose CreatiLayout-LoRA, which achieves layout control with fewer additional parameters.

Quick Start

Setup

  1. Environment setup
conda create -n creatilayout python=3.10 -y
conda activate creatilayout
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
  2. Requirements installation
pip install -r requirements.txt
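
To confirm that the installed packages match the versions pinned in the conda command above, a small check along these lines may help. It is only a sketch using the Python standard library; `PINS` and `check_pins` are illustrative names and are not part of this repository.

```python
from importlib import metadata

# Versions pinned by the conda command above (distribution names as pip reports them).
PINS = {"torch": "2.4.1", "torchvision": "0.19.1", "torchaudio": "2.4.1"}


def check_pins(pins):
    """Return {name: (expected, installed_or_None, ok)} for each pinned package."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build tag such as "+cu121" before comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

Run it inside the activated `creatilayout` environment; any line marked `MISMATCH` points at a package to reinstall.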

Usage example

Run the following command to generate an image:

python test_sample.py

Alternatively, try the Gradio demo on HuggingFace.

Dataset

LayoutSAM (HuggingFace)

The LayoutSAM dataset is a large-scale layout dataset derived from the SAM dataset, containing 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a spatial position (i.e., bounding box) and a textual description. Traditional layout datasets often exhibit a closed-set and coarse-grained nature, which may limit the model's ability to generate complex attributes such as color, shape, and texture.
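
As an illustration of the annotation structure described above (one global caption per image, plus a bounding box and a description per entity), a sample could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical and are not the dataset's actual schema, and boxes are assumed to be `(x1, y1, x2, y2)` normalized to `[0, 1]`.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Assumed format: (x1, y1, x2, y2), normalized to [0, 1].
    bbox: tuple
    # Fine-grained, open-set description (e.g. color, shape, texture).
    description: str

    def is_valid(self):
        """Check that the box lies inside the image and has positive area."""
        x1, y1, x2, y2 = self.bbox
        return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


@dataclass
class LayoutSample:
    global_caption: str
    entities: list = field(default_factory=list)

    def to_pixels(self, width, height):
        """Convert every normalized box to integer pixel coordinates."""
        return [
            (round(e.bbox[0] * width), round(e.bbox[1] * height),
             round(e.bbox[2] * width), round(e.bbox[3] * height))
            for e in self.entities
        ]


sample = LayoutSample(
    global_caption="A red vintage car parked beside a stone wall.",
    entities=[Entity((0.1, 0.2, 0.5, 0.8), "a glossy red vintage car")],
)
print(sample.to_pixels(1000, 500))
```

Keeping the description on each entity, rather than a class label, is what allows open-set and attribute-level conditioning.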

LayoutSAM-Eval Benchmark (HuggingFace)

LayoutSAM-Eval is a comprehensive benchmark for evaluating Layout-to-Image (L2I) generation models. It assesses generation quality from two perspectives: region-wise quality (spatial and attribute accuracy) and global quality (visual quality and prompt following). Spatial and attribute adherence are evaluated through visual question answering with a visual language model (VLM), while global image quality is measured with metrics including IR score, Pick score, CLIP score, FID, and IS.

To evaluate a model's layout-to-image generation capabilities with LayoutSAM-Eval, first generate an image for each sample in the benchmark by running the following command:

python test_SiamLayout_sd3_layoutsam_benchmark.py

Then, a visual language model (VLM) answers visual questions about each generated image to assess its adherence to the spatial and attribute specifications. Run the following command:

python score_layoutsam_benchmark.py
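
To make the region-wise scoring concrete, the per-question VLM answers can be aggregated into one accuracy per aspect. The snippet below is a minimal sketch of that idea only: the record format (`aspect`, `answer`) and the `aggregate_scores` helper are assumptions for illustration and do not describe the actual output of `score_layoutsam_benchmark.py`.

```python
from collections import defaultdict


def aggregate_scores(records):
    """Compute the fraction of positive VLM answers per aspect.

    `records` is an iterable of dicts with two keys (hypothetical format):
      - "aspect": e.g. "spatial" or "attribute"
      - "answer": True if the VLM judged the entity to match the specification
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        totals[record["aspect"]] += 1
        passed[record["aspect"]] += bool(record["answer"])
    return {aspect: passed[aspect] / totals[aspect] for aspect in totals}


records = [
    {"aspect": "spatial", "answer": True},
    {"aspect": "spatial", "answer": True},
    {"aspect": "spatial", "answer": False},
    {"aspect": "attribute", "answer": True},
    {"aspect": "attribute", "answer": False},
]
print(aggregate_scores(records))
```

Reporting the two aspects separately shows whether a model places entities correctly but renders their attributes poorly, or the reverse.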

Models

Layout-to-Image generation:

| Model | Base model | Description |
| --- | --- | --- |
| HuggingFace | Stable Diffusion 3 | SiamLayout-SD3 used in the paper |
| HuggingFace | Stable Diffusion 3 | SiamLayout-SD3-LoRA used in the paper |
| HuggingFace | FLUX.1-dev | SiamLayout-FLUX used in the paper |

✒️ Citation

If you find our work useful for your research and applications, please cite it using this BibTeX entry:

@article{zhang2024creatilayout,
  title={CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation},
  author={Zhang, Hui and Hong, Dexiang and Gao, Tingwei and Wang, Yitong and Shao, Jie and Wu, Xinglong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03859},
  year={2024}
}

About

[ICCV 2025] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
