Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [NeurIPS 2025]

This is the official implementation of Seg4Diff.

by Chaehyun Kim1, Heeseong Shin1, Eunbeen Hong1, Heeji Yoon1, Anurag Arnab, Paul Hongsuck Seo2, Sunghwan Hong3,†, Seungryong Kim1,†

1 KAIST AI, 2 Korea University, 3 ETH Zürich
† Co-corresponding authors

Introduction

Seg4Diff is a systematic framework that analyzes and enhances the emergent semantic grounding capabilities of multi-modal diffusion transformers (MM-DiTs). We discover that specific intermediate layers, which we term semantic grounding expert layers, naturally produce high-quality zero-shot segmentation masks by aligning text tokens with corresponding image regions. Building on this insight, we introduce a lightweight LoRA fine-tuning method, Mask Alignment for Segmentation and Generation (MAGNET), to further refine this alignment, simultaneously improving both open-vocabulary segmentation and text-to-image generation quality.
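The core observation, that an expert layer's text-to-image attention already behaves like a segmenter, can be illustrated with a minimal sketch (this is not the repo's code; shapes and names are assumptions): each image patch is assigned to the text token it attends to most strongly.

```python
import torch

def attention_to_mask(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Turn a text-to-image attention map from one layer into a segmentation mask.

    attn: [H*W, T] raw attention scores, one row per image patch over T text tokens.
    Returns an [h, w] grid of text-token indices (one pseudo-class per patch).
    """
    probs = attn.softmax(dim=-1)   # normalize each patch's scores over text tokens
    mask = probs.argmax(dim=-1)    # pick the dominant text token per patch
    return mask.view(h, w)         # reshape the flat token sequence to a spatial grid

# Dummy example: a 4x4 patch grid attending over 3 text tokens
attn = torch.randn(16, 3)
mask = attention_to_mask(attn, 4, 4)
print(mask.shape)  # torch.Size([4, 4])
```

In practice the paper identifies which intermediate MM-DiT layers produce attention maps clean enough for this readout to work zero-shot.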

For further details and visualization results, please check out our paper and our project page.

Installation

Please follow the installation instructions.

Data Preparation

We largely follow PixelCLIP's dataset preparation procedure. Please refer to dataset preparation.

Zero-shot Evaluation

eval_*.sh evaluates the model following our evaluation protocol, using pre-trained Stable Diffusion 3 (SD3) by default. To use a different backbone such as Stable Diffusion 3.5 or Flux.1-dev, add the corresponding options. The weights are downloaded automatically on first execution.

Evaluation script

sh eval_*.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

# Open-vocabulary semantic segmentation
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss 
# Unsupervised segmentation
sh eval_unsup.sh ./configs/eval_unsup.yaml 1 ./output/unsup

Training

run.py trains the model with the default configuration and evaluates it after training; it is launched via run.sh. We provide generated captions for 10k sampled subsets of SA-1B and COCO, which can be found in datasets/captions.
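The mask-alignment idea behind MAGNET can be sketched as a supervision signal that pushes each image token's attention toward its ground-truth class token. This is an illustrative toy loss under assumed shapes, not the repository's actual objective:

```python
import torch
import torch.nn.functional as F

def mask_alignment_loss(attn: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Toy alignment loss between attention and ground-truth masks.

    attn: [N, T] raw attention scores for N image tokens over T text tokens.
    gt:   [N] index of the ground-truth text token (class) for each image token.
    Cross-entropy encourages each patch to attend to its own class token.
    """
    return F.cross_entropy(attn, gt)

# Dummy example: 8 image tokens, 3 text (class) tokens
loss = mask_alignment_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```

In the actual method, only lightweight LoRA adapters are updated against such a signal, so the base diffusion weights stay frozen.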

To train or evaluate the model in different environments, modify the given shell script and config files accordingly.

Training script

sh run.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

# With SA-1B masks and captions
sh run.sh configs/train_sa1b.yaml 2 output/
# With COCO masks and captions
sh run.sh configs/train_coco.yaml 2 output/

Pretrained weights

We provide SA-1B-trained and COCO-trained LoRA weights for Stable Diffusion 3 (SD3). Download the .pth model weights from the Hugging Face Hub to your desired directory:

from huggingface_hub import hf_hub_download

# Download COCO-trained LoRA weights
LORA_PATH_COCO = hf_hub_download(
    repo_id="chyun/seg4diff-coco-lora",
    filename="lora_weights.pth",
    cache_dir="/path/to/save/coco",
)
print("Downloaded to: ", LORA_PATH_COCO)

# Download SA-1B-trained LoRA weights
LORA_PATH_SA1B = hf_hub_download(
    repo_id="chyun/seg4diff-coco-lora",
    filename="lora_weights.pth",
    cache_dir="/path/to/save/sa1b",
)
print("Downloaded to: ", LORA_PATH_SA1B)

To run the evaluation script with LoRA weights, pass the path to the downloaded weights (the LORA_PATH_* values above) via MODEL.WEIGHTS:

# Evaluate COCO-trained model
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss MODEL.WEIGHTS LORA_PATH_COCO

# Evaluate SA1B-trained model
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss MODEL.WEIGHTS LORA_PATH_SA1B

Citing Seg4Diff

@misc{kim2025seg4diffunveilingopenvocabularysegmentation,
      title={Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers}, 
      author={Chaehyun Kim and Heeseong Shin and Eunbeen Hong and Heeji Yoon and Anurag Arnab and Paul Hongsuck Seo and Sunghwan Hong and Seungryong Kim},
      year={2025},
      eprint={2509.18096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.18096}, 
}
