Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [NeurIPS 2025]

This is the official implementation of Seg4Diff.

by Chaehyun Kim1, Heeseong Shin1, Eunbeen Hong1, Heeji Yoon1, Anurag Arnab, Paul Hongsuck Seo2, Sunghwan Hong3,†, Seungryong Kim1,†

1 KAIST AI, 2 Korea University, 3 ETH Zürich
† Co-corresponding authors

Introduction

Seg4Diff is a systematic framework that analyzes and enhances the emergent semantic grounding capabilities of multi-modal diffusion transformers (MM-DiTs). We discover that specific intermediate layers, which we term semantic grounding expert layers, naturally produce high-quality zero-shot segmentation masks by aligning text tokens with corresponding image regions. Building on this insight, we introduce a lightweight LoRA fine-tuning method, Mask Alignment for Segmentation and Generation (MAGNET), to further refine this alignment, simultaneously improving both open-vocabulary segmentation and text-to-image generation quality.
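The core observation, that an expert layer's text-to-image attention already behaves like a segmenter, can be illustrated with a minimal sketch (this is not the repo's code; shapes and names are assumptions): each image patch is assigned to the text token it attends to most strongly.

```python
import torch

def attention_to_mask(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Turn a text-to-image attention map from one layer into a segmentation mask.

    attn: [H*W, T] raw attention scores, one row per image patch over T text tokens.
    Returns an [h, w] grid of text-token indices (one pseudo-class per patch).
    """
    probs = attn.softmax(dim=-1)   # normalize each patch's scores over text tokens
    mask = probs.argmax(dim=-1)    # pick the dominant text token per patch
    return mask.view(h, w)         # reshape the flat token sequence to a spatial grid

# Dummy example: a 4x4 patch grid attending over 3 text tokens
attn = torch.randn(16, 3)
mask = attention_to_mask(attn, 4, 4)
print(mask.shape)  # torch.Size([4, 4])
```

In practice the paper identifies which intermediate MM-DiT layers produce attention maps clean enough for this readout to work zero-shot.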

For further details and visualization results, please check out our paper and our project page.

Installation

Please follow the installation instructions.

Data Preparation

We largely follow PixelCLIP's dataset preparation procedure. Please refer to dataset preparation.

Zero-shot Evaluation

eval_*.sh evaluates the model following our evaluation protocol, using pre-trained Stable Diffusion 3 (SD3) by default. To use a different backbone such as Stable Diffusion 3.5 or Flux.1-dev, add the corresponding options. The weights are downloaded automatically on first execution.

Evaluation script

sh eval_*.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

# Open-vocabulary semantic segmentation
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss 
# Unsupervised segmentation
sh eval_unsup.sh ./configs/eval_unsup.yaml 1 ./output/unsup

Training

run.py trains the model with the default configuration and evaluates it after training; it is launched via run.sh. We provide generated captions for 10k sampled subsets of SA-1B and COCO, which can be found in datasets/captions.
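The mask-alignment idea behind MAGNET can be sketched as a supervision signal that pushes each image token's attention toward its ground-truth class token. This is an illustrative toy loss under assumed shapes, not the repository's actual objective:

```python
import torch
import torch.nn.functional as F

def mask_alignment_loss(attn: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Toy alignment loss between attention and ground-truth masks.

    attn: [N, T] raw attention scores for N image tokens over T text tokens.
    gt:   [N] index of the ground-truth text token (class) for each image token.
    Cross-entropy encourages each patch to attend to its own class token.
    """
    return F.cross_entropy(attn, gt)

# Dummy example: 8 image tokens, 3 text (class) tokens
loss = mask_alignment_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```

In the actual method, only lightweight LoRA adapters are updated against such a signal, so the base diffusion weights stay frozen.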

To train or evaluate the model in different environments, modify the given shell script and config files accordingly.

Training script

sh run.sh [CONFIG] [NUM_GPUS] [OUTPUT_DIR] [OPTS]

# With SA-1B masks and captions
sh run.sh configs/train_sa1b.yaml 2 output/
# With COCO masks and captions
sh run.sh configs/train_coco.yaml 2 output/

Pretrained weights

We provide SA-1B-trained and COCO-trained LoRA weights for Stable Diffusion 3 (SD3). Download the .pth model weights from the Hugging Face Hub to your desired directory:

from huggingface_hub import hf_hub_download

# Download COCO-trained LoRA weights
LORA_PATH_COCO = hf_hub_download(
    repo_id="chyun/seg4diff-coco-lora",
    filename="lora_weights.pth",
    cache_dir="/path/to/save/coco",
)
print("Downloaded to: ", LORA_PATH_COCO)

# Download SA-1B-trained LoRA weights
LORA_PATH_SA1B = hf_hub_download(
    repo_id="chyun/seg4diff-coco-lora",
    filename="lora_weights.pth",
    cache_dir="/path/to/save/sa1b",
)
print("Downloaded to: ", LORA_PATH_SA1B)

To run the evaluation script with LoRA weights, pass the path to the downloaded weights (the LORA_PATH_* values above) via MODEL.WEIGHTS:

# Evaluate COCO-trained model
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss MODEL.WEIGHTS LORA_PATH_COCO

# Evaluate SA1B-trained model
sh eval_ovss.sh ./configs/eval_ovss.yaml 1 ./output/ovss MODEL.WEIGHTS LORA_PATH_SA1B

Citing Seg4Diff

@misc{kim2025seg4diffunveilingopenvocabularysegmentation,
      title={Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers}, 
      author={Chaehyun Kim and Heeseong Shin and Eunbeen Hong and Heeji Yoon and Anurag Arnab and Paul Hongsuck Seo and Sunghwan Hong and Seungryong Kim},
      year={2025},
      eprint={2509.18096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.18096}, 
}
