🎉🎉🎉 (2026-04-09) Our work has been selected as a Highlight.
🎉🎉🎉 (2026-02-21) Our work has been accepted to CVPR 2026. Final reviewer scores: 6 (Accept) / 5 (Weak Accept) / 5 (Weak Accept).
S²-Corr introduces a state-space-powered correlation refinement module that stabilizes text–image alignment under domain shift, achieving state-of-the-art performance on both Real-to-Real and Synthetic-to-Real OVDG-SS settings.
- 🧩 State-Space Correlation Aggregation: robust long-range correlation modeling via scan-based state passing.
- 🔍 Open-Vocabulary Semantic Segmentation: compatible with EVA-CLIP text/image encoders.
- 🌍 Domain Generalization: train on CS-7 / GTA-7 → test on ACDC / BDD / Mapillary / ROADWork.
- 🎯 Multiple Category Spaces: supports 7 / 19 / 30 / 41 / 58 classes.
git clone https://github.com/DZhaoXd/s2_corr.git
cd s2_corr
conda create -n S2_Corr python=3.10
conda activate S2_Corr
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
pip install -r requirements.txt
pip install -e .

Cityscapes
Please download leftImg8bit_trainvaltest.zip and gtFine_trainvaltest.zip from here and extract them to data/cityscapes.
GTA5
Please download all GTA5 image and label packages from here and extract them to data/GTA5/GTAV.
ACDC
Please download rgb_anon_trainvaltest.zip and gt_trainval.zip from here and extract them to data/ACDC.
Then restructure the folders from the original condition/split/sequence/ layout into a flat split/ layout (e.g., rgb_anon/train/, gt/train/).
Please download the ACDC inpaint data (1,000 inpainted image pairs generated with a Stable Diffusion model) from here and extract them to data/ACDC/ACDC_inpaint41.
BDD100K
Please download 10K Images and Segmentation from here and extract them to data/BDD/bdd100k.
Please download the BDD100K inpaint data (1,000 inpainted image pairs generated with a Stable Diffusion model) from here and extract them to data/BDD/bdd_inpaint41.
Mapillary
Please download mapillary-vistas-dataset_public_v1.2.zip from here and extract it to data/mapillary.
ROADWork_Data
Please download the two files images.zip (≈ 9.87 GB) and sem_seg_labels.zip (≈ 187 MB) from here and extract them to data/ROADWork_Data.
python tools/convert_datasets_to19/gta.py data/gta
python tools/convert_datasets_to19/cityscapes.py data/cityscapes
python tools/convert_datasets_to19/mapillary.py data/mapillary
python tools/convert_datasets_ovss/prepare_cityscapes_seen_7.py
python tools/convert_datasets_ovss/process_GTA_19_to_7.py
python tools/process_Mapi_65.py
python tools/cp_Mapi_training.py # merge train set and val set
python tools/process_RW_10.py

Folder structure under data/ should look like:
data/
├── GTA5/
│   └── GTAV/
│       ├── images/                    # 24,966
│       ├── labels_7/                  # c-7
│       └── labels_19/                 # c-19
│
├── cityscapes/
│   ├── leftImg8bit/
│   │   └── train/                     # 2,975
│   ├── gtFine_7/
│   │   └── train/                     # c-7
│   └── gtFine_19/
│       └── train/                     # c-19
│
├── BDD/
│   ├── bdd100k/
│   │   ├── images/10k/val/            # 1,000
│   │   └── labels/sem_seg/masks/val/  # c-19
│   └── bdd_inpaint41/
│       ├── images/                    # 1,000
│       └── labels/                    # c-41
│
├── ACDC/
│   ├── rgb_anon/train/                # 1,600
│   ├── gt/train/                      # c-19
│   └── ACDC_inpaint41/
│       ├── images/                    # 1,000
│       └── labels/                    # c-41
│
├── mapillary/
│   ├── val/
│   │   ├── images/                    # 2,000
│   │   └── labels_TrainIds/           # c-19
│   └── OV_30/
│       ├── images/                    # 3,943
│       └── labels/                    # c-30
│
└── ROADWork_Data/
    ├── images/                        # 2,098
    └── gtFine_10/                     # c-10
Download EVA-CLIP weights from:
👉 https://github.com/baaivision/EVA/tree/master/EVA-CLIP
Place under:
Pretrain/
EVA02_CLIP_B_psz16_s8B.pt
EVA02_CLIP_L_336_psz14_s6B.pt
Training script format:
bash run.sh <CONFIG_YAML> <NUM_GPUS> <OUTPUT_DIR>

ViT-B/16:
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/cs7_eva_b16_r512.log 2>&1 &

ViT-L/14:
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs7_catseg_vitl.yaml 1 outputs/cs7_eva_L14_r448 \
    > logs/cs7_eva_L14_r448.log 2>&1 &

ViT-B/16:
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg.yaml 1 outputs/gta_seen7_eva_b16_r512 \
    > logs/gta_seen7_eva_b16_r512.log 2>&1 &

ViT-L/14:
CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/gta5_seen7_catseg_vitl.yaml 1 outputs/gta_seen7_eva_L14_r448 \
    > logs/gta_seen7_eva_L14_r448.log 2>&1 &

CUDA_VISIBLE_DEVICES=0 nohup bash run.sh configs/cs19_catseg_vitl.yaml 1 outputs/cs19_eva_L14_r448 \
    > logs/cs19_eva_L14_r448.log 2>&1 &

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/viz_cs7_eva_b16_r512.log 2>&1 &

CUDA_VISIBLE_DEVICES=0 nohup sh demo/vis_atten.sh configs/cs7_catseg.yaml 1 outputs/cs7_eva_b16_r512 \
    > logs/viz_attention_cs7_eva_b16_r512.log 2>&1 &

@misc{zhao2026OVDG,
title={Open-Vocabulary Domain Generalization in Urban-Scene Segmentation},
author={Dong Zhao and Qi Zang and Nan Pu and Wenjing Li and Nicu Sebe and Zhun Zhong},
year={2026},
eprint={2602.18853},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.18853},
}

This project builds upon CAT-Seg.
