
SocioReasoner

arXiv: https://arxiv.org/abs/2601.10477

Official implementation of Urban Socio-Semantic Segmentation with Vision-Language Reasoning.

Abstract: This paper introduces SocioSeg, a new Urban Socio-Semantic Segmentation dataset comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. We also propose SocioReasoner, a novel vision-language reasoning framework that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate significant gains over state-of-the-art models and strong zero-shot generalization.

  • Code: Available in this repository.
  • Dataset and Model: Undergoing review (link coming soon).

1. Installation

  • OS: a Linux distribution with CUDA support
  • Hardware: at least 4x NVIDIA H20 (or A100 80GB) GPUs
  • Framework: this repository is built on ROLL; install it with the commands below.
conda create -n socioseg python=3.10 -y
conda activate socioseg
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
pip install 'transformer-engine[pytorch]==2.2.0' deepspeed==0.16.4 vllm==0.8.4 --no-build-isolation
pip install -e .

2. Training

Download the dataset, then set actor_train.data_args.file_name and validation.data_args.file_name in examples/train/rlvr_megatron.yaml to the downloaded files, as in the sketch below.
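
A minimal sketch of the relevant fields in examples/train/rlvr_megatron.yaml, with the nesting inferred from the dotted key paths above; the file paths are placeholders for wherever you stored the downloaded SocioSeg splits:

actor_train:
  data_args:
    file_name: /path/to/SocioSeg/train.jsonl   # placeholder: training split
validation:
  data_args:
    file_name: /path/to/SocioSeg/val.jsonl     # placeholder: validation split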

sh examples/train/train.sh

The trained model will be saved to ./output/train/checkpoint/.

3. Evaluation and Visualization

Download the dataset and the pretrained model (or use a model you trained yourself), then set actor_train.data_args.file_name and pretrain in examples/infer/rlvr_megatron.yaml accordingly, as in the sketch below.
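
A minimal sketch of the corresponding fields in examples/infer/rlvr_megatron.yaml, again with the nesting inferred from the dotted key paths above; both values are placeholders:

pretrain: ./output/train/checkpoint/           # or the path to the released pretrained model
actor_train:
  data_args:
    file_name: /path/to/SocioSeg/test.jsonl    # placeholder: evaluation split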

sh examples/infer/infer.sh

The evaluation and visualization results will be saved to ./output/infer/result/.

Citation

@article{wang2026socioreasoner,
  title={Urban Socio-Semantic Segmentation with Vision-Language Reasoning}, 
  author={Yu Wang and Yi Wang and Rui Dai and Yujie Wang and Kaikui Liu and Xiangxiang Chu and Yansheng Li},
  journal={arXiv preprint arXiv:2601.10477},
  year={2026}
}

Acknowledgements

We thank the authors of ROLL and SegZero.
