This repository provides end-to-end tooling for Generalized Referring Expression Segmentation on Aerial Photos (submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, J-STARS). The project introduces:
- Aerial-D, a 37,288-image dataset with 1.52M referring expressions covering instances, groups, and semantic regions across 21 categories.
- Automatic data generation, combining rule-based templates with LLM rewriting to produce grounded language at scale while filtering ambiguous references.
- Unified RSRefSeg training, pairing SigLIP2 and SAM with LoRA adapters to learn from Aerial-D alongside RefSegRS, RRSIS-D, NWPU-Refer, and Urban1960SatSeg.
All public artifacts live in the 🤗 Aerial-D collection on Hugging Face:
- luisml77/aerial-d: full dataset release
- luisml77/gemma-aerial-12b: Gemma3 finetuned weights for Step 7
- luisml77/aeriald_o3_500: distilled 500-sample o3 dataset for Gemma3 distillation
- luisml77/rsrefseg: RSRefSeg checkpoints (rsrefseg_aerial-d.pt, rsrefseg_combined.pt)
- datagen/: dataset extraction, rule-driven expression generation, historic filtering, and enhancement utilities.
- rsrefseg/: SigLIP+SAM training/testing, visualizations, and style-transfer experiments.
- llm/: Gemma3 enhancement pipeline, QLoRA fine-tuning, and OpenAI o3 reference scripts.
- docs/: project webpage files.
- tex/: LaTeX source for article and dissertation.
Option 1: Using Conda (recommended, requires Python 3.12)
conda create -n aerial python=3.12
conda activate aerial
pip install -r requirements.txt
Option 2: Using venv
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
You can reproduce Aerial-D locally or download the public release.
Download from Hugging Face
huggingface-cli download luisml77/aerial-d --repo-type dataset --local-dir datagen/dataset
Reproduce Locally - Complete Pipeline
The dataset generation has two main phases:
Phase 1: Rule-based Generation (Steps 1-6)
First, download the source datasets:
cd ~/aerial-d/datagen/pipeline
# Download iSAID dataset (~20GB, train + val + test)
./download_isaid.sh
# Download LoveDA dataset (~2.5GB, train + val)
./download_loveda.sh
Then generate rule-based expressions:
cd ~/aerial-d/datagen
# Generate rule-based expressions (skipping LLM enhancement)
./pipeline/run_pipeline.sh --skip_step7 --clean
# For a small test run (10 images per split)
./pipeline/run_pipeline.sh --skip_step7 --num_images 10 --random_seed 42 --clean
Phase 2: LLM Enhancement (Step 7)
Step 7 requires a Gemma3 model served through vLLM. Either (A) run the OpenAI o3 enhancement and finetune your own Gemma3 model, or (B) download the pre-distilled Gemma3 model:
Option A: Full LLM Pipeline (from scratch)
cd ~/aerial-d/llm
# 1a. Generate high-quality o3 samples (requires OpenAI API key, ~$10.36 for 500 samples)
python o3_enhance.py --dataset_dir ../datagen/dataset
# OR 1b. Download pre-generated o3 samples (skip expensive o3 API calls)
huggingface-cli download luisml77/aeriald_o3_500 --repo-type dataset --local-dir enhanced_annotations_o3_dual
# 2. Finetune Gemma3 on o3 samples using QLoRA
python gemma3_lora_finetune.py \
--enhanced_data_dir enhanced_annotations_o3_dual \
--model_name gemma-aerial-12b \
--output_dir ./gemma-aerial-12b \
--lora_r 64 --lora_alpha 16
# 3. Start vLLM server with finetuned model
vllm serve ./gemma-aerial-12b --port 8000
# 4. Run Step 7 (in another terminal)
cd ~/aerial-d/datagen
python pipeline/7_vllm_enhance.py
Option B: Using Pre-trained Gemma3 Model (faster)
# 1. Download pre-distilled Gemma3 model (run from the repository root, ~/aerial-d, so the weights land in llm/gemma-aerial-12b)
huggingface-cli download luisml77/gemma-aerial-12b --repo-type model --local-dir llm/gemma-aerial-12b
# 2. Start vLLM server with downloaded model
cd ~/aerial-d/llm
vllm serve ./gemma-aerial-12b --port 8000
# 3. Run Step 7 (in another terminal)
cd ~/aerial-d/datagen
python pipeline/7_vllm_enhance.py
Package dataset (optional)
cd ~/aerial-d/datagen
python pipeline/zip_dataset.py --base_dir dataset --zip_path aeriald.zip
The pipeline extracts iSAID/LoveDA patches, assigns rules (3×3 grid, relations, extremes, size cues), generates expressions, filters for uniqueness, and applies optional historic filters. Utilities for viewing and metrics live under datagen/utils/.
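As a rough illustration of the 3×3 grid rule and the uniqueness filter, the sketch below maps each object's bounding-box center to a grid cell, builds a simple templated expression, and drops any expression shared by more than one object. The function names, grid labels, object dictionary layout, and template wording are all assumptions made for this example; they are not taken from the code in datagen/pipeline/.

```python
from collections import Counter

# Hypothetical 3x3 grid labels; the repository's actual wording may differ.
GRID_LABELS = [
    ["top left", "top center", "top right"],
    ["center left", "center", "center right"],
    ["bottom left", "bottom center", "bottom right"],
]


def grid_position(bbox, image_width, image_height):
    """Map a box (x_min, y_min, x_max, y_max) to a 3x3 grid label using its center."""
    x_min, y_min, x_max, y_max = bbox
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Clamp to index 2 so a center on the right or bottom edge stays inside the grid.
    col = min(int(3 * cx / image_width), 2)
    row = min(int(3 * cy / image_height), 2)
    return GRID_LABELS[row][col]


def unique_expressions(objects, image_width, image_height):
    """Build 'the <category> in the <position>' and keep expressions matching exactly one object."""
    candidates = []
    for obj in objects:
        position = grid_position(obj["bbox"], image_width, image_height)
        candidates.append((obj["id"], f"the {obj['category']} in the {position}"))
    counts = Counter(expression for _, expression in candidates)
    return {obj_id: expr for obj_id, expr in candidates if counts[expr] == 1}


# Assumed toy annotations for a 480x480 patch.
objects = [
    {"id": 0, "category": "ship", "bbox": (10, 10, 50, 40)},
    {"id": 1, "category": "ship", "bbox": (20, 30, 80, 90)},
    {"id": 2, "category": "plane", "bbox": (400, 400, 470, 460)},
]
result = unique_expressions(objects, 480, 480)
print(result)
```

Both ships fall in the same grid cell, so their shared expression is ambiguous and is filtered out; only the plane keeps an expression. A fuller pipeline would presumably try further cues (relations, extremes, size) before discarding an object.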
model.py defines the SigLIP2 + SAM architecture (RSRefSeg) with LoRA adapters. Training and testing use the dataset downloaded above.
# Train (writes checkpoint under rsrefseg/models/ by default)
cd ~/aerial-d/rsrefseg
python train.py --dataset_root ../datagen/dataset --custom_name aeriald_run
# Test the produced checkpoint
python test.py --model_name aeriald_run --dataset_type aeriald
# Or download published checkpoints from HuggingFace
huggingface-cli download luisml77/rsrefseg --repo-type model --local-dir models/
# Test with Aerial-D only checkpoint (rsrefseg_aerial-d.pt)
python test.py --model_name rsrefseg_aerial-d --dataset_type aeriald
# Test with combined multi-dataset checkpoint (rsrefseg_combined.pt, requires SAM ViT-Large)
python test.py --model_name rsrefseg_combined --dataset_type aeriald --sam_model facebook/sam-vit-large
The training script fine-tunes SigLIP2-SO400M and SAM-ViT (Base or Large) on Aerial-D only. The optional --custom_name flag sets the run folder name under rsrefseg/models/; pass the same name to test.py as --model_name for evaluation.
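For readers unfamiliar with LoRA adapters, the following NumPy sketch shows the general idea: the pretrained weight stays frozen and a low-rank product, scaled by alpha / r, is added on top. This is a generic illustration of the technique, not the code in model.py; the class name, rank, alpha, and initialization values are assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / r."""

    def __init__(self, weight, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (out, in)
        # A gets a small random init, B starts at zero, so the update is zero at first.
        self.A = rng.normal(0.0, 0.01, size=(r, in_features))
        self.B = np.zeros((out_features, r))
        self.scale = alpha / r

    def __call__(self, x):
        # x has shape (batch, in_features); output has shape (batch, out_features).
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_parameters(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
weight = rng.normal(size=(64, 128))  # stand-in for one pretrained projection matrix
layer = LoRALinear(weight, r=8, alpha=16)
x = rng.normal(size=(4, 128))

print(layer(x).shape)                              # (4, 64)
print(layer.trainable_parameters(), weight.size)   # adapter parameters vs. full weight
```

Because B is initialized to zero, the adapted layer reproduces the pretrained layer exactly before training, and only A and B (here 1,536 values against 8,192 in the full matrix) would receive gradient updates.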
Browse the complete dataset with images, expressions, and segmentation masks through an interactive web interface:
cd ~/aerial-d/datagen
python utils/rule_viewer.py --port 5004
# Navigate to http://localhost:5004
Test trained models with your own images and referring expressions:
cd ~/aerial-d/rsrefseg
CUDA_VISIBLE_DEVICES=0 python utils/rsrefseg_inference_app.py \
--model_name aeriald_run \
--sam_model facebook/sam-vit-large \
--port 5002
# Navigate to http://localhost:5002
Training-time augmentations approximate monochrome, grainy, and sepia degradations through luminance conversion, gamma/contrast adjustments, and additive noise. Combined with Urban1960SatSeg, these filters preserve segmentation quality under archival conditions.
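The augmentations described above could be sketched roughly as follows with NumPy. The function names, luma weights (ITU-R BT.601), sepia tint, gamma, contrast, and noise level are assumptions for illustration only; the repository's actual filters and parameter ranges may differ.

```python
import numpy as np

# ITU-R BT.601 luma weights, a common choice for luminance conversion.
LUMA = np.array([0.299, 0.587, 0.114])


def to_monochrome(image):
    """Convert a float RGB image (H, W, 3) in [0, 1] to gray, replicated on 3 channels."""
    gray = image @ LUMA
    return np.repeat(gray[..., None], 3, axis=2)


def sepia(image, tint=(1.0, 0.85, 0.65)):
    """Tint the luminance image with a warm color (tint values are illustrative)."""
    return np.clip(to_monochrome(image) * np.array(tint), 0.0, 1.0)


def adjust_gamma_contrast(image, gamma=1.2, contrast=0.8):
    """Apply a gamma curve, then scale contrast around mid-gray."""
    out = np.clip(image, 0.0, 1.0) ** gamma
    out = (out - 0.5) * contrast + 0.5
    return np.clip(out, 0.0, 1.0)


def add_grain(image, sigma=0.05, rng=None):
    """Add Gaussian noise shared across channels to imitate film grain."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=image.shape[:2] + (1,))
    return np.clip(image + noise, 0.0, 1.0)


def historic_augment(image, rng=None):
    """Pick monochrome or sepia at random, then apply gamma/contrast and grain."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        out = to_monochrome(image)
    else:
        out = sepia(image)
    out = adjust_gamma_contrast(out)
    return add_grain(out, rng=rng)


rng = np.random.default_rng(0)
image = rng.random((4, 5, 3))  # stand-in for a normalized RGB patch
augmented = historic_augment(image, rng=rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

Only the image is altered here; the segmentation mask is untouched because these filters change appearance, not geometry.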
If you use this dataset or code, please cite:
@article{marnoto2025aeriald,
title={Generalized Referring Expression Segmentation on Aerial Photos},
author={Marnoto, Luís and Bernardino, Alexandre and Martins, Bruno},
journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
year={2025},
note={Submitted}
}
This work was developed as part of the Master's thesis "Generalized Referring Expression Segmentation on Aerial Photos" by Luis Marnoto at Instituto Superior Técnico (IST). The complete dissertation is available at: https://fenix.tecnico.ulisboa.pt/cursos/meec21/dissertacao/283828618791208
Issues and pull requests are welcome. Please open an issue before submitting substantial changes.
- Luís Marnoto: luis.marnoto.gaspar.lopes@tecnico.ulisboa.pt
- Alexandre Bernardino: alexandre.bernardino@tecnico.ulisboa.pt
- Bruno Martins: bruno.g.martins@tecnico.ulisboa.pt
Or open a GitHub issue.



