Wooseok Jang1, Seonghu Jeon1, Jisang Han1, Jinhyeok Choi1, Minkyung Kwon1, Seungryong Kim1, Saining Xie2, Sainan Liu3
1KAIST 2New York University 3Intel Labs
- 2026-03-25: Cleaned up camera conventions and removed unused debugging code. All input cameras are now expected in OpenCV convention (X-right, Y-down, Z-forward); see the conversion sketch below. Updated checkpoint.
- 2026-03-24: Initial code and model release.
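If your camera poses come from an OpenGL-style pipeline (X-right, Y-up, Z-backward), they need to be converted before being passed in. A minimal sketch of that conversion is below; the function name is illustrative, and the assumption that poses are 4×4 camera-to-world matrices is mine rather than part of the released API.

```python
import numpy as np

# Flips the camera Y and Z axes:
# OpenGL (X-right, Y-up, Z-backward) -> OpenCV (X-right, Y-down, Z-forward).
OPENGL_TO_OPENCV = np.diag([1.0, -1.0, -1.0, 1.0])

def opengl_to_opencv_c2w(c2w: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from OpenGL to OpenCV convention."""
    return c2w @ OPENGL_TO_OPENCV
```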
GLD performs multi-view diffusion in the feature space of geometric foundation models (Depth Anything 3 / VGGT), enabling novel view synthesis with zero-shot geometry — trained from scratch without text-to-image pretraining.
- 4.4× faster training convergence vs. VAE-based approaches
- Zero-shot depth & 3D from synthesized latents via frozen decoders (see the pipeline sketch below)
- State-of-the-art on RE10K and DL3DV benchmarks
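The points above compress the whole pipeline, so here is a minimal sketch of the intended inference flow. The module names and call signatures (encoder, dit, mae_decoder, dpt_decoder) are hypothetical stand-ins for the actual components, not the repository's API:

```python
import torch

@torch.no_grad()
def synthesize_views(encoder, dit, mae_decoder, dpt_decoder,
                     context_images, context_cams, target_cams, num_steps=50):
    """Illustrative GLD inference flow (names and signatures are assumptions).

    1. A frozen geometric encoder (DA3 / VGGT) maps context views to latents.
    2. The DiT iteratively denoises Gaussian noise into latents for the
       target cameras, conditioned on the context latents and poses.
    3. Frozen decoders read RGB (MAE decoder) and depth/geometry (DPT decoder)
       directly from the synthesized latents -- no VAE, no separate 3D stage.
    """
    context_latents = encoder(context_images, context_cams)   # frozen encoder
    target_latents = torch.randn_like(context_latents)        # start from noise
    for t in dit.timesteps(num_steps):                         # denoising loop
        target_latents = dit.denoise(
            target_latents, t,
            cond=(context_latents, context_cams, target_cams))
    rgb = mae_decoder(target_latents)     # novel-view images
    depth = dpt_decoder(target_latents)   # zero-shot depth / geometry
    return rgb, depth
```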
- GPU: 48GB+ VRAM recommended (e.g., A6000, A100); cascade mode loads two DiT models simultaneously. A quick VRAM check is sketched below.
- Python: 3.10+
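The VRAM recommendation is easy to verify up front; the snippet below simply reports the first visible GPU's total memory (a convenience check, not part of the repository):

```python
import torch

# Report the first visible CUDA device and its total memory.
assert torch.cuda.is_available(), "No CUDA device visible"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB")
```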
conda env create -f environment.yml
conda activate gld

Download all checkpoints from HuggingFace:
# Download all model weights
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"This places files as follows:
pretrained_models/
da3/
model.safetensors # DA3-Base encoder weights
dpt_decoder.pt # DPT decoder (depth + geometry)
mae_decoder.pt # DA3 MAE decoder (RGB)
vggt/
mae_decoder.pt # VGGT MAE decoder (RGB)
checkpoints/
da3_level1.pt # DA3 level-1 diffusion
da3_cascade.pt # DA3 cascade (level-1 → level-0)
vggt_level1.pt # VGGT level-1 diffusion
vggt_cascade.pt # VGGT cascade (level-1 → level-0)
model_stats/ # Latent normalization statistics (included in repo)
da3/normalization_stats_level{0-3}.pt
vggt/normalization_stats_level{0-3}.pt
vggt/special_stats_level{0-3}.pt
model_stats/ and configs/ are already included in the repository.
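As a quick sanity check that the download landed in the expected layout, the weights and statistics can be loaded directly with torch. Treating each stats file as a torch-serialized dict of tensors is an assumption here, not a documented format:

```python
import torch

# Paths follow the layout above.
ckpt = torch.load("checkpoints/da3_cascade.pt", map_location="cpu")
stats = torch.load("model_stats/da3/normalization_stats_level1.pt", map_location="cpu")

print(type(ckpt), type(stats))
if isinstance(stats, dict):
    print("stats keys:", list(stats.keys()))  # e.g. per-channel statistics, if present
```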
# DA3 backbone
./run_demo.sh da3
# VGGT backbone
./run_demo.sh vggt

This runs NVS on the included demo scenes and generates 3D reconstructions (GLB + COLMAP).
To specify a GPU: ./run_demo.sh da3 <GPU_ID>
NOTE: For now, 3D reconstruction is supported for the DA3 backbone only. 3D reconstruction code for the VGGT checkpoint will be released soon.
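The GLB output is convenient for a quick visual check; one way to inspect it is sketched below. The output path is a placeholder, and trimesh/pyglet are not project dependencies, so install them separately if you want to try this:

```python
import trimesh

# Placeholder path: point this at the GLB the demo wrote for your scene.
scene = trimesh.load("results/demo/scene.glb")
print(scene)    # summary of the geometries in the scene
scene.show()    # interactive viewer (requires pyglet)
```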
# DA3 level-1
./run_train.sh da3 level1
# DA3 cascade (level-1 → level-0)
./run_train.sh da3 cascade
# VGGT level-1
./run_train.sh vggt level1

Multi-GPU: edit --nproc_per_node in run_train.sh.
Train the MAE decoder (RGB reconstruction) on frozen DA3 encoder features with GAN + LPIPS losses:
./scripts/run_train_stage1_mae.sh [NUM_GPUS] [RESUME_CKPT]
# Example: 4 GPUs
./scripts/run_train_stage1_mae.sh 4
# Resume from checkpoint
./scripts/run_train_stage1_mae.sh 4 results/stage1-mae/.../checkpoints/0050000.pt

See configs/training/DA3_stage1_mae.yaml for training hyperparameters.
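For orientation, the stage-1 objective described above combines pixel reconstruction, perceptual (LPIPS), and adversarial terms. The sketch below shows one standard way to assemble such a decoder loss; the loss weights and the discriminator interface are assumptions for illustration, not the repository's actual hyperparameters:

```python
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance in VGG feature space

def decoder_loss(pred_rgb, target_rgb, discriminator,
                 w_rec=1.0, w_lpips=1.0, w_gan=0.1):
    """Reconstruction + LPIPS + non-saturating GAN loss for an RGB decoder.

    Loss weights and the discriminator call signature are illustrative.
    Images are expected in [-1, 1], as lpips assumes.
    """
    rec = F.l1_loss(pred_rgb, target_rgb)                 # pixel reconstruction
    perc = lpips_fn(pred_rgb, target_rgb).mean()          # perceptual term
    logits_fake = discriminator(pred_rgb)                 # generator-side GAN term
    gan = F.softplus(-logits_fake).mean()                 # non-saturating loss
    return w_rec * rec + w_lpips * perc + w_gan * gan
```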
# DA3 cascade (default)
./eval_gld.sh da3 cascade
# VGGT cascade
./eval_gld.sh vggt cascade
# Independent (single level, no cascade)
./eval_gld.sh da3 independent

The repository is organized as follows:

├── src/
│ ├── stage1/ # Feature encoder (DA3/VGGT) + decoders (MAE/DPT)
│ ├── stage2/ # DiT diffusion transformer
│ ├── utils/ # Metrics, camera, config, validation
│ ├── datasets/ # Eval dataset adapter
│ ├── video/ # Training data loaders (CUT3R format)
│ ├── train_multiview_da3.py # Stage 2 training
│ ├── train_stage1_mae.py # Stage 1 decoder training
│ └── eval_gld_metric.py
├── configs/
│ ├── training/ # Model configs (DA3/VGGT × level1/cascade)
│ └── eval/ # Evaluation configs
├── demo/ # Demo scenes (RE10K + DL3DV)
├── scripts/ # 3D reconstruction utilities
├── run_train.sh
├── eval_gld.sh
├── run_demo.sh
└── environment.yml
@article{jang2026gld,
title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
journal={arXiv preprint arXiv:2603.22275},
year={2026}
}

Built upon RAE, Depth Anything 3, VGGT, CUT3R, and SiT.
