Wooseok Jang1, Seonghu Jeon1, Jisang Han1, Jinhyeok Choi1, Minkyung Kwon1, Seungryong Kim1, Saining Xie2, Sainan Liu3
1KAIST 2New York University 3Intel Labs
- 2026-03-25: Cleaned up camera conventions and removed unused debugging code. All input cameras are now expected in OpenCV convention (X-right, Y-down, Z-forward); see the conversion sketch below. Updated checkpoint.
- 2026-03-24: Initial code and model release.
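If your camera poses come from an OpenGL-style pipeline (X-right, Y-up, Z-backward), they need to be converted before being passed in. A minimal sketch of that conversion is below; the function name is illustrative, and the assumption that poses are 4×4 camera-to-world matrices is mine rather than part of the released API.

```python
import numpy as np

# Flips the camera Y and Z axes:
# OpenGL (X-right, Y-up, Z-backward) -> OpenCV (X-right, Y-down, Z-forward).
OPENGL_TO_OPENCV = np.diag([1.0, -1.0, -1.0, 1.0])

def opengl_to_opencv_c2w(c2w: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world pose from OpenGL to OpenCV convention."""
    return c2w @ OPENGL_TO_OPENCV
```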
GLD performs multi-view diffusion in the feature space of geometric foundation models (Depth Anything 3 / VGGT), enabling novel view synthesis with zero-shot geometry — trained from scratch without text-to-image pretraining.
- 4.4× faster training convergence vs. VAE-based approaches
- Zero-shot depth & 3D from synthesized latents via frozen decoders (see the pipeline sketch below)
- State-of-the-art on RE10K and DL3DV benchmarks
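The points above compress the whole pipeline, so here is a minimal sketch of the intended inference flow. The module names and call signatures (encoder, dit, mae_decoder, dpt_decoder) are hypothetical stand-ins for the actual components, not the repository's API:

```python
import torch

@torch.no_grad()
def synthesize_views(encoder, dit, mae_decoder, dpt_decoder,
                     context_images, context_cams, target_cams, num_steps=50):
    """Illustrative GLD inference flow (names and signatures are assumptions).

    1. A frozen geometric encoder (DA3 / VGGT) maps context views to latents.
    2. The DiT iteratively denoises Gaussian noise into latents for the
       target cameras, conditioned on the context latents and poses.
    3. Frozen decoders read RGB (MAE decoder) and depth/geometry (DPT decoder)
       directly from the synthesized latents -- no VAE, no separate 3D stage.
    """
    context_latents = encoder(context_images, context_cams)   # frozen encoder
    target_latents = torch.randn_like(context_latents)        # start from noise
    for t in dit.timesteps(num_steps):                         # denoising loop
        target_latents = dit.denoise(
            target_latents, t,
            cond=(context_latents, context_cams, target_cams))
    rgb = mae_decoder(target_latents)     # novel-view images
    depth = dpt_decoder(target_latents)   # zero-shot depth / geometry
    return rgb, depth
```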
- GPU: 48GB+ VRAM recommended (e.g., A6000, A100); cascade mode loads two DiT models simultaneously. A quick VRAM check is sketched below.
- Python: 3.10+
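The VRAM recommendation is easy to verify up front; the snippet below simply reports the first visible GPU's total memory (a convenience check, not part of the repository):

```python
import torch

# Report the first visible CUDA device and its total memory.
assert torch.cuda.is_available(), "No CUDA device visible"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB")
```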
conda env create -f environment.yml
conda activate gld

Download all checkpoints from HuggingFace:
# Download all model weights
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"This places files as follows:
pretrained_models/
da3/
model.safetensors # DA3-Base encoder weights
dpt_decoder.pt # DPT decoder (depth + geometry)
mae_decoder.pt # DA3 MAE decoder (RGB)
vggt/
mae_decoder.pt # VGGT MAE decoder (RGB)
checkpoints/
da3_level1.pt # DA3 level-1 diffusion
da3_cascade.pt # DA3 cascade (level-1 → level-0)
vggt_level1.pt # VGGT level-1 diffusion
vggt_cascade.pt # VGGT cascade (level-1 → level-0)
model_stats/ # Latent normalization statistics (included in repo)
da3/normalization_stats_level{0-3}.pt
vggt/normalization_stats_level{0-3}.pt
vggt/special_stats_level{0-3}.pt
model_stats/ and configs/ are already included in the repository.
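As a quick sanity check that the download landed in the expected layout, the weights and statistics can be loaded directly with torch. Treating each stats file as a torch-serialized dict of tensors is an assumption here, not a documented format:

```python
import torch

# Paths follow the layout above.
ckpt = torch.load("checkpoints/da3_cascade.pt", map_location="cpu")
stats = torch.load("model_stats/da3/normalization_stats_level1.pt", map_location="cpu")

print(type(ckpt), type(stats))
if isinstance(stats, dict):
    print("stats keys:", list(stats.keys()))  # e.g. per-channel statistics, if present
```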
# DA3 backbone
./run_demo.sh da3
# VGGT backbone
./run_demo.sh vggt

This runs NVS on the included demo scenes and generates 3D reconstructions (GLB + COLMAP).
To specify a GPU: ./run_demo.sh da3 <GPU_ID>
NOTE: For now, 3D reconstruction is supported for the DA3 backbone only. 3D reconstruction code for the VGGT checkpoint will be released soon.
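The GLB output is convenient for a quick visual check; one way to inspect it is sketched below. The output path is a placeholder, and trimesh/pyglet are not project dependencies, so install them separately if you want to try this:

```python
import trimesh

# Placeholder path: point this at the GLB the demo wrote for your scene.
scene = trimesh.load("results/demo/scene.glb")
print(scene)    # summary of the geometries in the scene
scene.show()    # interactive viewer (requires pyglet)
```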
# DA3 level-1
./run_train.sh da3 level1
# DA3 cascade (level-1 → level-0)
./run_train.sh da3 cascade
# VGGT level-1
./run_train.sh vggt level1

Multi-GPU: edit --nproc_per_node in run_train.sh.
Train the MAE decoder (RGB reconstruction) on frozen DA3 encoder features with GAN + LPIPS losses:
./scripts/run_train_stage1_mae.sh [NUM_GPUS] [RESUME_CKPT]
# Example: 4 GPUs
./scripts/run_train_stage1_mae.sh 4
# Resume from checkpoint
./scripts/run_train_stage1_mae.sh 4 results/stage1-mae/.../checkpoints/0050000.pt

See configs/training/DA3_stage1_mae.yaml for training hyperparameters.
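For orientation, the stage-1 objective described above combines pixel reconstruction, perceptual (LPIPS), and adversarial terms. The sketch below shows one standard way to assemble such a decoder loss; the loss weights and the discriminator interface are assumptions for illustration, not the repository's actual hyperparameters:

```python
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance in VGG feature space

def decoder_loss(pred_rgb, target_rgb, discriminator,
                 w_rec=1.0, w_lpips=1.0, w_gan=0.1):
    """Reconstruction + LPIPS + non-saturating GAN loss for an RGB decoder.

    Loss weights and the discriminator call signature are illustrative.
    Images are expected in [-1, 1], as lpips assumes.
    """
    rec = F.l1_loss(pred_rgb, target_rgb)                 # pixel reconstruction
    perc = lpips_fn(pred_rgb, target_rgb).mean()          # perceptual term
    logits_fake = discriminator(pred_rgb)                 # generator-side GAN term
    gan = F.softplus(-logits_fake).mean()                 # non-saturating loss
    return w_rec * rec + w_lpips * perc + w_gan * gan
```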
# DA3 cascade (default)
./eval_gld.sh da3 cascade
# VGGT cascade
./eval_gld.sh vggt cascade
# Independent (single level, no cascade)
./eval_gld.sh da3 independent

The repository is organized as follows:

├── src/
│ ├── stage1/ # Feature encoder (DA3/VGGT) + decoders (MAE/DPT)
│ ├── stage2/ # DiT diffusion transformer
│ ├── utils/ # Metrics, camera, config, validation
│ ├── datasets/ # Eval dataset adapter
│ ├── video/ # Training data loaders (CUT3R format)
│ ├── train_multiview_da3.py # Stage 2 training
│ ├── train_stage1_mae.py # Stage 1 decoder training
│ └── eval_gld_metric.py
├── configs/
│ ├── training/ # Model configs (DA3/VGGT × level1/cascade)
│ └── eval/ # Evaluation configs
├── demo/ # Demo scenes (RE10K + DL3DV)
├── scripts/ # 3D reconstruction utilities
├── run_train.sh
├── eval_gld.sh
├── run_demo.sh
└── environment.yml
@article{jang2026gld,
title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
journal={arXiv preprint arXiv:2603.22275},
year={2026}
}

Built upon RAE, Depth Anything 3, VGGT, CUT3R, and SiT.
