Official repo for the paper "Multi-view Pyramid Transformer: Look Coarser to See Broader"
[Mar 2026] MVP optionally supports Flash Attention 4 for faster attention computation. If flash-attn-4 is installed, it will be used automatically; otherwise the code falls back to the standard F.scaled_dot_product_attention.
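The fallback behavior can be sketched as follows (a minimal illustration, not the repo's actual dispatch code; `flash_attn` as the importable module name is an assumption):

```python
from importlib.util import find_spec

def pick_attention_backend(prefer_flash: bool = True) -> str:
    """Choose the attention backend: Flash Attention if the package is
    importable, otherwise fall back to PyTorch's
    F.scaled_dot_product_attention ("sdpa")."""
    # `flash_attn` as the module name is an assumption for illustration.
    if prefer_flash and find_spec("flash_attn") is not None:
        return "flash-attn-4"
    return "sdpa"

print(pick_attention_backend())
```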
| Views | H100 + FA3 (s) | B200 + FA4 (s) | Speedup |
|---|---|---|---|
| 16 | 0.09 | 0.05 | 1.8× |
| 32 | 0.17 | 0.10 | 1.7× |
| 64 | 0.36 | 0.20 | 1.8× |
| 128 | 0.77 | 0.43 | 1.8× |
| 192 | 1.23 | 0.70 | 1.8× |
| 256 | 1.84 | 1.08 | 1.7× |
Reconstruction time (seconds) at 960x540. H100 numbers are from the original paper.
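The speedup column is simply the ratio of the two timings, rounded to one decimal; a quick check against a few rows of the table:

```python
# Speedup = H100 time / B200 time, using values from the table above
timings = {16: (0.09, 0.05), 32: (0.17, 0.10), 256: (1.84, 1.08)}
for views, (h100, b200) in timings.items():
    print(f"{views} views: {round(h100 / b200, 1)}x")
```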
[Mar 2026] We've updated the codebase with a CUDA implementation of opacity with SH coefficients, reducing both training time and memory consumption. Kudos to Hyeongbhin-Cho for the contribution. See details in the original repo.
[Mar 2026] We've released 2Xplat, a pose-free version of MVP built on a two-expert design — one for camera pose estimation, one for 3DGS generation. It outperforms prior pose-free methods and matches state-of-the-art posed approaches in under 5K training iterations.
# 1. Clone the repository
# If starting fresh (clone everything at once):
git clone --recurse-submodules https://github.com/Gynjn/MVP.git
# If already cloned (initialize submodules):
git submodule update --init --recursive
# 2. Create and activate conda environment
conda create -n mvp python=3.11 -y
conda activate mvp
# 3. Install dependencies (adjust CUDA version in requirements.txt to match your system)
pip install -r requirements.txt
# 4. Install CUDA kernels
cd rendering_cuda
pip install . --no-build-isolation
cd ../sh_cuda
pip install . --no-build-isolation
# 5. Optional
pip install flash-attn-4

The model checkpoints are hosted on HuggingFace (mvp_960x540).
For training and evaluation, we used the DL3DV dataset after applying undistortion preprocessing with this script, originally introduced in Long-LRM.
Download the DL3DV benchmark dataset from here, and apply undistortion preprocessing.
For benchmark data, we provide a preprocessed version originally sourced from the RayZer repository. You can find the preprocessed data here. Thanks to Hanwen Jiang for sharing the preprocessed data.
Update the inference.ckpt_path field in configs/inference.yaml with the pretrained model.
Update the entries in data/dl3dv_benchmark.txt to point to the correct processed dataset path.
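For reference, the entries in `data/dl3dv_benchmark.txt` are one scene directory per line; the paths below are hypothetical placeholders, so substitute your own processed dataset locations:

```text
/path/to/processed_dl3dv/scene_0001
/path/to/processed_dl3dv/scene_0002
```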
# inference
CUDA_VISIBLE_DEVICES=0 python inference.py --config configs/inference.yaml

Update configs/api_keys.yaml with your personal wandb API key.
Update the entries in data/dl3dv_train.txt to point to the correct processed dataset path.
If you have enough GPU memory, disable gradient checkpointing in the stage functions run_stage1, run_stage2, and run_stage3 in model/mvp.py.
# Example for single GPU training
CUDA_VISIBLE_DEVICES=0 python train_single.py --config configs/train_stage1.yaml
# Example for multi GPU training
torchrun --nproc_per_node 8 --nnodes 1 \
--rdzv_id 1234 --rdzv_endpoint localhost:8888 \
train.py --config configs/train_stage1.yaml

- Preprocessed Tanks&Temples and Mip-NeRF360 datasets
@article{kang2025multi,
  title={Multi-view Pyramid Transformer: Look Coarser to See Broader},
  author={Kang, Gyeongjin and Yang, Seungkwon and Nam, Seungtae and Lee, Younggeun and Kim, Jungwoo and Park, Eunbyung},
  journal={arXiv preprint arXiv:2512.07806},
  year={2025}
}
This project is built on many amazing research works, thanks a lot to all the authors for sharing!
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02653113, High-Performance Research AI Computing Infrastructure Support at the 2 PFLOPS Scale).