Chonghyuk (ND) Song1
·
Michal Stary1
·
Boyuan Chen1
·
George Kopanas2
·
Vincent Sitzmann1
1MIT CSAIL, Scene Representation Group 2Runway ML
This is the official repository for the paper Generative View Stitching (GVS), which enables collision-free camera-guided video generation for predefined trajectories, and presents a non-autoregressive alternative to video length extrapolation.
A recent commit fixed a bug in the code that overestimated the MET3R cosine value. Please pull the latest version of the codebase to reproduce the quantitative results in the paper.
git clone git@github.com:andrewsonga/generative_view_stitching.git --recursive
cd generative_view_stitching
a) Create a conda environment based on an environment definition file. This option ensures maximum reproducibility by encoding both conda and pip dependencies but is overfit to our source machine's platform settings (Ubuntu 22.04.5, NVIDIA Driver Version 545.23.08, CUDA Version 12.3).
conda env create -f environment.yml
conda activate gvs
b) Create a new conda environment and install dependencies with pip. This option adapts better to the user's machine by encoding only pip dependencies, but may be less reproducible.
conda create python=3.10 -n gvs
conda activate gvs
pip install -r requirements.txt
# met3r
cd third_party/met3r
pip install -r requirements.txt # this will automatically install featup and pytorch3d
pip install -e . # this will actually install met3r
cd ../../
# VideoDepthAnything
cd third_party/Video-Depth-Anything
sh get_weights.sh # this will download checkpoints
cd ../../
We use Weights & Biases for logging. Sign up if you don't have an account, and modify wandb.entity in config.yaml to your user/organization name.
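The exact layout of config.yaml may vary across versions of this repo; as a sketch, assuming a top-level `wandb` block, the entry to change would look like:

```yaml
wandb:
  # replace with your Weights & Biases user or organization name
  entity: your-username-or-org
```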
Download and uncompress our benchmark camera trajectories with the following command (takes ~3 minutes):
sh get_benchmark.sh
NOTE: if you encounter CUDA out-of-memory errors due to limited VRAM, try running with `@baseline/ours_scalable` and `algorithm=gvs_scalable_video_pose`, which requires less VRAM by denoising the context windows one at a time.
1. GVS on the Staircase Circuit
python main.py @baseline/ours algorithm=gvs_video_pose dataset=staircase_circuit @experiment/main_ours_staircase_circuit
2. GVS on the Impossible Staircase (120 frames, takes ~5 minutes per sample on a single NVIDIA H200 GPU)
python main.py @baseline/ours algorithm=gvs_video_pose dataset=impossible_staircase @experiment/application_impossible_staircase
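As an illustration of the low-VRAM option mentioned above, the Staircase Circuit command would become something like the following sketch; the `@experiment` override is carried over from the original command and may need adjusting for the scalable baseline:

```shell
python main.py @baseline/ours_scalable algorithm=gvs_scalable_video_pose dataset=staircase_circuit @experiment/main_ours_staircase_circuit
```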
We provide the commands for reproducing every experiment in our paper.
This repo is forked from History-Guided Video Diffusion, which in turn is forked from Boyuan Chen's research template repo. By its license, you must keep the above sentence in README.md and the LICENSE file to credit the author.
This work was supported by the National Science Foundation under Grant No. 2211259, by the Singapore DSTA under DST00OECI20300823 (New Representations for Vision, 3D Self-Supervised Learning for Label-Efficient Vision), by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) under 140D0423C0075, by the Amazon Science Hub, by the MIT-Google Program for Computing Innovation, by Sony Interactive Entertainment, and by a 2025 MIT ORCD Seed Fund Compute Grant.
If our work is useful for your research, please consider giving us a star and citing our paper:
@article{song2025gvs,
title={Generative View Stitching},
author={Song, Chonghyuk and Stary, Michal and Chen, Boyuan and Kopanas, George and Sitzmann, Vincent},
journal={arXiv preprint arXiv:2510.24718},
year={2025},
}