Hongfei Zhang1*, Harold H. Chen1,2*, Chenfei Liao1*, Jing He1*, Zixin Zhang1, Haodong Li3, Yihao Liang4,
Kanghao Chen1, Bin Ren5, Xu Zheng1, Shuai Yang1, Kun Zhou6, Yinchuan Li7, Nicu Sebe8,
Ying-Cong Chen1,2†
*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3UCSD, 4Princeton University, 5MBZUAI, 6SZU, 7Knowin, 8UniTrento
Welcome to the official repository for DVD: Deterministic Video Depth!
While current video depth estimation methods face a strict ambiguity-hallucination dilemma, where discriminative models suffer from semantic ambiguity and poor open-world generalization while generative models struggle with stochastic hallucinations and temporal flickering, DVD fundamentally breaks this trade-off.
We present the first deterministic framework that elegantly adapts pre-trained Video Diffusion Models (such as Wan2.1) into single-pass depth regressors. By cleanly stripping away generative stochasticity, DVD unites the profound semantic priors of generative models with the structural stability of discriminative regressors.
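The contrast between iterative generative sampling and DVD's single-pass regression can be sketched schematically. The toy NumPy example below uses a hypothetical `denoiser(x, t, cond)` callable and is purely illustrative (not DVD's actual code): iterative sampling starts from random noise and calls the network many times, while a single-pass regressor feeds a fixed, noise-free input through the network exactly once.

```python
import numpy as np

def iterative_sample(denoiser, cond, steps=50, seed=None):
    """Conventional diffusion inference: start from random noise and call
    the denoiser many times; the random init makes outputs stochastic."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)
    for t in np.linspace(1.0, 1.0 / steps, steps):
        x = denoiser(x, t, cond)
    return x

def single_pass(denoiser, cond):
    """Deterministic regression: a fixed (noise-free) input and a single
    denoiser call, so the same video always yields the same depth."""
    x = np.zeros_like(cond)  # no sampled noise -> no stochasticity
    return denoiser(x, 1.0, cond)
```

Because `single_pass` draws no noise, repeated runs on the same conditioning are bit-identical, which is the property behind the temporal-stability claim.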
- Extreme Data Efficiency: DVD effectively unlocks profound generative priors using only 367K frames, which is 163× less task-specific training data than leading discriminative baselines like VDA (60M frames).
- Deterministic & Fast: Bypasses iterative ODE integration. Inference is performed in a single forward pass, ensuring absolute temporal stability without generative hallucinations.
- Unparalleled Structural Fidelity: Powered by our Latent Manifold Rectification (LMR), DVD achieves state-of-the-art high-frequency boundary precision (Boundary Recall & F1) compared to overly smoothed baselines.
- Long-Video Inference: Equipped with our training-free Global Affine Coherence module, DVD seamlessly stitches sliding windows to support long-video rollouts with negligible scale drift.
TL;DR: If you want state-of-the-art video depth estimation that is highly detailed, temporally stable across long videos, and exceptionally data-efficient, DVD is what you need.
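The sliding-window stitching idea above can be sketched in a few lines: each new window is mapped onto the running result by a scale/shift fitted by least squares on the overlapping frames. This is an illustrative NumPy reimplementation of the general technique, not the repository's actual Global Affine Coherence module.

```python
import numpy as np

def fit_affine(src, ref):
    """Least-squares scale/shift mapping src depths onto ref depths."""
    s_flat, r_flat = src.ravel(), ref.ravel()
    A = np.stack([s_flat, np.ones_like(s_flat)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, r_flat, rcond=None)
    return scale, shift

def stitch_windows(windows, overlap):
    """Chain windows of per-frame depth maps into one long sequence.

    Each later window is affinely aligned to the running result on the
    overlapping frames, then its non-overlapping frames are appended.
    """
    result = list(windows[0])
    for w in windows[1:]:
        ref = np.stack(result[-overlap:])
        scale, shift = fit_affine(np.stack(w[:overlap]), ref)
        aligned = [scale * f + shift for f in w]
        result.extend(aligned[overlap:])
    return result
```

A per-window affine fit is the natural choice here because relative depth is only defined up to scale and shift, so removing that ambiguity at each window boundary suppresses cumulative drift.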
- [2026.03.13] Hugging Face Gradio demos (online and local) released.
- [2026.03.13] Paper is available on arXiv.
- [2026.03.12] Project page is live.
- [2026.03.11] Pre-trained weights released on Hugging Face.
- [2026.03.10] Repository initialized; training & inference code released.
We provide the official pre-trained weights for DVD, designed for robust, zero-shot relative video depth estimation.
| Model Version | Backbone | Description | Download |
|---|---|---|---|
| DVD v1.0 | Wan2.1 | Our default model, achieving SoTA performance with unprecedented structural fidelity. | Hugging Face |
| DVD v1.1 | - | Performance optimizations & refined temporal consistency. | Coming soon |
To help you navigate the codebase quickly, we have divided the core directories into two main categories based on what you want to do: Inference (just using the model) or Training (fine-tuning or training from scratch).
If you just want to generate depth maps from your own videos or reproduce our paper's results, focus on these folders:
- `infer_bash/` (The Launchpad): Ready-to-use shell scripts (e.g., `openworld.sh`). This is the easiest way to run the model on your data without writing any code.
- `ckpt/` (The Vault): Place the pre-trained model weights downloaded from Hugging Face here.
- `inference_results/` (The Output Bin): Once you run an inference script, your generated depth maps and visualization videos will appear here.
- `demo/`: Quick-start examples and sample inputs to help you verify that your environment is set up correctly.
If you want to train DVD on your own datasets or modify the architecture, these are your go-to folders:
- `train_config/` (The Control Center): YAML configuration files. You can easily tweak hyperparameters (e.g., learning rate, batch size) and dataset paths here.
- `train_script/` (The Engine): Contains the training launch scripts.
- `diffsynth/pipelines/wan_video_new_determine.py` (The Brain): The core DVD model architecture. Look here to understand or modify how we stripped away generative noise to build the deterministic forward pass.
- `infer_bash/` & `test_script/` (The Evaluator): Scripts for evaluating newly trained checkpoints against standard benchmarks during or after training.
- `examples/dataset/`: Dataset construction code.
```bash
git clone https://github.com/EnVision-Research/DVD.git
cd DVD
conda create -n dvd python=3.10 -y
conda activate dvd
pip install -e .
pip install sageattention  # DO NOT use this for training!
```
1. Log in to your Hugging Face account:

```bash
huggingface-cli login  # or: hf auth login
```

2. Download the checkpoint from the Hugging Face repo:

```bash
huggingface-cli download FayeHongfeiZhang/DVD --revision main --local-dir ckpt
```
```
DVD
├── ckpt/
│   ├── model_config.yaml
│   └── model.safetensors
├── configs/
├── examples/
└── ...
```
Potential Issue (from DiffSynth Studio)
If you encounter issues during installation, it may be caused by the packages we depend on. Please refer to the documentation of the package that caused the problem.
We provide an interactive Gradio interface for you to easily test DVD on your own videos without writing any code.
1. Online Demo: The easiest way to experience DVD! Try it out directly on our Hugging Face Space.

Note on the online demo: due to GPU resource constraints on Hugging Face, the online web demo is currently limited to processing videos of up to 5 seconds. To process longer videos, we highly recommend running the local deployment below!
2. Local Deployment: If you prefer to run the UI locally, ensure your environment is set up and simply execute:

```bash
python test_script/app.py
```

Alternatively, run inference directly from the command line:

```bash
bash infer_bash/openworld.sh
```

You may also put more videos in the `demo/` directory and change the video path in the script to process them.
1.1. Download the KITTI, Bonn, and ScanNet datasets.
1.2. Organize each dataset as shown below:
```
kitti_depth
├── rgb
│   ├── 2011_09_26
│   └── ...
└── depth
    ├── train
    └── val
```

```
rgbd_bonn_dataset
├── rgbd_bonn_balloon
├── rgbd_bonn_balloon_tracking
└── ...
```

```
scannet
├── scene0000_00
├── scene0000_01
└── ...
```
1.3. Reconfigure the script (set `$VIDEO_BASE_DATA_DIR`) and run the video inference script:

```bash
bash infer_bash/video.sh
```
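Before launching, a quick sanity check of the directory layout above can save a failed run. The snippet below is a small illustrative helper (the `check_layout` function and the expected sub-paths are taken from the trees shown; the base path is whatever you set `$VIDEO_BASE_DATA_DIR` to):

```python
from pathlib import Path

# Expected sub-layout per dataset root, matching the trees shown above.
EXPECTED = {
    "kitti_depth": ["rgb", "depth/train", "depth/val"],
    "rgbd_bonn_dataset": [],  # scene folders like rgbd_bonn_balloon/
    "scannet": [],            # scene folders like scene0000_00/
}

def check_layout(base):
    """Return a list of expected directories missing under `base`."""
    base = Path(base)
    missing = []
    for dataset, subdirs in EXPECTED.items():
        root = base / dataset
        if not root.is_dir():
            missing.append(root)
            continue
        missing += [root / s for s in subdirs if not (root / s).is_dir()]
    return missing
```

An empty return value means the three dataset roots and the KITTI sub-splits are where the inference script expects them.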
2.1. Download the evaluation (depth) datasets following the instructions in Marigold.
2.2. Reconfigure the script (set `$IMAGE_BASE_DATA_DIR`) and run the image inference script:

```bash
bash infer_bash/image.sh
```
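For context on what these benchmarks measure: the standard protocol for affine-invariant depth (used by Marigold and related work) first aligns each prediction to the ground truth with a least-squares scale/shift, then computes metrics such as AbsRel and δ1. The sketch below illustrates that protocol; it is not the repository's exact evaluator.

```python
import numpy as np

def absrel_delta1(pred, gt, mask=None):
    """Affine-align `pred` to `gt`, then compute AbsRel and delta1.

    Standard relative-depth protocol: solve for scale/shift by least
    squares on valid pixels, then evaluate the aligned prediction.
    """
    if mask is None:
        mask = gt > 0
    p, g = pred[mask].astype(np.float64), gt[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    p = np.clip(s * p + t, 1e-6, None)  # guard against division by zero
    absrel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return absrel, delta1
```

Because of the alignment step, a prediction that differs from the ground truth only by a global scale and shift scores perfectly, which is exactly the invariance a relative-depth model should be judged under.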
Please refer to this document for details on training.
This project adopts a split-license strategy to comply with the licensing terms of our upstream datasets and foundation models:
- Code: The source code of DVD is released under the permissive Apache 2.0 License.
- Model Weights: The pre-trained model weights are released under the CC BY-NC 4.0 License, which strictly limits usage to non-commercial, academic, and research purposes.
By downloading or using the code and models, you agree to abide by these terms.
We sincerely thank the authors of Depth Anything and RollingDepth for sharing their implementation details. We also thank the contributors of DiffSynth, from which we borrow code.
If you find our work useful in your research, please consider citing:
```bibtex
@article{zhang2026dvd,
  title={DVD: Deterministic Video Depth Estimation with Generative Priors},
  author={Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and He, Jing and Zhang, Zixin and Li, Haodong and Liang, Yihao and Chen, Kanghao and Ren, Bin and Zheng, Xu and Yang, Shuai and Zhou, Kun and Li, Yinchuan and Sebe, Nicu and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2603.12250},
  year={2026}
}
```