"Simplicity is prerequisite for reliability." --- Edsger W. Dijkstra
AnyDepth: Depth Estimation Made Easy
Zeyu Ren1*, Zeyu Zhang2*†, Wukai Li2, Qingxiang Liu3, Hao Tang2‡
1The University of Melbourne, 2Peking University, 3Shanghai University of Engineering Science
*Equal contribution. †Project lead. ‡Corresponding author.
If you find our code or paper helpful, please consider starring ⭐ us and citing:
@article{ren2026anydepth,
title={AnyDepth: Depth Estimation Made Easy},
author={Ren, Zeyu and Zhang, Zeyu and Li, Wukai and Liu, Qingxiang and Tang, Hao},
journal={arXiv preprint arXiv:2601.02760},
year={2026}
}We present AnyDepth, a simple and efficient training framework for zero-shot monocular depth estimation. The core contribution is the Simple Depth Transformer (SDT), a compact decoder that achieves comparable accuracy to DPT while reducing parameters by 85%--89%.
Key Features:
- Single-Path Fusion: Fuse-then-reassemble strategy avoids multi-branch cross-scale alignment
- Weighted Fusion: Learnable fusion of 4-layer ViT features with cls token readout
- Spatial Detail Enhancer (SDE): Depthwise convolution for local spatial modeling
- DySample Upsampling: Two-stage learnable upsampling (H/16 -> H/4 -> H)
- Lightweight: Only ~5-13M parameters for the decoder
SDT has no additional dependencies beyond PyTorch. Simply replace DPT or other decoders with SDT in your existing codebase.
import torch
from sdt_head import SDTHead
# in_channels: ViT-S=384, ViT-B=768, ViT-L=1024
# extract layers: [2,5,8,11] for ViT-S/B, [4,11,17,23] for ViT-L
head = SDTHead(
in_channels=in_channels,
fusion_channels=fusion_channels,
n_output_channels=1,
use_cls_token=True
)We provide the training splits (369K samples) in the datasets/ folder. To prepare the data:
-
Hypersim & Virtual KITTI: Follow the instructions from Lotus to download and prepare these datasets.
-
IRS: Follow the official instructions at IRS.
-
BlendedMVS: Follow the official instructions at BlendedMVS.
-
TartanAir: Follow the official instructions at TartanAir.
Key Difference: DPT uses a reassemble-fusion strategy (per-layer reassembly + cross-scale fusion), while SDT uses a fusion-reassemble strategy (fuse tokens first, then single-path upsampling).
| Decoder | ViT Backbone | Params (M) |
|---|---|---|
| DPT | ViT-S | 50.83 |
| DPT | ViT-B | 76.05 |
| DPT | ViT-L | 99.58 |
| SDT | ViT-S | 5.51 |
| SDT | ViT-B | 9.45 |
| SDT | ViT-L | 13.38 |
| Resolution | Decoder | FLOPs (G) | Latency (ms) |
|---|---|---|---|
| 256×256 | DPT | 444.14 | 6.66 ± 0.22 |
| 256×256 | SDT (Ours) | 234.17 | 6.10 ± 0.33 |
| 512×512 | DPT | 1776.56 | 24.65 ± 0.22 |
| 512×512 | SDT (Ours) | 936.70 | 23.17 ± 0.54 |
| 1024×1024 | DPT | 7106.22 | 99.79 ± 0.79 |
| 1024×1024 | SDT (Ours) | 3746.79 | 93.09 ± 0.51 |
| Method | Data | Encoder | Params | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|---|---|---|
| DPT | 584K | ViT-S | 71.8M | 8.4 | 10.8 | 12.7 | 8.3 | 26.0 |
| AnyDepth | 369K | ViT-S | 26.5M | 8.2 | 10.2 | 8.4 | 8.0 | 24.7 |
| DPT | 584K | ViT-B | 162.1M | 7.5 | 10.8 | 10.0 | 7.1 | 24.5 |
| AnyDepth | 369K | ViT-B | 95.5M | 7.2 | 9.7 | 8.0 | 6.8 | 23.6 |
| DPT | 584K | ViT-L | 399.6M | 6.1 | 8.9 | 13.0 | 6.0 | 23.4 |
| AnyDepth | 369K | ViT-L | 313.4M | 6.0 | 8.6 | 9.6 | 5.4 | 22.6 |
Metric: AbsRel % (lower is better)
We fine-tune on Hypersim and Virtual KITTI with depth foundation models (DAv2, DA3, VGGT).
| Method | Encoder | Decoder | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|---|---|
| DAv2 | ViT-B | DPT | 5.8 | 10.4 | 8.8 | 6.2 | 23.4 |
| DAv2 | ViT-B | SDT | 5.6 | 10.7 | 7.5 | 6.1 | 23.9 |
| DA3 | ViT-L | DPT | 4.9 | 8.8 | 6.9 | 5.0 | 22.5 |
| DA3 | ViT-L | Dual-DPT | 4.9 | 8.9 | 7.0 | 4.9 | 22.3 |
| DA3 | ViT-L | SDT | 4.9 | 8.9 | 5.8 | 5.0 | 21.9 |
| VGGT | VGGT-1B | DPT | 4.8 | 15.6 | 7.2 | 4.6 | 30.7 |
| VGGT | VGGT-1B | SDT | 4.8 | 15.5 | 7.0 | 4.6 | 30.6 |
Metric: AbsRel % (lower is better). The encoder used pre-trained weights, and the decoder was randomly initialized.
SDT has been tested on Jetson Orin Nano (4GB):
| Resolution | Decoder | Latency (ms) | FPS |
|---|---|---|---|
| 256×256 | DPT | 305.65 | 3.3 |
| 256×256 | SDT (Ours) | 213.35 | 4.7 |
| 512×512 | DPT | 1107.64 | 0.9 |
| 512×512 | SDT (Ours) | 831.48 | 1.2 |
This work is licensed under CC BY-NC-SA 4.0.


