AnyDepth

Depth Estimation Made Easy

Zeyu Ren1* Zeyu Zhang2*† Wukai Li2 Qingxiang Liu3 Hao Tang2‡

1The University of Melbourne 2Peking University 3Shanghai University of Engineering Science

*Equal contribution. †Project lead. ‡Corresponding author.

“Simplicity is prerequisite for reliability.” — Edsger W. Dijkstra

Abstract

TL;DR: We present AnyDepth, a simple and efficient training framework for zero-shot monocular depth estimation, which achieves impressive performance across a variety of indoor and outdoor scenes.

Real World Results

We use the WHEELTEC R550 as the mobile platform for real-world evaluation.

Scenes 1–3 (figures): for each scene we show RGB, GT Depth, Depth Anything V2+DPT, Depth Anything V2+SDT (Ours), Depth Anything V3+DPT, Depth Anything V3+DualDPT, and Depth Anything V3+SDT (Ours).

Jetson Orin Nano Latency

Inference latency comparison of SDT and DPT decoders on a Jetson Orin Nano (4GB).

Resolution  Decoder      Latency (ms)↓  FPS↑
256×256     DPT          305.65         3.3
256×256     SDT (Ours)   213.35         4.7
512×512     DPT          1107.64        0.9
512×512     SDT (Ours)   831.48         1.2

Jetson Orin Nano Memory

Peak GPU memory usage during inference at 256×256 resolution on Jetson Orin Nano (4GB).

Decoder Peak Memory (MB)↓
DPT 589.5
SDT (Ours) 395.2

Method

Overview


AnyDepth architecture overview. The input image is encoded into tokens by a frozen DINOv3 backbone, then decoded by our lightweight SDT decoder. Tokens undergo only a single projection and weighted fusion. The Spatial Detail Enhancer (SDE) module yields finer-grained predictions. The feature map is upsampled by DySample, an efficient learnable upsampler, and the depth map is produced by the output head.
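For orientation, the pipeline in the caption can be wired up as the minimal PyTorch sketch below. It is an illustration under our own assumptions, not the released implementation: the encoder interface, the identity placeholder for the SDE, the bilinear stand-in for DySample, and the head layout are all hypothetical.

import torch
import torch.nn as nn

class AnyDepthSkeleton(nn.Module):
    """High-level sketch of the pipeline in the figure (illustrative, not the official code)."""

    def __init__(self, encoder, sdt_decoder, dim=256):
        super().__init__()
        self.encoder = encoder                  # frozen DINOv3 backbone; interface assumed
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.decoder = sdt_decoder              # lightweight SDT decoder (sketched in the next sections)
        self.sde = nn.Identity()                # placeholder for the Spatial Detail Enhancer (SDE)
        self.upsample = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)
        # bilinear upsampling stands in for DySample, the learnable upsampler used in the paper
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, 1, 1),
        )

    def forward(self, image):
        tokens = self.encoder(image)            # assumed: list of (B, N, C) token tensors from selected layers
        feat = self.decoder(tokens)             # single projection + weighted fusion -> (B, dim, h, w)
        feat = self.sde(feat)                   # finer-grained spatial detail
        feat = self.upsample(feat)
        return self.head(feat)                  # relative depth map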

Simple Depth Transformer (SDT)

Our decoder adopts a simple single-path fusion-and-reconstruction strategy, aiming to exploit the high-resolution features of DINOv3 and further unlock its performance at high resolution. We first project the tokens extracted from the encoder into a 256-dimensional space using a linear layer followed by a GELU non-linearity, which preserves sufficient information while substantially reducing the computational overhead of the subsequent decoding stages. For the class token, we keep the same processing as DPT: we concatenate it with the spatial tokens and fuse them through a learnable projection.
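A minimal sketch of this projection step, under our own assumptions (a ViT-B token width of 768, the class token stored first, and a DPT-style concatenate-and-project readout), might look as follows; the module and variable names are ours.

import torch
import torch.nn as nn

class TokenProjection(nn.Module):
    """Project encoder tokens to a compact width and fold the class token back in (DPT-style readout)."""

    def __init__(self, token_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(token_dim, dim), nn.GELU())     # single projection to 256-d
        self.readout = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())    # learnable fusion of [patch; class]

    def forward(self, tokens):
        # tokens: (B, 1 + N, token_dim), with the class token first (assumed layout)
        x = self.proj(tokens)                            # (B, 1 + N, dim)
        cls, patches = x[:, :1], x[:, 1:]                # split class / spatial tokens
        cls = cls.expand(-1, patches.shape[1], -1)       # broadcast the class token to every patch
        return self.readout(torch.cat([patches, cls], dim=-1))   # (B, N, dim)

# Shape check: a 518x518 input with 14x14 patches gives 37x37 = 1369 patch tokens.
out = TokenProjection()(torch.randn(2, 1 + 1369, 768))   # -> (2, 1369, 256)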

SDT vs. DPT

A key difference between SDT and DPT is the order of feature reassembly. DPT employs a reassemble-fusion strategy: it first applies the reassemble module to the tokens extracted by each Transformer layer, mapping them to feature maps of different scales, and then fuses these feature maps in a cascade across scales, which inevitably introduces multiple branches and repeated cross-scale alignment overhead. In contrast, SDT employs a fusion-reassemble strategy, directly projecting and fusing groups of tokens; only after this stage do we perform spatial reassembly and upsampling along a single path. This fusion-reassemble strategy avoids the cost of per-layer token reassembly and cross-scale feature-map alignment, making it more efficient and stable, especially on high-resolution inputs.
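The order-of-operations difference can be made concrete with the sketch below: all selected layers are projected and fused while still in token form, and the result is reassembled into a spatial feature map exactly once. The softmax-normalized per-layer weights and the choice of four layers are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class FuseThenReassemble(nn.Module):
    """SDT-style single path: fuse per-layer tokens first, then reassemble to a feature map once."""

    def __init__(self, num_layers=4, token_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(token_dim, dim), nn.GELU())
        self.weights = nn.Parameter(torch.ones(num_layers))    # learnable fusion weights (assumed form)

    def forward(self, layer_tokens, grid_hw):
        # layer_tokens: list of (B, N, token_dim) patch tokens from the selected encoder layers
        w = torch.softmax(self.weights, dim=0)
        fused = sum(wi * self.proj(t) for wi, t in zip(w, layer_tokens))   # fuse in token space
        b, n, c = fused.shape
        h, wd = grid_hw
        # Single reassembly: (B, N, dim) -> (B, dim, h, w). DPT would instead reassemble every
        # layer to a different scale first and then fuse the resulting feature maps in a cascade.
        return fused.transpose(1, 2).reshape(b, c, h, wd)

tokens = [torch.randn(2, 37 * 37, 768) for _ in range(4)]
feat = FuseThenReassemble()(tokens, (37, 37))               # -> (2, 256, 37, 37)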

Efficiency

Figures: (a) Model Comparison, (b) FLOPs Comparison, (c) Inference Time Comparison.

Comparison of the number of parameters (left), computational complexity (middle), and inference time (right) for AnyDepth and DPT across model sizes and input resolutions. Our method significantly reduces the parameter count and computational cost while maintaining competitive accuracy, and it consistently achieves lower latency, especially at higher resolutions.

Visualization

Teaser: outdoor visualization (figure).

Qualitative results of zero-shot monocular depth estimation using AnyDepth with a ViT-B encoder, compared with DPT-B.

Image to 3D Point Cloud


Point cloud quality comparison. Point clouds generated using SDT have more regular geometry and lower noise compared to those generated by other methods.

Dataset Quality

Figures: (a) Total Score, (b) Depth Distribution Score, (c) Gradient Continuity Score.

Dataset quality across the Total Score, Depth Distribution Score, and Gradient Continuity Score (higher is better).

Experimental Results

Quantitative Comparison of Zero-shot Affine-invariant Depth Estimation

Quantitative comparison of zero-shot affine-invariant depth estimation. Lower AbsRel values are better; higher δ1 values are better. DINOv3 uses the ViT-7B encoder, and Depth Anything V2 (DAv2) is trained on 62.6M images. For a fair comparison, the baseline (DPT) uses a frozen DINOv3 encoder with a DPT head, while our method replaces the DPT head with the proposed SDT. Bold numbers denote the better result between DPT and AnyDepth.

Method    Training Data↓  Encoder  #Params (M)↓  NYUv2           KITTI           ETH3D           ScanNet         DIODE
                                                 AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑
DINOv3    595K            ViT-7B   91.19         4.3      98.0   7.3      96.7   5.4      97.5   4.4      98.1   25.6     82.2
DAv2      62.6M           ViT-S    71.8          5.3      97.3   7.8      93.6   14.2     85.1   --       --     7.3      94.2
DAv2      62.6M           ViT-B    162.1         4.9      97.6   7.8      93.9   13.7     85.8   --       --     6.8      95.0
DAv2      62.6M           ViT-L    399.6         4.5      97.9   7.4      94.6   13.1     86.5   --       --     6.6      95.2
DPT       584K            ViT-S    71.8          8.4      93.3   10.8     89.1   12.7     92.0   8.3      93.5   26.0     71.4
DPT       584K            ViT-B    162.1         7.5      95.1   10.8     88.9   10.0     92.9   7.1      95.3   24.5     73.4
DPT       584K            ViT-L    399.6         6.1      96.8   8.9      92.5   13.0     94.9   6.0      97.0   23.4     73.9
AnyDepth  369K            ViT-S    26.5          8.2      93.2   10.2     88.3   8.4      93.5   8.0      93.6   24.7     71.4
AnyDepth  369K            ViT-B    95.5          7.2      95.0   9.7      90.1   8.0      94.5   6.8      95.6   23.6     72.7
AnyDepth  369K            ViT-L    313.4         6.0      96.8   8.6      92.6   9.6      95.4   5.4      97.4   22.6     73.6
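For reference, zero-shot affine-invariant evaluation of this kind is normally done by aligning each prediction to the ground truth with a least-squares scale and shift before computing AbsRel and δ1. The sketch below illustrates that standard protocol; it reflects our reading of the common procedure, not the paper's exact evaluation script.

import torch

def affine_invariant_metrics(pred, gt, mask):
    """Align pred to gt with a least-squares scale/shift on valid pixels, then compute AbsRel and delta_1.

    pred, gt: (H, W) depth maps; mask: (H, W) boolean validity mask. Illustrative sketch only.
    """
    p, g = pred[mask], gt[mask]
    # Solve min_{s,t} || s * p + t - g ||^2 in closed form via least squares.
    A = torch.stack([p, torch.ones_like(p)], dim=1)          # (M, 2)
    st = torch.linalg.lstsq(A, g.unsqueeze(1)).solution       # scale and shift, shape (2, 1)
    aligned = (A @ st).squeeze(1).clamp(min=1e-6)             # aligned prediction, kept positive
    abs_rel = ((aligned - g).abs() / g).mean()                # AbsRel (lower is better)
    ratio = torch.maximum(aligned / g, g / aligned)
    delta1 = (ratio < 1.25).float().mean()                    # delta_1 (higher is better)
    return abs_rel.item(), delta1.item()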

Quantitative Comparison of Zero-shot Affine-invariant Depth (Encoders & Decoders)

Quantitative comparison of zero-shot affine-invariant depth estimation with different encoders and decoders. Lower AbsRel values indicate better performance, while higher δ1 values are better. We use ViT-B as the encoder for DAv2, ViT-L for DAv3, and VGGT-1B for VGGT. The encoders use pre-trained weights and the decoders are randomly initialized. Bold numbers denote the better result.

Method  Encoder  Decoder   NYUv2           KITTI           ETH3D           ScanNet         DIODE
                           AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑    AbsRel↓  δ1↑
DAv2    ViT-B    DPT       5.8      96.2   10.4     89.1   8.8      94.6   6.2      95.3   23.4     73.8
DAv2    ViT-B    SDT       5.6      96.4   10.7     89.6   7.5      95.8   6.1      95.4   23.9     73.9
DAv3    ViT-L    DPT       4.9      96.9   8.8      92.4   6.9      95.9   5.0      96.6   22.5     74.6
DAv3    ViT-L    Dual-DPT  4.9      97.0   8.9      92.4   7.0      95.8   4.9      96.6   22.3     74.6
DAv3    ViT-L    SDT       4.9      97.1   8.9      92.4   5.8      96.6   5.0      96.6   21.9     74.9
VGGT    VGGT-1B  DPT       4.8      97.7   15.6     77.9   7.2      94.7   4.6      97.6   30.7     76.2
VGGT    VGGT-1B  SDT       4.8      98.0   15.5     80.1   7.0      95.1   4.6      98.0   30.6     76.8

Multi-resolution Efficiency

Multi-resolution efficiency comparison of SDT and DPT heads under a ViT-L encoder. Latency is averaged over 1000 runs on an NVIDIA H100 GPU. Lower is better.

Resolution  Decoder      FLOPs (G)↓  Latency (ms)↓
256×256     DPT          444.14      6.66 ± 0.22
256×256     SDT (Ours)   234.17      6.10 ± 0.33
512×512     DPT          1776.56     24.65 ± 0.22
512×512     SDT (Ours)   936.70      23.17 ± 0.54
1024×1024   DPT          7106.22     99.79 ± 0.79
1024×1024   SDT (Ours)   3746.79     93.09 ± 0.51
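The latency figures above are averaged over 1000 runs; a typical way to collect such numbers in PyTorch is sketched below, with warm-up iterations and explicit CUDA synchronization. The function name, run counts, and input shape are illustrative choices, not the paper's benchmarking script.

import time
import torch

@torch.no_grad()
def benchmark_latency(model, input_shape=(1, 3, 512, 512), runs=1000, warmup=50, device="cuda"):
    """Measure mean/std GPU inference latency in milliseconds (illustrative sketch)."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                      # warm up kernels and allocator
        model(x)
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()                 # wait for the GPU before stopping the clock
        times_ms.append((time.perf_counter() - start) * 1e3)
    t = torch.tensor(times_ms)
    return t.mean().item(), t.std().item()       # mean ± std, as reported in the table above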

Decoder Parameters

Decoder parameter comparison across different ViT backbones. Lower is better.

Decoder  ViT Backbone  Params (M)↓
DPT      ViT-S         50.83
DPT      ViT-B         76.05
DPT      ViT-L         99.58
SDT      ViT-S         5.51
SDT      ViT-B         9.45
SDT      ViT-L         13.38