AnyDepth
Depth Estimation Made Easy
1The University of Melbourne 2Peking University 3Shanghai University of Engineering Science
*Equal contribution. †Project lead. ‡Corresponding author.
“Simplicity is prerequisite for reliability.” — Edsger W. Dijkstra
Abstract
TL;DR: We present AnyDepth, a simple and efficient training framework for zero-shot monocular depth estimation, which achieves impressive performance across a variety of indoor and outdoor scenes.
Real World Results
We use the WHEELTEC R550 as the mobile platform for real-world evaluation.
Qualitative comparisons on three real-world scenes (Scene 1–3). Each scene shows the RGB input, the ground-truth depth, and predictions from Depth Anything V2+DPT, Depth Anything V2+SDT (Ours), Depth Anything V3+DPT, Depth Anything V3+DualDPT, and Depth Anything V3+SDT (Ours).
Jetson Orin Nano Latency
Inference latency comparison of SDT and DPT decoders on a Jetson Orin Nano (4GB).
| Resolution | Decoder | Latency (ms)↓ | FPS↑ |
|---|---|---|---|
| 256×256 | DPT | 305.65 | 3.3 |
| 256×256 | SDT (Ours) | 213.35 | 4.7 |
| 512×512 | DPT | 1107.64 | 0.9 |
| 512×512 | SDT (Ours) | 831.48 | 1.2 |
Jetson Orin Nano Memory
Peak GPU memory usage during inference at 256×256 resolution on Jetson Orin Nano (4GB).
| Decoder | Peak Memory (MB)↓ |
|---|---|
| DPT | 589.5 |
| SDT (Ours) | 395.2 |
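Peak-memory numbers of this kind can be read from PyTorch's CUDA allocator statistics. The sketch below shows a generic way to measure peak memory for a single forward pass; it is an illustrative procedure, not the exact script used for the table, and the model and input shapes are placeholders.

```python
import torch

@torch.no_grad()
def peak_inference_memory_mb(model: torch.nn.Module, size: int = 256,
                             device: str = "cuda") -> float:
    """Peak allocated GPU memory (MB) for a single forward pass.

    Generic measurement sketch; the table values may additionally include
    framework and runtime overhead not captured by the allocator.
    """
    model = model.eval().to(device)
    x = torch.randn(1, 3, size, size, device=device)
    torch.cuda.reset_peak_memory_stats(device)   # clear any previous peak
    model(x)
    torch.cuda.synchronize(device)               # make sure the pass finished
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```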
Method
Overview
AnyDepth architecture overview. The input image is encoded into tokens by a frozen DINOv3 backbone and decoded by our lightweight SDT decoder. The tokens undergo only a single projection and weighted fusion, and the Spatial Detail Enhancer (SDE) module recovers finer-grained detail in the prediction. The resulting feature map is upsampled by DySample, an efficient learnable upsampler, and the prediction head outputs the final depth map.
Simple Depth Transformer (SDT)
Our decoder adopts a simple single-path fusion and reconstruction strategy that exploits the high-resolution features of DINOv3 and further unleashes its performance at high resolution. We first project the tokens extracted from the encoder into a 256-dimensional space using a linear layer followed by a GELU non-linearity, which preserves sufficient information while substantially reducing the computational overhead of the subsequent decoding stages. The class token is handled as in DPT: it is concatenated with the spatial tokens and fused through a learnable projection, as sketched below.
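A minimal PyTorch sketch of this projection step is shown below. It follows the description above (a single Linear + GELU projection to 256 dimensions, with DPT-style class-token concatenation), but the module name, default dimensions, and exact fusion order are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TokenProjection(nn.Module):
    """Fuse the class token into the spatial tokens and project to a small width.

    Illustrative sketch of the projection described in the text, not the
    official SDT code; encoder_dim/decoder_dim defaults are assumptions.
    """
    def __init__(self, encoder_dim: int = 1024, decoder_dim: int = 256):
        super().__init__()
        # Concatenate every spatial token with the class token (DPT-style readout),
        # then reduce to decoder_dim with a single Linear + GELU.
        self.proj = nn.Sequential(nn.Linear(2 * encoder_dim, decoder_dim), nn.GELU())

    def forward(self, spatial: torch.Tensor, cls: torch.Tensor) -> torch.Tensor:
        # spatial: (B, N, encoder_dim); cls: (B, encoder_dim)
        cls = cls.unsqueeze(1).expand(-1, spatial.shape[1], -1)
        return self.proj(torch.cat([spatial, cls], dim=-1))  # (B, N, decoder_dim)
```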
SDT vs. DPT
A key difference between SDT and DPT is the order of feature reassembly. DPT employs a reassemble-fusion strategy: it first applies a reassemble module to the tokens extracted from each tapped Transformer layer, mapping them to feature maps at different scales, and then fuses these maps in a cascade across scales, which inevitably introduces multiple branches and repeated cross-scale alignment overhead. In contrast, SDT employs a fusion-reassemble strategy: it directly projects and fuses groups of tokens, and only then performs spatial reassembly and upsampling along a single path. This ordering avoids the high cost of per-layer token reassembly and cross-scale feature-map alignment, making the decoder more efficient and stable, especially on high-resolution inputs. The sketch below contrasts the two dataflows.
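The pseudocode-level sketch below makes the ordering concrete. The `reassemble`, `fuse_block`, `proj`, and `fuse_weights` objects are placeholders standing in for the real modules; only the order of operations is meant to match the description above.

```python
import torch

def dpt_style(layer_tokens, reassemble, fuse_block):
    """Reassemble-fusion (DPT): per-layer reassembly into multi-scale maps,
    then a cascade of cross-scale fusion blocks. Placeholder modules."""
    feats = [reassemble[i](t) for i, t in enumerate(layer_tokens)]  # one map per tapped layer
    x = feats[-1]
    for f in reversed(feats[:-1]):
        x = fuse_block(x, f)              # repeated cross-scale alignment + merge
    return x

def sdt_style(layer_tokens, proj, fuse_weights, h, w):
    """Fusion-reassemble (SDT): project and fuse tokens first, then a single
    spatial reassembly along one path."""
    toks = torch.stack([proj(t) for t in layer_tokens])              # (L, B, N, C)
    fused = (fuse_weights.softmax(0)[:, None, None, None] * toks).sum(0)
    B, N, C = fused.shape
    return fused.transpose(1, 2).reshape(B, C, h, w)                 # single feature map
```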
Efficiency
(a) Model Comparison
(b) FLOPs Comparison
(c) Inference Time Comparison
Comparison of the number of parameters (left), computational complexity (middle), and inference time (right) of AnyDepth and DPT across different model sizes and input resolutions. Our method significantly reduces the number of parameters and the computational cost while maintaining competitive accuracy, and it consistently achieves lower latency, especially at higher resolutions.
Visualization
Qualitative results of zero-shot monocular depth estimation with AnyDepth (ViT-B), compared against DPT-B.
Image to 3D Point Cloud
Point cloud quality comparison. Point clouds generated using SDT have more regular geometry and lower noise compared to those generated by other methods.
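The point clouds are obtained by back-projecting each pixel of the predicted depth map through the camera intrinsics. Below is a generic pinhole unprojection sketch; the intrinsics `fx, fy, cx, cy` and any filtering applied for the figures are placeholders, not values from this project.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map into an (M, 3) point cloud in the
    camera frame using the pinhole model. Generic sketch, not the exact
    pipeline used to produce the comparison figures."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels
```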
Dataset Quality
(a) Total Score
(b) Depth Distribution Score
(c) Gradient Continuity Score
Dataset quality across the Total Score, Depth Distribution Score, and Gradient Continuity Score (higher is better).
Experimental Results
Quantitative Comparison of Zero-shot Affine-invariant Depth Estimation
Quantitative comparison of zero-shot affine-invariant depth estimation. Lower AbsRel and higher δ1 values are better. DINOv3 uses the ViT-7B encoder, and Depth Anything V2 (DAv2) is trained on 62.6M images. For a fair comparison, the baseline (DPT) uses a frozen DINOv3 encoder with a DPT head, while our method replaces the DPT head with the proposed SDT. Bold numbers mark the better result between DPT and AnyDepth. A sketch of the alignment and metric computation follows the table.
| Method | Training Data↓ | Encoder | #Params (M)↓ | NYUv2 AbsRel↓ | NYUv2 δ1↑ | KITTI AbsRel↓ | KITTI δ1↑ | ETH3D AbsRel↓ | ETH3D δ1↑ | ScanNet AbsRel↓ | ScanNet δ1↑ | DIODE AbsRel↓ | DIODE δ1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv3 | 595K | ViT-7B | 91.19 | 4.3 | 98.0 | 7.3 | 96.7 | 5.4 | 97.5 | 4.4 | 98.1 | 25.6 | 82.2 |
| DAv2 | 62.6M | ViT-S | 71.8 | 5.3 | 97.3 | 7.8 | 93.6 | 14.2 | 85.1 | -- | -- | 7.3 | 94.2 |
| DAv2 | 62.6M | ViT-B | 162.1 | 4.9 | 97.6 | 7.8 | 93.9 | 13.7 | 85.8 | -- | -- | 6.8 | 95.0 |
| DAv2 | 62.6M | ViT-L | 399.6 | 4.5 | 97.9 | 7.4 | 94.6 | 13.1 | 86.5 | -- | -- | 6.6 | 95.2 |
| DPT | 584K | ViT-S | 71.8 | 8.4 | 93.3 | 10.8 | 89.1 | 12.7 | 92.0 | 8.3 | 93.5 | 26.0 | 71.4 |
| DPT | 584K | ViT-B | 162.1 | 7.5 | 95.1 | 10.8 | 88.9 | 10.0 | 92.9 | 7.1 | 95.3 | 24.5 | 73.4 |
| DPT | 584K | ViT-L | 399.6 | 6.1 | 96.8 | 8.9 | 92.5 | 13.0 | 94.9 | 6.0 | 97.0 | 23.4 | 73.9 |
| AnyDepth | 369K | ViT-S | 26.5 | 8.2 | 93.2 | 10.2 | 88.3 | 8.4 | 93.5 | 8.0 | 93.6 | 24.7 | 71.4 |
| AnyDepth | 369K | ViT-B | 95.5 | 7.2 | 95.0 | 9.7 | 90.1 | 8.0 | 94.5 | 6.8 | 95.6 | 23.6 | 72.7 |
| AnyDepth | 369K | ViT-L | 313.4 | 6.0 | 96.8 | 8.6 | 92.6 | 9.6 | 95.4 | 5.4 | 97.4 | 22.6 | 73.6 |
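For reference, the AbsRel and δ1 numbers above follow the standard affine-invariant protocol: the prediction is first aligned to the ground truth with a least-squares scale and shift, and the metrics are then computed over valid pixels. The sketch below illustrates that protocol under simplified assumptions (no per-dataset depth caps or disparity-space alignment); it is not the exact evaluation code.

```python
import numpy as np

def affine_invariant_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray):
    """Align prediction to ground truth with least-squares scale/shift,
    then compute AbsRel and delta_1 (threshold 1.25) over valid pixels.
    Simplified sketch of the standard protocol."""
    p, g = pred[valid], gt[valid]
    # Closed-form least squares for min_{s,t} || s*p + t - g ||^2.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    p_aligned = s * p + t
    abs_rel = np.mean(np.abs(p_aligned - g) / g)
    delta1 = np.mean(np.maximum(p_aligned / g, g / p_aligned) < 1.25)
    return abs_rel, delta1
```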
Quantitative Comparison of Zero-shot Affine-invariant Depth (Encoders & Decoders)
Quantitative comparison of zero-shot affine-invariant depth estimation with different encoders and decoders. Lower AbsRel and higher δ1 values are better. We use ViT-B as the encoder for DAv2, ViT-L for DAv3, and VGGT-1B for VGGT. Encoders use pre-trained weights, and decoders are randomly initialized. Bold numbers denote the better result.
| Method | Encoder | Decoder | NYUv2 AbsRel↓ | NYUv2 δ1↑ | KITTI AbsRel↓ | KITTI δ1↑ | ETH3D AbsRel↓ | ETH3D δ1↑ | ScanNet AbsRel↓ | ScanNet δ1↑ | DIODE AbsRel↓ | DIODE δ1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAv2 | ViT-B | DPT | 5.8 | 96.2 | 10.4 | 89.1 | 8.8 | 94.6 | 6.2 | 95.3 | 23.4 | 73.8 |
| DAv2 | ViT-B | SDT | 5.6 | 96.4 | 10.7 | 89.6 | 7.5 | 95.8 | 6.1 | 95.4 | 23.9 | 73.9 |
| DAv3 | ViT-L | DPT | 4.9 | 96.9 | 8.8 | 92.4 | 6.9 | 95.9 | 5.0 | 96.6 | 22.5 | 74.6 |
| DAv3 | ViT-L | Dual-DPT | 4.9 | 97.0 | 8.9 | 92.4 | 7.0 | 95.8 | 4.9 | 96.6 | 22.3 | 74.6 |
| DAv3 | ViT-L | SDT | 4.9 | 97.1 | 8.9 | 92.4 | 5.8 | 96.6 | 5.0 | 96.6 | 21.9 | 74.9 |
| VGGT | VGGT-1B | DPT | 4.8 | 97.7 | 15.6 | 77.9 | 7.2 | 94.7 | 4.6 | 97.6 | 30.7 | 76.2 |
| VGGT | VGGT-1B | SDT | 4.8 | 98.0 | 15.5 | 80.1 | 7.0 | 95.1 | 4.6 | 98.0 | 30.6 | 76.8 |
Multi-resolution Efficiency
Multi-resolution efficiency comparison of SDT and DPT heads under a ViT-L encoder. Latency is averaged over 1000 runs on an NVIDIA H100 GPU. Lower is better.
| Resolution | Decoder | FLOPs (G)↓ | Latency (ms)↓ |
|---|---|---|---|
| 256×256 | DPT | 444.14 | 6.66 ± 0.22 |
| 256×256 | SDT (Ours) | 234.17 | 6.10 ± 0.33 |
| 512×512 | DPT | 1776.56 | 24.65 ± 0.22 |
| 512×512 | SDT (Ours) | 936.70 | 23.17 ± 0.54 |
| 1024×1024 | DPT | 7106.22 | 99.79 ± 0.79 |
| 1024×1024 | SDT (Ours) | 3746.79 | 93.09 ± 0.51 |
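Latency figures like these are obtained by averaging many synchronized forward passes after a warm-up phase. The snippet below is a generic measurement sketch (the model and input shapes are placeholders), not the exact benchmarking script used for the table.

```python
import time
import torch

@torch.no_grad()
def benchmark_latency_ms(model: torch.nn.Module, size: int = 512,
                         runs: int = 1000, warmup: int = 50,
                         device: str = "cuda"):
    """Mean and std of GPU inference latency (ms) with explicit synchronization.

    Generic sketch of the averaging procedure; clocks, power mode, and
    batching details can shift absolute numbers on any given device.
    """
    model = model.eval().to(device)
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):                    # warm-up: stabilize caches/clocks
        model(x)
    torch.cuda.synchronize(device)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize(device)         # wait for kernels to finish
        times.append(time.perf_counter() - start)
    t = torch.tensor(times) * 1e3              # seconds -> milliseconds
    return t.mean().item(), t.std().item()
```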
Decoder Parameters
Decoder parameter comparison across different ViT backbones. Lower is better.
| Decoder | ViT Backbone | Params (M)↓ |
|---|---|---|
| DPT | ViT-S | 50.83 |
| DPT | ViT-B | 76.05 |
| DPT | ViT-L | 99.58 |
| SDT | ViT-S | 5.51 |
| SDT | ViT-B | 9.45 |
| SDT | ViT-L | 13.38 |