Weikai Huang♥1,2
Jieyu Zhang♥1,2
Sijun Li2
Taoyang Jia2
Jiafei Duan1,2
Yunqian Cheng1
Jaemin Cho1,2
Matthew Wallingford1
Rustin Soraki1,2
Chris Dongjoo Kim1
Shuo Liu1,2
Donovan Clay1,2
Taira Anderson1
Winson Han1
Ali Farhadi1,2
Bharath Hariharan3
Zhongzheng Ren♥1,2,4
Ranjay Krishna♥1,2
♥ core contributors 1Allen Institute for AI 2University of Washington 3Cornell University 4UNC-Chapel Hill
- **HuggingFace Interactive Demo**: interactive web demo with text, point, and box prompts (Live Demo | Run Locally)
- **iPhone App**: real-time on-device 3D detection (App Store | Video | README)
- **Integrate with VLM**: combine with vision-language models (README)
- **Zero-Shot Tracking**: 3D object tracking without training (README)
- **Meta Quest**: 3D detection in AR/VR (Video)
- **Robotics**: 3D perception for robotic manipulation (Video)
- Release inference code
- Release WildDet3D-Bench evaluation
- Release evaluation on other benchmarks (Omni3D, Argoverse2, ScanNet)
- Release training code
| Model | Backbone | Depth Backend | Params | Download |
|---|---|---|---|---|
| WildDet3D | SAM3 ViT | LingBot-Depth (DINOv2 ViT-L/14) | ~1.2B | allenai/WildDet3D |
```bash
# Download checkpoint
pip install huggingface_hub
huggingface-cli download allenai/WildDet3D wilddet3d_alldata_all_prompt_v1.0.pt --local-dir ckpt/
```

```bash
git clone --recurse-submodules https://github.com/allenai/WildDet3D.git
cd WildDet3D
conda create -n wilddet3d python=3.11 -y
conda activate wilddet3d

# Install all dependencies
pip install -r requirements.txt
```

```python
from wilddet3d import build_model, preprocess
from wilddet3d.vis.visualize import draw_3d_boxes
import numpy as np
from PIL import Image

# Build model
model = build_model(
    checkpoint="ckpt/wilddet3d_alldata_all_prompt_v1.0.pt",
    score_threshold=0.3,
    skip_pretrained=True,
)

# Load and preprocess image
image = np.array(Image.open("image.jpg")).astype(np.float32)

# With known camera intrinsics
intrinsics = np.load("intrinsics.npy")  # (3, 3)
data = preprocess(image, intrinsics)

# Without intrinsics (uses default: focal=max(H,W), principal point at center)
# data = preprocess(image)

# Text prompt: detect all instances of given categories
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_texts=["car", "person", "bicycle"],
)
boxes, boxes3d, scores, scores_2d, scores_3d, class_ids, depth_maps = results

# Box prompt (geometric): lift a 2D box to 3D (one-to-one)
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_boxes=[[100, 200, 300, 400]],  # pixel xyxy
    prompt_text="geometric",
)

# Exemplar prompt: use a 2D box as visual exemplar, find all similar objects (one-to-many)
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_boxes=[[100, 200, 300, 400]],
    prompt_text="visual",
)

# Point prompt
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_points=[[(150, 250, 1), (200, 300, 0)]],  # (x, y, label): 1=positive, 0=negative
    prompt_text="geometric",
)

# Visualize results
boxes, boxes3d, scores, scores_2d, scores_3d, class_ids, depth_maps = results
draw_3d_boxes(
    image=image.astype(np.uint8),
    boxes3d=boxes3d[0],
    intrinsics=intrinsics,
    scores_2d=scores_2d[0],
    scores_3d=scores_3d[0],
    class_ids=class_ids[0],
    class_names=["car", "person", "bicycle"],
    save_path="output.png",
)
```

Notes:
- If `intrinsics` is not provided, a default intrinsic matrix is used (focal=max(H,W), principal point at image center).
- Optional depth input: pass `depth_gt=depth_tensor` (shape `(B, 1, H, W)`, in meters) for improved 3D localization with sparse or dense depth (e.g., LiDAR).
See docs/INFERENCE.md for the full API reference.
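For reference, the default intrinsic matrix described in the notes above (focal length set to max(H, W), principal point at the image center) can be constructed as in the sketch below. The helper name is ours for illustration; `preprocess` builds this internally and its exact defaults may differ in detail.

```python
import numpy as np

def default_intrinsics(height: int, width: int) -> np.ndarray:
    """Pinhole intrinsic matrix with focal = max(H, W) and principal
    point at the image center, mirroring the default described above."""
    f = float(max(height, width))
    return np.array([
        [f,   0.0, width / 2.0],
        [0.0, f,   height / 2.0],
        [0.0, 0.0, 1.0],
    ])

K = default_intrinsics(480, 640)  # for a 640x480 image, f = 640
```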
Download the evaluation data from allenai/WildDet3D-Data. Evaluate using the vis4d framework:
```bash
# Text prompt
vis4d test --config configs/eval/in_the_wild/text.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Text prompt + GT depth
vis4d test --config configs/eval/in_the_wild/text_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt (oracle)
vis4d test --config configs/eval/in_the_wild/box_prompt.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt + GT depth
vis4d test --config configs/eval/in_the_wild/box_prompt_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt
```

| Mode | Config |
|---|---|
| Text | configs/eval/in_the_wild/text.py |
| Text + Depth | configs/eval/in_the_wild/text_with_depth.py |
| Box Prompt | configs/eval/in_the_wild/box_prompt.py |
| Box Prompt + Depth | configs/eval/in_the_wild/box_prompt_with_depth.py |
Download the evaluation data from allenai/WildDet3D-Stereo4D-Bench-Images, then evaluate on the 383 images with real stereo depth:

```bash
# Text prompt
vis4d test --config configs/eval/stereo4d/text.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Text prompt + GT depth
vis4d test --config configs/eval/stereo4d/text_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt (oracle)
vis4d test --config configs/eval/stereo4d/box_prompt.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt + GT depth
vis4d test --config configs/eval/stereo4d/box_prompt_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt
```

| Mode | Config |
|---|---|
| Text | configs/eval/stereo4d/text.py |
| Text + Depth | configs/eval/stereo4d/text_with_depth.py |
| Box Prompt | configs/eval/stereo4d/box_prompt.py |
| Box Prompt + Depth | configs/eval/stereo4d/box_prompt_with_depth.py |
AP is computed using center-distance matching. AP_r, AP_c, AP_f denote rare (<5), common (5-20), and frequent (>20) category splits.
| Method | Data | AP_r | AP_c | AP_f | AP |
|---|---|---|---|---|---|
| Text Prompt | |||||
| 3D-MOOD | Omni3D | 2.4 | 2.1 | 2.6 | 2.3 |
| WildDet3D | Omni3D | 9.0 | 6.5 | 5.2 | 6.8 |
| WildDet3D w/ depth | Omni3D | 23.0 | 21.5 | 16.1 | 20.7 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 28.3 | 21.6 | 18.7 | 22.6 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 47.4 | 40.7 | 37.2 | 41.6 |
| Box Prompt | |||||
| OVMono3D-LIFT | Omni3D | 7.4 | 8.8 | 5.1 | 7.7 |
| DetAny3D | Omni3D, Others | 9.9 | 7.4 | 6.3 | 7.8 |
| WildDet3D | Omni3D | 12.0 | 7.9 | 5.3 | 8.4 |
| WildDet3D w/ depth | Omni3D | 26.4 | 24.4 | 19.6 | 23.9 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 30.0 | 24.2 | 20.3 | 24.8 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 53.7 | 46.1 | 42.5 | 47.2 |
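To make the center-distance metric above concrete, here is a minimal sketch of greedy center-distance matching: predictions are taken in descending score order and matched to the nearest unmatched ground-truth center within a distance threshold. The function name, threshold value, and per-class/AP-averaging details are illustrative assumptions, not the benchmark's exact implementation.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, threshold=2.0):
    """Greedily match predictions (highest score first) to the nearest
    unmatched GT center within `threshold` meters. Returns a boolean
    true-positive mask over predictions. Illustrative sketch only."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    gt_centers = np.asarray(gt_centers, dtype=float)
    order = np.argsort(-np.asarray(pred_scores))
    matched = np.zeros(len(gt_centers), dtype=bool)
    tp = np.zeros(len(pred_centers), dtype=bool)
    for i in order:
        if len(gt_centers) == 0:
            break
        d = np.linalg.norm(gt_centers - pred_centers[i], axis=1)
        d[matched] = np.inf  # each GT box can be matched at most once
        j = int(np.argmin(d))
        if d[j] <= threshold:
            matched[j] = True
            tp[i] = True
    return tp

# One GT object; the high-score prediction matches, the far one does not
tp = match_by_center_distance(
    pred_centers=[[0.1, 0.0, 5.0], [4.0, 0.0, 20.0]],
    pred_scores=[0.9, 0.8],
    gt_centers=[[0.0, 0.0, 5.0]],
)
```

Precision/recall curves over the resulting TP mask (per class, averaged over thresholds) would then yield AP.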
AP is computed at 3D IoU [0.5:0.95].
| Method | KITTI | nuScenes | SUNRGBD | Hypersim | ARKitScenes | Objectron | AP |
|---|---|---|---|---|---|---|---|
| Text Prompt | |||||||
| Cube R-CNN | 32.6 | 30.1 | 15.3 | 7.5 | 41.7 | 50.8 | 23.3 |
| 3D-MOOD Swin-T | 32.8 | 31.5 | 21.9 | 10.5 | 51.0 | 64.3 | 28.4 |
| 3D-MOOD Swin-B | 31.4 | 35.8 | 23.8 | 9.1 | 53.9 | 67.9 | 30.0 |
| WildDet3D | 37.0 | 31.7 | 38.9 | 16.5 | 64.6 | 60.5 | 34.2 |
| WildDet3D w/ depth | 36.1 | 32.0 | 51.1 | 26.6 | 73.3 | 68.3 | 41.6 |
| Box Prompt | |||||||
| OVMono3D-LIFT | 31.4 | 32.5 | 23.2 | 11.9 | 54.2 | 63.5 | 29.6 |
| DetAny3D | 38.7 | 37.6 | 46.1 | 16.0 | 50.6 | 56.8 | 34.4 |
| WildDet3D | 44.3 | 35.3 | 43.1 | 17.3 | 66.6 | 60.8 | 36.4 |
| WildDet3D w/ depth | 42.8 | 35.9 | 58.7 | 30.4 | 76.6 | 68.5 | 45.8 |
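For intuition on the 3D IoU [0.5:0.95] metric above, the sketch below computes IoU for two axis-aligned 3D boxes given as (cx, cy, cz, w, h, l). This is a simplification we introduce for illustration: benchmark boxes are typically oriented (with yaw), which requires rotated-box overlap computation.

```python
import numpy as np

def iou3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (cx, cy, cz, w, h, l).
    Simplified illustration; oriented boxes need rotated overlap."""
    a_min = np.asarray(box_a[:3]) - np.asarray(box_a[3:]) / 2.0
    a_max = np.asarray(box_a[:3]) + np.asarray(box_a[3:]) / 2.0
    b_min = np.asarray(box_b[:3]) - np.asarray(box_b[3:]) / 2.0
    b_max = np.asarray(box_b[:3]) + np.asarray(box_b[3:]) / 2.0
    # Intersection extent along each axis, clipped at zero
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = float(np.prod(overlap))
    vol_a = float(np.prod(box_a[3:]))
    vol_b = float(np.prod(box_b[3:]))
    return inter / (vol_a + vol_b - inter)

# Two 2m cubes at 10m depth, offset by 1m along x
iou = iou3d_axis_aligned((0, 0, 10, 2, 2, 2), (1, 0, 10, 2, 2, 2))
```

AP at [0.5:0.95] averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, as in the COCO convention.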
Box prompt comparison (WildDet3D vs OVMono3D vs DetAny3D):
Text prompt comparison:
We introduce WildDet3D-Data, a large-scale in-the-wild dataset for monocular 3D detection with human-verified 3D bounding box annotations. The dataset covers images from COCO, LVIS, Objects365, and V3Det.
| Split | Description | Images | Annotations | Categories |
|---|---|---|---|---|
| Val | Validation set (human) | 2,470 | 9,256 | 785 |
| Test | Test set (human) | 2,433 | 5,596 | 633 |
| Train (Human) | Human-reviewed only | 102,979 | 229,934 | 11,879 |
| Train (Essential) | Human + VLM-qualified small objects | 102,979 | 412,711 | 12,064 |
| Train (Synthetic) | VLM auto-selected | 896,004 | 3,483,292 | 11,896 |
| Total | | 1,003,886 | 3,910,855 | 13,499 |
The dataset is hosted on HuggingFace: allenai/WildDet3D-Data. See the dataset README for download instructions and data format.
If you find this work useful, please cite:
```bibtex
@misc{huang2026wilddet3dscalingpromptable3d,
      title={WildDet3D: Scaling Promptable 3D Detection in the Wild},
      author={Weikai Huang and Jieyu Zhang and Sijun Li and Taoyang Jia and Jiafei Duan and Yunqian Cheng and Jaemin Cho and Matthew Wallingford and Rustin Soraki and Chris Dongjoo Kim and Shuo Liu and Donovan Clay and Taira Anderson and Winson Han and Ali Farhadi and Bharath Hariharan and Zhongzheng Ren and Ranjay Krishna},
      year={2026},
      eprint={2604.08626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08626},
}
```

- Omni3D -- 3D detection benchmarks and baselines
- vis4d -- Training and evaluation framework
- SAM 3 -- Segment Anything Model 3
- LingBot-Depth -- Monocular depth estimation
- 3D-MOOD -- Open-vocabulary monocular 3D detection
- DetAny3D -- Detect anything in 3D
- OVMono3D -- Open-vocabulary monocular 3D detection
- LabelAny3D -- 3D bounding box annotation tool
Codebase: This codebase incorporates code from SAM 3, and is licensed under the SAM License. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
Model: This model is based on SAM 3 and LingBot-Depth, and is licensed under the SAM License. This model is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.






