Weikai Huang♥1,2
Jieyu Zhang♥1,2
Sijun Li2
Taoyang Jia2
Jiafei Duan1,2
Yunqian Cheng1
Jaemin Cho1,2
Matthew Wallingford1
Rustin Soraki1,2
Chris Dongjoo Kim1
Shuo Liu1,2
Donovan Clay1,2
Taira Anderson1
Winson Han1
Ali Farhadi1,2
Bharath Hariharan3
Zhongzheng Ren♥1,2,4
Ranjay Krishna♥1,2
♥ core contributors 1Allen Institute for AI 2University of Washington 3Cornell University 4UNC-Chapel Hill
- **HuggingFace Interactive Demo**: interactive web demo with text, point, and box prompts (Live Demo | Run Locally)
- **iPhone App**: real-time on-device 3D detection (App Store | Video | README)
- **Integrate with VLM**: combine with vision-language models (README)
- **Zero-Shot Tracking**: 3D object tracking without training (README)
- **Meta Quest**: 3D detection in AR/VR (Video)
- **Robotics**: 3D perception for robotic manipulation (Video)
- Release inference code
- Release WildDet3D-Bench evaluation
- Release evaluation on other benchmarks (Omni3D, Argoverse2, ScanNet)
- Release training code
| Model | Backbone | Depth Backend | Params | Download |
|---|---|---|---|---|
| WildDet3D | SAM3 ViT | LingBot-Depth (DINOv2 ViT-L/14) | ~1.2B | allenai/WildDet3D |
```bash
# Download checkpoint
pip install huggingface_hub
huggingface-cli download allenai/WildDet3D wilddet3d_alldata_all_prompt_v1.0.pt --local-dir ckpt/
```

```bash
git clone --recurse-submodules https://github.com/allenai/WildDet3D.git
cd WildDet3D
conda create -n wilddet3d python=3.11 -y
conda activate wilddet3d

# Install all dependencies
pip install -r requirements.txt
```

```python
from wilddet3d import build_model, preprocess
from wilddet3d.vis.visualize import draw_3d_boxes
import numpy as np
from PIL import Image

# Build model
model = build_model(
    checkpoint="ckpt/wilddet3d_alldata_all_prompt_v1.0.pt",
    score_threshold=0.3,
    skip_pretrained=True,
)

# Load and preprocess image
image = np.array(Image.open("image.jpg")).astype(np.float32)

# With known camera intrinsics
intrinsics = np.load("intrinsics.npy")  # (3, 3)
data = preprocess(image, intrinsics)

# Without intrinsics (uses default: focal=max(H,W), principal point at center)
# data = preprocess(image)

# Text prompt: detect all instances of given categories
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_texts=["car", "person", "bicycle"],
)
boxes, boxes3d, scores, scores_2d, scores_3d, class_ids, depth_maps = results

# Box prompt (geometric): lift a 2D box to 3D (one-to-one)
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_boxes=[[100, 200, 300, 400]],  # pixel xyxy
    prompt_text="geometric",
)

# Exemplar prompt: use a 2D box as visual exemplar, find all similar objects (one-to-many)
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_boxes=[[100, 200, 300, 400]],
    prompt_text="visual",
)

# Point prompt
results = model(
    images=data["images"].cuda(),
    intrinsics=data["intrinsics"].cuda()[None],
    input_hw=[data["input_hw"]],
    original_hw=[data["original_hw"]],
    padding=[data["padding"]],
    input_points=[[(150, 250, 1), (200, 300, 0)]],  # (x, y, label): 1=positive, 0=negative
    prompt_text="geometric",
)

# Visualize results
boxes, boxes3d, scores, scores_2d, scores_3d, class_ids, depth_maps = results
draw_3d_boxes(
    image=image.astype(np.uint8),
    boxes3d=boxes3d[0],
    intrinsics=intrinsics,
    scores_2d=scores_2d[0],
    scores_3d=scores_3d[0],
    class_ids=class_ids[0],
    class_names=["car", "person", "bicycle"],
    save_path="output.png",
)
```

Notes:
- If `intrinsics` is not provided, a default intrinsic matrix is used (focal=max(H,W), principal point at image center).
- Optional depth input: pass `depth_gt=depth_tensor` (shape `(B, 1, H, W)`, in meters) for improved 3D localization with sparse or dense depth (e.g., LiDAR).
See docs/INFERENCE.md for the full API reference.
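For reference, the default intrinsic matrix described in the notes above (focal length set to max(H, W), principal point at the image center) can be constructed as in the sketch below. The helper name is ours for illustration; `preprocess` builds this internally and its exact defaults may differ in detail.

```python
import numpy as np

def default_intrinsics(height: int, width: int) -> np.ndarray:
    """Pinhole intrinsic matrix with focal = max(H, W) and principal
    point at the image center, mirroring the default described above."""
    f = float(max(height, width))
    return np.array([
        [f,   0.0, width / 2.0],
        [0.0, f,   height / 2.0],
        [0.0, 0.0, 1.0],
    ])

K = default_intrinsics(480, 640)  # for a 640x480 image, f = 640
```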
Download the evaluation data from allenai/WildDet3D-Data. Evaluate using the vis4d framework:
```bash
# Text prompt
vis4d test --config configs/eval/in_the_wild/text.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Text prompt + GT depth
vis4d test --config configs/eval/in_the_wild/text_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt (oracle)
vis4d test --config configs/eval/in_the_wild/box_prompt.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt + GT depth
vis4d test --config configs/eval/in_the_wild/box_prompt_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt
```

| Mode | Config |
|---|---|
| Text | configs/eval/in_the_wild/text.py |
| Text + Depth | configs/eval/in_the_wild/text_with_depth.py |
| Box Prompt | configs/eval/in_the_wild/box_prompt.py |
| Box Prompt + Depth | configs/eval/in_the_wild/box_prompt_with_depth.py |
Download the evaluation data from allenai/WildDet3D-Stereo4D-Bench-Images, then evaluate on the 383 images with real stereo depth:

```bash
# Text prompt
vis4d test --config configs/eval/stereo4d/text.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Text prompt + GT depth
vis4d test --config configs/eval/stereo4d/text_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt (oracle)
vis4d test --config configs/eval/stereo4d/box_prompt.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt

# Box prompt + GT depth
vis4d test --config configs/eval/stereo4d/box_prompt_with_depth.py --gpus 1 --ckpt ckpt/wilddet3d_alldata_all_prompt_v1.0.pt
```

| Mode | Config |
|---|---|
| Text | configs/eval/stereo4d/text.py |
| Text + Depth | configs/eval/stereo4d/text_with_depth.py |
| Box Prompt | configs/eval/stereo4d/box_prompt.py |
| Box Prompt + Depth | configs/eval/stereo4d/box_prompt_with_depth.py |
AP is computed using center-distance matching. AP_r, AP_c, AP_f denote rare (<5), common (5-20), and frequent (>20) category splits.
| Method | Data | AP_r | AP_c | AP_f | AP |
|---|---|---|---|---|---|
| Text Prompt | |||||
| 3D-MOOD | Omni3D | 2.4 | 2.1 | 2.6 | 2.3 |
| WildDet3D | Omni3D | 9.0 | 6.5 | 5.2 | 6.8 |
| WildDet3D w/ depth | Omni3D | 23.0 | 21.5 | 16.1 | 20.7 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 28.3 | 21.6 | 18.7 | 22.6 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 47.4 | 40.7 | 37.2 | 41.6 |
| Box Prompt | |||||
| OVMono3D-LIFT | Omni3D | 7.4 | 8.8 | 5.1 | 7.7 |
| DetAny3D | Omni3D, Others | 9.9 | 7.4 | 6.3 | 7.8 |
| WildDet3D | Omni3D | 12.0 | 7.9 | 5.3 | 8.4 |
| WildDet3D w/ depth | Omni3D | 26.4 | 24.4 | 19.6 | 23.9 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 30.0 | 24.2 | 20.3 | 24.8 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 53.7 | 46.1 | 42.5 | 47.2 |
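To make the center-distance metric above concrete, here is a minimal sketch of greedy center-distance matching: predictions are taken in descending score order and matched to the nearest unmatched ground-truth center within a distance threshold. The function name, threshold value, and per-class/AP-averaging details are illustrative assumptions, not the benchmark's exact implementation.

```python
import numpy as np

def match_by_center_distance(pred_centers, pred_scores, gt_centers, threshold=2.0):
    """Greedily match predictions (highest score first) to the nearest
    unmatched GT center within `threshold` meters. Returns a boolean
    true-positive mask over predictions. Illustrative sketch only."""
    pred_centers = np.asarray(pred_centers, dtype=float)
    gt_centers = np.asarray(gt_centers, dtype=float)
    order = np.argsort(-np.asarray(pred_scores))
    matched = np.zeros(len(gt_centers), dtype=bool)
    tp = np.zeros(len(pred_centers), dtype=bool)
    for i in order:
        if len(gt_centers) == 0:
            break
        d = np.linalg.norm(gt_centers - pred_centers[i], axis=1)
        d[matched] = np.inf  # each GT box can be matched at most once
        j = int(np.argmin(d))
        if d[j] <= threshold:
            matched[j] = True
            tp[i] = True
    return tp

# One GT object; the high-score prediction matches, the far one does not
tp = match_by_center_distance(
    pred_centers=[[0.1, 0.0, 5.0], [4.0, 0.0, 20.0]],
    pred_scores=[0.9, 0.8],
    gt_centers=[[0.0, 0.0, 5.0]],
)
```

Precision/recall curves over the resulting TP mask (per class, averaged over thresholds) would then yield AP.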
AP is computed at 3D IoU [0.5:0.95].
| Method | KITTI | nuScenes | SUNRGBD | Hypersim | ARKitScenes | Objectron | AP |
|---|---|---|---|---|---|---|---|
| Text Prompt | |||||||
| Cube R-CNN | 32.6 | 30.1 | 15.3 | 7.5 | 41.7 | 50.8 | 23.3 |
| 3D-MOOD Swin-T | 32.8 | 31.5 | 21.9 | 10.5 | 51.0 | 64.3 | 28.4 |
| 3D-MOOD Swin-B | 31.4 | 35.8 | 23.8 | 9.1 | 53.9 | 67.9 | 30.0 |
| WildDet3D | 37.0 | 31.7 | 38.9 | 16.5 | 64.6 | 60.5 | 34.2 |
| WildDet3D w/ depth | 36.1 | 32.0 | 51.1 | 26.6 | 73.3 | 68.3 | 41.6 |
| Box Prompt | |||||||
| OVMono3D-LIFT | 31.4 | 32.5 | 23.2 | 11.9 | 54.2 | 63.5 | 29.6 |
| DetAny3D | 38.7 | 37.6 | 46.1 | 16.0 | 50.6 | 56.8 | 34.4 |
| WildDet3D | 44.3 | 35.3 | 43.1 | 17.3 | 66.6 | 60.8 | 36.4 |
| WildDet3D w/ depth | 42.8 | 35.9 | 58.7 | 30.4 | 76.6 | 68.5 | 45.8 |
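For intuition on the 3D IoU [0.5:0.95] metric above, the sketch below computes IoU for two axis-aligned 3D boxes given as (cx, cy, cz, w, h, l). This is a simplification we introduce for illustration: benchmark boxes are typically oriented (with yaw), which requires rotated-box overlap computation.

```python
import numpy as np

def iou3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (cx, cy, cz, w, h, l).
    Simplified illustration; oriented boxes need rotated overlap."""
    a_min = np.asarray(box_a[:3]) - np.asarray(box_a[3:]) / 2.0
    a_max = np.asarray(box_a[:3]) + np.asarray(box_a[3:]) / 2.0
    b_min = np.asarray(box_b[:3]) - np.asarray(box_b[3:]) / 2.0
    b_max = np.asarray(box_b[:3]) + np.asarray(box_b[3:]) / 2.0
    # Intersection extent along each axis, clipped at zero
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = float(np.prod(overlap))
    vol_a = float(np.prod(box_a[3:]))
    vol_b = float(np.prod(box_b[3:]))
    return inter / (vol_a + vol_b - inter)

# Two 2m cubes at 10m depth, offset by 1m along x
iou = iou3d_axis_aligned((0, 0, 10, 2, 2, 2), (1, 0, 10, 2, 2, 2))
```

AP at [0.5:0.95] averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, as in the COCO convention.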
Box prompt comparison (WildDet3D vs OVMono3D vs DetAny3D):
Text prompt comparison:
We introduce WildDet3D-Data, a large-scale in-the-wild dataset for monocular 3D detection with human-verified 3D bounding box annotations. The dataset covers images from COCO, LVIS, Objects365, and V3Det.
| Split | Description | Images | Annotations | Categories |
|---|---|---|---|---|
| Val | Validation set (human) | 2,470 | 9,256 | 785 |
| Test | Test set (human) | 2,433 | 5,596 | 633 |
| Train (Human) | Human-reviewed only | 102,979 | 229,934 | 11,879 |
| Train (Essential) | Human + VLM-qualified small objects | 102,979 | 412,711 | 12,064 |
| Train (Synthetic) | VLM auto-selected | 896,004 | 3,483,292 | 11,896 |
| Total | | 1,003,886 | 3,910,855 | 13,499 |
The dataset is hosted on HuggingFace: allenai/WildDet3D-Data. See the dataset README for download instructions and data format.
If you find this work useful, please cite:
```bibtex
@misc{huang2026wilddet3dscalingpromptable3d,
      title={WildDet3D: Scaling Promptable 3D Detection in the Wild},
      author={Weikai Huang and Jieyu Zhang and Sijun Li and Taoyang Jia and Jiafei Duan and Yunqian Cheng and Jaemin Cho and Matthew Wallingford and Rustin Soraki and Chris Dongjoo Kim and Shuo Liu and Donovan Clay and Taira Anderson and Winson Han and Ali Farhadi and Bharath Hariharan and Zhongzheng Ren and Ranjay Krishna},
      year={2026},
      eprint={2604.08626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08626},
}
```

- Omni3D -- 3D detection benchmarks and baselines
- vis4d -- Training and evaluation framework
- SAM 3 -- Segment Anything Model 3
- LingBot-Depth -- Monocular depth estimation
- 3D-MOOD -- Open-vocabulary monocular 3D detection
- DetAny3D -- Detect anything in 3D
- OVMono3D -- Open-vocabulary monocular 3D detection
- LabelAny3D -- 3D bounding box annotation tool
Codebase: This codebase incorporates code from SAM 3, and is licensed under the SAM License. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
Model: This model is based on SAM 3 and LingBot-Depth, and is licensed under the SAM License. This model is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.






