VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
CVPR 2025
This is the official repository of VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation. For more details, please check our project website.
To install VidBot, follow these steps:
- **Clone the Repository**:

  ```bash
  git clone https://github.com/HanzhiC/vidbot.git
  cd vidbot
  ```

- **Install Dependencies**:

  ```bash
  # Prepare the environment
  conda create -n vidbot python=3.10.9
  conda activate vidbot

  # Ensure PyTorch 1.13.1 is installed; pytorch-lightning might change the PyTorch version
  pip install pytorch-lightning==1.8.6
  pip install -r requirements.txt

  # Install PyTorch Scatter
  wget https://data.pyg.org/whl/torch-1.13.0%2Bcu117/torch_scatter-2.1.1%2Bpt113cu117-cp310-cp310-linux_x86_64.whl
  pip install torch_scatter-2.1.1+pt113cu117-cp310-cp310-linux_x86_64.whl
  rm -rf torch_scatter-2.1.1+pt113cu117-cp310-cp310-linux_x86_64.whl
  ```

  A quick environment sanity check is sketched after this list.
- **Download Pretrained Weights and Demo Dataset**:

  ```bash
  sh scripts/download_ckpt_testdata.sh
  ```

  You can now try out VidBot with the demo data we've provided!
- **(Optional) Install Third-Party Modules**:

  ```bash
  sh scripts/prepare_third_party_modules.sh
  ```

  Follow the installation instructions from GroundingDINO, EfficientSAM, GraspNet, and GraspNetAPI to set up these third-party modules.

  **Note:** The `transformers` library should be version 4.26.1; installing GroundingDINO might change this version. Installing `MinkowskiEngine` for GraspNet can be painful. However, our framework can still function without GraspNet; in that case, it falls back to a simplified method to obtain the grasp poses.
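After these steps, a quick sanity check along the following lines can help catch a package that silently changed the PyTorch or `transformers` version. This is only a sketch: it assumes `transformers` was pulled in via `requirements.txt`, and the version pins simply mirror the install steps above.

```python
# Minimal environment sanity check (our sketch, not part of the repository).
import pytorch_lightning as pl
import torch
import torch_scatter  # noqa: F401 -- fails if the wheel does not match your torch/CUDA build
import transformers

# pytorch-lightning or GroundingDINO can silently swap these versions out.
assert torch.__version__.startswith("1.13.1"), torch.__version__
assert pl.__version__ == "1.8.6", pl.__version__
assert transformers.__version__ == "4.26.1", transformers.__version__

print("CUDA available:", torch.cuda.is_available())
print("Environment looks consistent.")
```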
- **Quick Start with VidBot**: To quickly explore VidBot, you don't need to install any third-party modules. After downloading the weights and demo dataset, you can use our pre-saved bounding boxes to run the inference scripts with the following command:

  ```bash
  bash scripts/test_demo.sh
  ```
- **Testing VidBot with Your Own Data**: To test VidBot using your own data, place your collected dataset under the `./datasets/` folder (a helper sketch for preparing this layout follows after this list). Please ensure your data is organized to match the structure of our demo dataset:

  ```
  YOUR_DATASET_NAME/
  ├── camera_intrinsic.json
  ├── color
  │   ├── 000000.png
  │   ├── 000001.png
  │   ├── 00000X.png
  ├── depth
  │   ├── 000000.png
  │   ├── 000001.png
  │   ├── 00000X.png
  ```

  The `camera_intrinsic.json` file should be structured as follows:

  ```
  {
    "width": width,
    "height": height,
    "intrinsic_matrix": [
      fx, 0, 0,
      0, fy, 0,
      cx, cy, 1
    ]
  }
  ```

  We recommend using an image resolution of 1280x720.
- **Run the Inference Script**: To run tests with your own data, execute the following command, ensuring you understand the meaning of each input argument:

  ```bash
  python demos/infer_affordance.py \
    --config ./config/test_config.yaml \
    --dataset YOUR_DATASET_NAME \
    --frame FRAME_ID \
    --instruction YOUR_INSTRUCTION \
    --object OBJECT_CLASS \
    --visualize
  ```

  If you have installed GraspNet and wish to estimate the gripper pose, add the `--use_graspnet` option to the command. A small wrapper for running several frames in a row is sketched after this list.
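For preparing your own dataset in the layout above, a helper along these lines may be handy. It is a sketch, not part of the VidBot codebase: the script name, the placeholder intrinsics, and the assumption that depth is stored as 16-bit PNG (e.g. in millimeters) are ours, so verify them against the demo dataset and your camera driver.

```python
# make_dataset.py -- hypothetical helper (not part of the VidBot codebase) that
# creates the folder layout and camera_intrinsic.json described above.
import json
from pathlib import Path

import cv2          # only used to write the PNG frames
import numpy as np


def init_dataset(name, width=1280, height=720, fx=910.0, fy=910.0, cx=640.0, cy=360.0):
    """Create datasets/<name>/{color,depth} and write camera_intrinsic.json."""
    root = Path("datasets") / name
    (root / "color").mkdir(parents=True, exist_ok=True)
    (root / "depth").mkdir(parents=True, exist_ok=True)
    intrinsics = {
        "width": width,
        "height": height,
        # Flattened 3x3 matrix in the order shown above: [fx, 0, 0, 0, fy, 0, cx, cy, 1]
        "intrinsic_matrix": [fx, 0, 0, 0, fy, 0, cx, cy, 1],
    }
    (root / "camera_intrinsic.json").write_text(json.dumps(intrinsics, indent=2))
    return root


def save_frame(root, frame_id, color_bgr, depth):
    """Save one RGB-D frame with zero-padded file names (000000.png, 000001.png, ...)."""
    name = f"{frame_id:06d}.png"
    cv2.imwrite(str(root / "color" / name), color_bgr)
    # Assumed: depth as 16-bit PNG, matching the demo data; double-check the units.
    cv2.imwrite(str(root / "depth" / name), depth.astype(np.uint16))


if __name__ == "__main__":
    root = init_dataset("YOUR_DATASET_NAME")
    # Dummy frames for illustration; replace with images from your RGB-D camera.
    save_frame(root, 0, np.zeros((720, 1280, 3), np.uint8), np.zeros((720, 1280), np.uint16))
```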
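To process several frames of one dataset in a row, a thin wrapper around the documented command can save typing. Again a sketch: the instruction, object class, and frame ids below are placeholders, while the flags are exactly those listed above.

```python
# run_batch.py -- hypothetical wrapper (not part of the VidBot codebase) that
# reuses the documented command line for several frames of one dataset.
import subprocess

DATASET = "YOUR_DATASET_NAME"
INSTRUCTION = "open the drawer"   # placeholder instruction
OBJECT_CLASS = "drawer"           # placeholder object class

# Frame ids must match what demos/infer_affordance.py expects for --frame.
for frame_id in [0, 10, 20]:
    cmd = [
        "python", "demos/infer_affordance.py",
        "--config", "./config/test_config.yaml",
        "--dataset", DATASET,
        "--frame", str(frame_id),
        "--instruction", INSTRUCTION,
        "--object", OBJECT_CLASS,
        "--visualize",
        # "--use_graspnet",  # uncomment if GraspNet is installed
    ]
    subprocess.run(cmd, check=True)
```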
If you find our work useful, please cite:
```bibtex
@inproceedings{chen2025vidbot,
  author    = {Chen, Hanzhi and Sun, Boyang and Zhang, Anran and Pollefeys, Marc and Leutenegger, Stefan},
  title     = {{VidBot}: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  year      = {2025},
}
```

Our codebase is built upon TRACE. Partial code is borrowed from ConvONet, afford-motion, and rq-vae-transformer. Thanks for their great contributions!
This project is licensed under the MIT License. See LICENSE for more details.
