VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
CVPR 2025
This is the official repository of VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation. For more details, please check our project website.
To install VidBot, follow these steps:
- **Clone the Repository**:

  ```bash
  git clone https://github.com/HanzhiC/vidbot.git
  cd vidbot
  ```

- **Install Dependencies**:

  ```bash
  # Prepare the environment
  conda create -n vidbot python=3.10.9
  conda activate vidbot

  # Ensure PyTorch 1.13.1 is installed; pytorch-lightning might change the PyTorch version
  pip install pytorch-lightning==1.8.6
  pip install -r requirements.txt

  # Install PyTorch Scatter
  wget https://data.pyg.org/whl/torch-1.13.0%2Bcu117/torch_scatter-2.1.1%2Bpt113cu117-cp310-cp310-linux_x86_64.whl
  pip install torch_scatter-2.1.1+pt113cu117-cp310-cp310-linux_x86_64.whl
  rm -rf torch_scatter-2.1.1+pt113cu117-cp310-cp310-linux_x86_64.whl
  ```

  A quick environment sanity check is sketched after this list.
- **Download Pretrained Weights and Demo Dataset**:

  ```bash
  sh scripts/download_ckpt_testdata.sh
  ```

  You can now try out VidBot with the demo data we've provided!
- **(Optional) Install Third-Party Modules**:

  ```bash
  sh scripts/prepare_third_party_modules.sh
  ```

  Follow the installation instructions from GroundingDINO, EfficientSAM, GraspNet, and GraspNetAPI to set up these third-party modules.

  **Note:** The `transformers` library should be version 4.26.1; installing GroundingDINO might change this version. Installing `MinkowskiEngine` for GraspNet can be painful. However, our framework can still function without GraspNet; in that case, it falls back to a simplified method to obtain the grasp poses.
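After these steps, a quick sanity check along the following lines can help catch a package that silently changed the PyTorch or `transformers` version. This is only a sketch: it assumes `transformers` was pulled in via `requirements.txt`, and the version pins simply mirror the install steps above.

```python
# Minimal environment sanity check (our sketch, not part of the repository).
import pytorch_lightning as pl
import torch
import torch_scatter  # noqa: F401 -- fails if the wheel does not match your torch/CUDA build
import transformers

# pytorch-lightning or GroundingDINO can silently swap these versions out.
assert torch.__version__.startswith("1.13.1"), torch.__version__
assert pl.__version__ == "1.8.6", pl.__version__
assert transformers.__version__ == "4.26.1", transformers.__version__

print("CUDA available:", torch.cuda.is_available())
print("Environment looks consistent.")
```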
- **Quick Start with VidBot**: To quickly explore VidBot, you don't need to install any third-party modules. After downloading the weights and demo dataset, you can use our pre-saved bounding boxes to run the inference scripts with the following command:

  ```bash
  bash scripts/test_demo.sh
  ```
- **Testing VidBot with Your Own Data**: To test VidBot using your own data, place your collected dataset under the `./datasets/` folder (a helper sketch for preparing this layout follows after this list). Please ensure your data is organized to match the structure of our demo dataset:

  ```
  YOUR_DATASET_NAME/
  ├── camera_intrinsic.json
  ├── color
  │   ├── 000000.png
  │   ├── 000001.png
  │   ├── 00000X.png
  ├── depth
  │   ├── 000000.png
  │   ├── 000001.png
  │   ├── 00000X.png
  ```

  The `camera_intrinsic.json` file should be structured as follows:

  ```
  {
    "width": width,
    "height": height,
    "intrinsic_matrix": [
      fx, 0, 0,
      0, fy, 0,
      cx, cy, 1
    ]
  }
  ```

  We recommend using an image resolution of 1280x720.
- **Run the Inference Script**: To run tests with your own data, execute the following command, ensuring you understand the meaning of each input argument:

  ```bash
  python demos/infer_affordance.py \
    --config ./config/test_config.yaml \
    --dataset YOUR_DATASET_NAME \
    --frame FRAME_ID \
    --instruction YOUR_INSTRUCTION \
    --object OBJECT_CLASS \
    --visualize
  ```

  If you have installed GraspNet and wish to estimate the gripper pose, add the `--use_graspnet` option to the command. A small wrapper for running several frames in a row is sketched after this list.
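For preparing your own dataset in the layout above, a helper along these lines may be handy. It is a sketch, not part of the VidBot codebase: the script name, the placeholder intrinsics, and the assumption that depth is stored as 16-bit PNG (e.g. in millimeters) are ours, so verify them against the demo dataset and your camera driver.

```python
# make_dataset.py -- hypothetical helper (not part of the VidBot codebase) that
# creates the folder layout and camera_intrinsic.json described above.
import json
from pathlib import Path

import cv2          # only used to write the PNG frames
import numpy as np


def init_dataset(name, width=1280, height=720, fx=910.0, fy=910.0, cx=640.0, cy=360.0):
    """Create datasets/<name>/{color,depth} and write camera_intrinsic.json."""
    root = Path("datasets") / name
    (root / "color").mkdir(parents=True, exist_ok=True)
    (root / "depth").mkdir(parents=True, exist_ok=True)
    intrinsics = {
        "width": width,
        "height": height,
        # Flattened 3x3 matrix in the order shown above: [fx, 0, 0, 0, fy, 0, cx, cy, 1]
        "intrinsic_matrix": [fx, 0, 0, 0, fy, 0, cx, cy, 1],
    }
    (root / "camera_intrinsic.json").write_text(json.dumps(intrinsics, indent=2))
    return root


def save_frame(root, frame_id, color_bgr, depth):
    """Save one RGB-D frame with zero-padded file names (000000.png, 000001.png, ...)."""
    name = f"{frame_id:06d}.png"
    cv2.imwrite(str(root / "color" / name), color_bgr)
    # Assumed: depth as 16-bit PNG, matching the demo data; double-check the units.
    cv2.imwrite(str(root / "depth" / name), depth.astype(np.uint16))


if __name__ == "__main__":
    root = init_dataset("YOUR_DATASET_NAME")
    # Dummy frames for illustration; replace with images from your RGB-D camera.
    save_frame(root, 0, np.zeros((720, 1280, 3), np.uint8), np.zeros((720, 1280), np.uint16))
```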
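To process several frames of one dataset in a row, a thin wrapper around the documented command can save typing. Again a sketch: the instruction, object class, and frame ids below are placeholders, while the flags are exactly those listed above.

```python
# run_batch.py -- hypothetical wrapper (not part of the VidBot codebase) that
# reuses the documented command line for several frames of one dataset.
import subprocess

DATASET = "YOUR_DATASET_NAME"
INSTRUCTION = "open the drawer"   # placeholder instruction
OBJECT_CLASS = "drawer"           # placeholder object class

# Frame ids must match what demos/infer_affordance.py expects for --frame.
for frame_id in [0, 10, 20]:
    cmd = [
        "python", "demos/infer_affordance.py",
        "--config", "./config/test_config.yaml",
        "--dataset", DATASET,
        "--frame", str(frame_id),
        "--instruction", INSTRUCTION,
        "--object", OBJECT_CLASS,
        "--visualize",
        # "--use_graspnet",  # uncomment if GraspNet is installed
    ]
    subprocess.run(cmd, check=True)
```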
If you find our work useful, please cite:
```bibtex
@inproceedings{chen2025vidbot,
  author    = {Chen, Hanzhi and Sun, Boyang and Zhang, Anran and Pollefeys, Marc and Leutenegger, Stefan},
  title     = {{VidBot}: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  year      = {2025},
}
```

Our codebase is built upon TRACE. Partial code is borrowed from ConvONet, afford-motion, and rq-vae-transformer. Thanks for their great contributions!
This project is licensed under the MIT License. See LICENSE for more details.
