This is the official implementation of our paper: V-HOP: Visuo-Haptic 6D Object Pose Tracking, accepted by Robotics: Science and Systems (RSS) 2025.
Hongyu Li, Mingxi Jia, Tuluhan Akbulut, Yu Xiang, George Konidaris, and Srinath Sridhar.
Humans naturally integrate vision and haptics for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works demonstrate improvements in controlled environments or synthetic datasets, they often underperform vision-only approaches in real-world settings due to poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework on our dataset and the Feelsight dataset, demonstrating significant performance improvements on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we demonstrate that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that we can achieve precise manipulation tasks by incorporating our real-time object tracking result into motion plans, underscoring the advantages of visuo-haptic perception.
Our code is packaged in a Docker container, so you do not need to install any dependencies manually.
Clone the repository:

```bash
git clone https://github.com/brown-ivl/v-hop.git
cd v-hop
```

You can pull the Docker image:

```bash
docker pull lhy0807/v-hop:latest
```

Or build the Docker image yourself:

```bash
sh docker/build.sh
```

Set the dataset directory path `DATA_DIR` in `docker/run_container.sh` to the path of your dataset.

Run the Docker container:

```bash
sh docker/run.sh
```

We store our dataset in SquashFS format, so you need to mount it on your local machine. We mount it using Singularity. An alternative is to use a library such as PySquashfsImage to read the dataset directly; however, this is not officially supported by the authors.
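As a sketch, mounting the SquashFS dataset with Singularity's image-bind feature might look like the following. The file names and mount point here are placeholders (not from the released code), and the `--bind ...:image-src=/` syntax requires a recent Singularity/Apptainer release:

```bash
# Hypothetical example: expose the SquashFS dataset at /data inside the container.
# dataset.squashfs and v-hop.sif are placeholder names -- adjust to your setup.
singularity exec \
    --bind /path/to/dataset.squashfs:/data:image-src=/ \
    v-hop.sif \
    ls /data
```

If Singularity is unavailable, a root user can also mount the image directly with `mount -t squashfs -o loop`.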
We are still working on the dataset preparation. In the full dataset, sequence IDs 0-41 are used for training and IDs 42-48 for validation. We have prepared a small subset of the dataset for you to test the code; you can download it from here.
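For clarity, the train/validation split described above can be expressed as follows. This helper (`get_split`) is illustrative only and is not part of the released codebase:

```python
# Train/validation split by sequence ID, as described above.
# Sequences 0-41 are used for training, 42-48 for validation.
TRAIN_IDS = range(0, 42)
VAL_IDS = range(42, 49)

def get_split(seq_id: int) -> str:
    """Return the split ("train" or "val") that a sequence ID belongs to."""
    if seq_id in TRAIN_IDS:
        return "train"
    if seq_id in VAL_IDS:
        return "val"
    raise ValueError(f"Unknown sequence ID: {seq_id}")
```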
To preprocess the dataset, run:

```bash
sh preprocess.sh
```

To train the model, run:

```bash
python train.py
```

Download the checkpoint from here.
To evaluate on the subset, run:

```bash
python test_subset.py
```

Repository structure:

```
v-hop/
├── docker/          # Docker configuration
├── config/          # Configuration files
├── dataset/         # Dataset-related files
├── networks/        # Model files
├── FoundationPose/  # FoundationPose integration
```
This project is licensed under the Attribution-NonCommercial 4.0 International license.
If you find this work useful, please consider citing:

```bibtex
@inproceedings{li2025vhop,
  title={V-HOP: Visuo-Haptic 6D Object Pose Tracking},
  author={Li, Hongyu and Jia, Mingxi and Akbulut, Tuluhan and Xiang, Yu and Konidaris, George and Sridhar, Srinath},
  booktitle={Proceedings of Robotics: Science and Systems},
  year={2025}
}
```

Please contact Hongyu Li (hongyu@brown.edu) with any questions.
This work is supported by the National Science Foundation (NSF) under CAREER grant #2143576, grant #2346528, and the Office of Naval Research (ONR) grant #N00014-22-1-259. We thank Ying Wang, Tao Lu, Zekun Li, and Xiaoyan Cong for their valuable discussions. We thank the area chair and the reviewers for providing constructive feedback on improving the quality and clarity of our paper. This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University.
Our codebase is built on top of the following projects:
- FoundationPose: We adopt their network and pretrained model.
- dex-urdf: We adopt their collection of URDF models for generating synthetic data.
We thank the authors for their great work.
