GEST: Gesture-Enabled System for Teleoperation

Table Number 271

Inspiration

State-of-the-art VLA (vision-language-action) models all rely on one thing: speech. They depend on humans giving them relatively detailed instructions before they can be of use. When humans interact with each other, however, we are often general and rely on gestures to convey messages. This is especially true for people who cannot communicate through speech and, as a result, cannot direct robots to help them in a meaningful way. We set out to teach robots to understand intent from gestures and act accordingly, pairing human hand signals with live teleoperation so the model can later learn the mapping from gesture to goal to action.

What it does

GEST is a data-collection and teleoperation pipeline that:

  • Detects three gestures in real time, open palm (stop), fist (pick up items near you), and pointing (go there), by extracting hand landmarks for each joint.
  • Computes the area of intent when the user points at an area or table.
  • Lets a teleoperator perform the appropriate actions (e.g., drive to the area of intent, pick and place, pour water) using two leader-follower arm pairs.
  • Logs each gesture class and confidence score, textual intent, video, depth, area-of-intent coordinates, and joint states from both the leader and the follower.
  • Supports Bluetooth teleoperation from anywhere in the world for operating in homes.
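As a sketch of the landmark heuristic above: MediaPipe Hands returns 21 normalized landmarks per hand, and a finger can be treated as extended when its tip lies farther from the wrist than its PIP joint. The classifier below is a minimal illustration of that idea, not our exact detection code; the rule and the landmark-only input are assumptions.

```python
import math

# MediaPipe Hands landmark indices (21 points per hand).
WRIST = 0
FINGERS = {               # finger name -> (tip index, PIP joint index)
    "index":  (8, 6),
    "middle": (12, 10),
    "ring":   (16, 14),
    "pinky":  (20, 18),
}

def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def classify_gesture(lm):
    """lm: list of 21 (x, y) normalized landmarks from MediaPipe Hands.
    A finger counts as extended when its tip is farther from the wrist
    than its PIP joint."""
    wrist = lm[WRIST]
    extended = [name for name, (tip, pip) in FINGERS.items()
                if _dist(lm[tip], wrist) > _dist(lm[pip], wrist)]
    if len(extended) == 4:
        return "open_palm"    # stop
    if not extended:
        return "fist"         # pick up items near you
    if extended == ["index"]:
        return "pointing"     # go there
    return "unknown"
```

In practice the raw frame goes through `mp.solutions.hands.Hands` first, and the detector's own confidence score gates whether a classification is logged.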

How we built it

Gesture control: Implemented MediaPipe to detect hand gestures in real time through the robot's RealSense camera.

Wireless communication: Communicated with the robot via TCP to control the robot arms.

Teleoperation/data collection pipeline: Built a teleoperation pipeline that records both leader and follower arm joint data and top-camera images.
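The TCP link can be sketched as a simple length-prefixed JSON exchange of joint positions. This is an illustrative assumption, not the XLERobot SDK's actual wire format; `send_joint_state` and `recv_joint_state` are hypothetical helper names.

```python
import json
import socket
import struct

# Hypothetical wire format: a 4-byte big-endian length prefix
# followed by a JSON payload of joint positions.

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    # TCP is a byte stream, so keep reading until the full message arrives.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def send_joint_state(sock: socket.socket, joints: list) -> None:
    payload = json.dumps({"joints": joints}).encode()
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_joint_state(sock: socket.socket) -> list:
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length))["joints"]
```

The length prefix matters because TCP does not preserve message boundaries: without it, a fast leader arm could have two joint updates arrive fused in one `recv`.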

Challenges We Ran Into

Power issues: The XLERobot is a version-one prototype that at times required power-cycling every 15 minutes, which cut into our working time.

Hardware issues: Teleoperation initially failed because motor values were not updating. We resolved this by replacing the robot arm and motor driver, and experimenting with hardware fixes.

Limited documentation: The XLERobot SDK had minimal documentation, which slowed integration and debugging.

Connectivity issues: Frequent Wi-Fi drops affected real-time teleoperation and prevented consistent model training.

Accomplishments We’re Proud Of

Gesture recognition/pipeline: Achieved consistent gesture recognition and mapped gestures to labeled episodes.

Functional teleoperation: We successfully teleoperated the arm to perform tasks like pouring water, cleaning a table, and loading objects onto its cart to carry them elsewhere.

Data collection progress: We recorded a few episodes for gesture detection and teleoperated pick-and-place experiments.

What we learned

Compute is indispensable for deploying useful real-world models. On a Raspberry Pi we were extremely limited, which was especially apparent when running the hand-gesture recognition model, so we learned to optimize our programs to be lightweight.

Have a backup plan (e.g., Ethernet or a tether) for limited Wi-Fi so that work with the robot can continue.

A consistent hardware platform is key to scalable data collection. On this platform, however, we learned to debug broken servos, camera errors, calibration errors, and power issues, and to think like engineers when solving problems.

What's next for GEST?

Next Steps:

  • Train a SmolVLA model: Run training on our collected dataset using GPU resources.
  • Collect more diverse data: Record additional episodes with object-position variations to improve generalization.
  • Expand the gesture vocabulary: Add more gestures (pointing for navigation, thumbs up for confirmation, custom task-specific gestures).
  • Refine the training pipeline: Experiment with training hyperparameters, data augmentation, and model architectures.
  • Language integration: Combine gestures with natural language for more complex instructions ("grab that cup" plus a pointing gesture).
  • Assistive robotics deployment: Deploy GEST in real assistive contexts for people with speech impairments.
  • Complex task composition: Learn primitive actions that can be composed into new tasks without retraining.

Built With

  • gesturerecognition
  • huggingface
  • lerobot
  • mediapipe
  • python
  • pytorch
  • xlerobot