Inspiration

Over 17 million people worldwide live with cerebral palsy, with half facing some form of speech impairment. For many, that means struggling to communicate daily needs, connect socially, or be heard. Existing assistive tools often require users to adapt to a rigid system that doesn't account for how differently cerebral palsy can affect motor control from person to person. We built Pali because we believe communication should adapt to the person, and not the other way around.

What it does

Pali lets users communicate through custom gestures that they define themselves. A user assigns a gesture (facial, hand, or both) to a phrase or sentence, and Pali voices it out loud using ElevenLabs text-to-speech. Because every user's motor control is different, users choose gestures and their definitions, making Pali accessible to a much wider range of people. Users can view, edit, and delete their registered gestures, and an interactive illustration helps them visualize how each gesture maps to a command.

How we built it

We chose our tech stack with two priorities in mind: speed and privacy.

Google MediaPipe handles 3D landmark tracking across key points on the user's face and hands, and we stream those coordinates continuously to the backend over FastAPI WebSockets. This bypasses the request/response overhead of standard browser requests for a much faster pipeline. The landmarks are smoothed with an Exponential Moving Average to reduce noise, then compared against the user's recorded neutral expression.

On the ML side, we used a Support Vector Machine to find the support vectors closest to the separating hyperplane in our 34-dimensional feature space, which lets us clearly distinguish gestures from the neutral state. We then used a Radial Basis Function kernel, whose value for a sample falls off with its distance from the training examples. Once confidence crosses 65%, Pali triggers the ElevenLabs Text-to-Speech API to voice the associated phrase. Our training is intentionally simple: users hold a gesture for 4 rounds and confirm what the algorithm picked up.

Challenges we ran into

Early on, gestures weren't being recognized reliably, so we added more landmarks to better capture the difference between a neutral expression and an intentional gesture. We also hit a noise problem: the camera picked up unintended movement, which interfered with the algorithm's accuracy. Our solution was to let users select which body parts are relevant to their gestures, so Pali focuses only on those.
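The body-part selection fix can be sketched as a masking step before classification. The region names and index ranges below are hypothetical, chosen for illustration; the real MediaPipe landmark indices and which regions Pali exposes are assumptions.

```python
import numpy as np

# Hypothetical grouping of flattened landmark indices by body region.
REGIONS = {
    "face": list(range(0, 10)),
    "left_hand": list(range(10, 31)),
    "right_hand": list(range(31, 52)),
}

def mask_landmarks(frame: np.ndarray, selected: list[str]) -> np.ndarray:
    """Keep only the landmarks for regions the user marked as relevant,
    so movement elsewhere (e.g. an idle hand) can't add noise."""
    keep = sorted(i for region in selected for i in REGIONS[region])
    return frame[keep]

frame = np.arange(52, dtype=float)          # one flattened landmark frame
features = mask_landmarks(frame, ["face"])  # only the 10 face landmarks remain
```

Because masking happens before the classifier ever sees the frame, irrelevant movement simply doesn't exist from the model's point of view.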

Accomplishments that we're proud of

Getting real-time gesture recognition working with lower lag and better accuracy was a huge milestone for us. We were also really happy with how the post-training editing feature turned out. With this, users can update what a gesture maps to and immediately test the change, which made the whole experience feel more user-friendly.

What we learned

We learned a lot about integrating ElevenLabs and got hands-on experience adding machine learning components. We learned to gauge how many training rounds are actually needed to get reliable gesture recognition. Most importantly, we learned how to iterate on our product so it's best aligned to serve our target audience, especially personalization that extends to gestures and commands.

What's next for Pali

Next, we hope to bring in more robust motion tracking, including hardware like sensor gloves and range-based tracking, to support more complex gestures. All in all, with Pali we strive to make human-world interaction accessible and eliminate barriers to expression.
