Inspiration
As students navigating an increasingly digital world, we've noticed a frustrating gap in how we collaborate: we are often tethered to keyboards and mice that stifle natural expression. The problem is magnified when collaborators speak different languages. We thought back to how humans naturally communicate when words fail: they use their hands. That inspired us to build CVBoard and bring this primal, intuitive form of communication into the digital workspace. We wanted a platform where a designer in Tokyo and a developer in New York could sketch, speak, and understand one another instantly, using nothing but their natural movements. As insurance, we also added transcription and translation.
What it does
CVBoard is a touchless, collaborative workspace that turns your camera into a controller.
- Intuitive Drawing: a one-finger gesture activates Drawing Mode, letting users sketch ideas on a collaborative whiteboard. An open palm stops drawing.
- Voice Chat: a two-finger gesture opens voice chat, with live transcription and translation shown in the UI.
- Additional Features: holding up a three-finger gesture opens a menu where you can manually type in chat or share your screen. An open palm closes the menu.
How we built it
CVBoard is built on a React + Vite frontend with Tailwind CSS for styling. We chose Vite for its near-instant HMR and optimized bundle output, which matter when you're running a heavy computer vision model client-side and can't afford extra latency. We use both Babel and SWC as transpilation plugins, depending on the build target.

The heart of the system is MediaPipe Hands, running entirely in the browser via WebAssembly. The model tracks 21 hand landmarks per frame at real-time frame rates. Rather than relying on a pre-built gesture classifier, we engineered our own gesture recognition layer on top of the raw landmark data. We calculate Euclidean distances between specific fingertip landmarks (e.g., index tip at point 8, middle tip at point 12) and their corresponding knuckles to determine each finger's extension state. From there, we define gesture states: one extended finger triggers Draw Mode, two fingers trigger Voice Chat, three fingers open the radial menu, and an open palm acts as a universal cancel/stop. This approach gave us precise control over sensitivity thresholds and let us tune each gesture independently.

On the drawing side, we map the index fingertip's normalized canvas coordinates directly to SVG or canvas draw events, broadcasting each coordinate delta as a lightweight stroke event over WebSockets. The backend is a Node.js + Socket.io server (server.js) that acts as a relay: it receives stroke data, transcription strings, and presence events from one peer and fans them out to all others in the room. We also wrote a companion Python gesture bridge (gesture_toggle_bridge.py) to handle lower-level camera access and preprocessing where browser permissions were a constraint, communicating back to the frontend over a local WebSocket connection.

For voice, we pipe the audio stream through the Web Speech API for real-time transcription and layer translation on top, so that spoken words appear in the collaborator's language in the UI.
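To make the gesture layer concrete, here is a minimal sketch of the approach described above. The landmark indices follow MediaPipe Hands' published hand model (0 = wrist, 5/8 = index MCP/tip, etc.), but the function names, the wrist-relative distance test, and the threshold values are illustrative assumptions, not the project's tuned implementation.

```javascript
// Euclidean distance between two normalized landmarks.
const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

// A finger counts as "extended" when its tip sits farther from the wrist
// than its knuckle (MCP joint) by some margin. `margin` is an illustrative
// sensitivity threshold, tunable per finger.
function isExtended(landmarks, tip, mcp, margin = 0.05) {
  const wrist = landmarks[0];
  return dist(landmarks[tip], wrist) - dist(landmarks[mcp], wrist) > margin;
}

// Map the per-finger extension states onto the gesture vocabulary
// described above: one finger = draw, two = voice chat, three = menu,
// open palm = universal cancel/stop.
function classifyGesture(landmarks) {
  const fingers = [
    isExtended(landmarks, 8, 5),   // index
    isExtended(landmarks, 12, 9),  // middle
    isExtended(landmarks, 16, 13), // ring
    isExtended(landmarks, 20, 17), // pinky
  ];
  const count = fingers.filter(Boolean).length;
  if (count === 4) return "open-palm";
  if (count === 1 && fingers[0]) return "draw";
  if (count === 2 && fingers[0] && fingers[1]) return "voice-chat";
  if (count === 3) return "menu";
  return "none";
}
```

Requiring specific fingers (index for draw, index + middle for voice chat) rather than just a raw count helps reject accidental poses.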
Challenges we ran into
Getting the UI working on an individual computer came early; the harder part was making the program support multiple people connecting and communicating. The shared whiteboard exposed this clearly: a late-joining peer would connect and see a blank canvas, because we had no mechanism to send a snapshot of the current canvas state on join, only a live stream of new events going forward. The root cause was that we had built real-time sync without accounting for the moment a new peer enters mid-session. Fixing it meant rethinking our connection flow to emit a full state payload the moment a second user joins, before any live events start flowing.
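The fix can be sketched as follows. This models just the per-room state logic around a Socket.io-style connection handler; the event names (`join`, `stroke`, `canvas-snapshot`), the `rooms` map, and `handleConnection` are illustrative assumptions rather than the project's actual server code.

```javascript
// roomId -> array of recorded stroke events, kept so late joiners
// can be caught up.
const rooms = new Map();

// Called once per connecting socket (in Socket.io, from
// io.on("connection", ...)). `socket` is assumed to expose the usual
// on/emit/join/to API.
function handleConnection(socket) {
  socket.on("join", (roomId) => {
    socket.join(roomId);
    if (!rooms.has(roomId)) rooms.set(roomId, []);
    // Replay the full canvas state to the new peer *before* any live
    // events start flowing -- this is the snapshot we were missing.
    socket.emit("canvas-snapshot", rooms.get(roomId));
  });

  socket.on("stroke", ({ roomId, stroke }) => {
    rooms.get(roomId).push(stroke);           // record for future joiners
    socket.to(roomId).emit("stroke", stroke); // fan out to current peers
  });
}
```

The key ordering guarantee is that the snapshot is emitted inside the join handler itself, so no live stroke can reach the new peer before the snapshot does.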
Accomplishments that we're proud of
As this was our first deep dive into MediaPipe and gesture-based computing, we are incredibly proud of successfully bridging the gap between raw computer vision and a functional web UI. It was the first hackathon for one of our members and the second for another, so completing such an ambitious project was an achievement in itself. There were many bugs along the way, and we're proud that we managed to fix them all.
What we learned
We learned how to use MediaPipe's hand-tracking system to detect gestures, using confidence levels to recognize them and translate them into real-time actions on screen. On the backend, we learned the intricacies of state management in a real-time environment. MediaPipe is a foundation, not a solution: the library hands you 21 landmarks per frame and steps back, and everything above that is yours to build. We learned that gesture recognition is fundamentally a math problem. Computing Euclidean distances between fingertip points (landmark 8 for the index tip, landmark 4 for the thumb tip, etc.) is straightforward; the hard part is deciding when a gesture is "real." With confidence windowing we solved this problem and made the gestures reliable enough for practical use.
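One way to picture the confidence-windowing idea: a raw per-frame classification only becomes the active gesture after it has been observed for N consecutive frames, which filters out single-frame misclassifications. This is a minimal sketch of that idea; `makeGestureStabilizer` and the window size are illustrative, not the project's actual tuning.

```javascript
// Returns an update function that takes the raw per-frame gesture label
// and returns the stabilized, currently active gesture.
function makeGestureStabilizer(windowSize = 8) {
  let candidate = null; // gesture currently being "voted on"
  let streak = 0;       // consecutive frames the candidate has been seen
  let active = "none";  // last committed gesture
  return function update(rawGesture) {
    if (rawGesture === candidate) {
      streak += 1;
    } else {
      candidate = rawGesture; // new candidate resets the window
      streak = 1;
    }
    if (streak >= windowSize) active = candidate; // gesture is "real" now
    return active;
  };
}
```

At 30 fps, a window of 8 frames adds roughly a quarter second of latency in exchange for much steadier mode switching, which is the sensitivity/responsiveness trade-off we spent most of our tuning time on.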
What's next for CVBoard
With more time, we would add the ability to customize drawing tools: different brush sizes, colors, and controls like pre-defined shapes. We would also add two-hand support and face recognition so we can more accurately detect and interpret emotion, which would mean using more of MediaPipe's features and bridging them to our JavaScript frontend. Another big limitation we want to fix is that CVBoard currently only works over a LAN; we'd like collaborators to connect from anywhere in the world through a WebRTC mesh network. We can also only host two people right now, and we hope to expand so that larger groups can collaborate at the same time. Finally, we want to add more languages and incorporate more translation APIs.
Built With
- babel
- javascript
- json
- mediapipe
- nextjs
- opencv
- python
- react
- swc
- tailwind
- vite