SilentLink: The Subvocal Operating System

💡 Inspiration

We live in a world dominated by voice assistants—Siri, Alexa, Google Assistant. But they all share a fatal flaw: they require sound.

This creates two massive barriers:

  1. Accessibility: For people with speech impairments, ALS, or paralysis, voice control is impossible, and traditional eye-trackers are prohibitively expensive.
  2. Privacy & Usability: You can’t shout at your phone in a library, a quiet office, or a crowded metro.

We asked ourselves: What if we could create a "Subvocal Operating System"? An interface that reads your intention through the movement of your lips and the gaze of your eyes, without a single sound ever leaving your mouth.

That is how SilentLink was born.

🤖 What it does

SilentLink is a browser-based accessibility tool that transforms a standard webcam into a high-precision input device.

  • Lip-to-Command Interface: Users can train specific mouth gestures (e.g., "Mouth Open," "Pucker," "Smile") to trigger digital actions like scrolling, clicking, or navigating via a "Silent Speech" protocol (see the sketch after this list).
  • Few-Shot Learning: Unlike massive models that require terabytes of data, SilentLink uses Transfer Learning to learn your specific face in seconds directly in the browser.
  • Privacy-First: Everything runs client-side. No video feed is ever sent to a server.
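
To make the "Lip-to-Command" idea concrete, here is a minimal TypeScript sketch of how a recognized gesture label could be routed to a browser action. The labels and the `dispatchGesture` helper are illustrative placeholders, not SilentLink's actual API:

```typescript
// Illustrative gesture-to-action layer (names are hypothetical, not SilentLink's real API).
type GestureLabel = "mouth_open" | "pucker" | "smile" | "neutral";

const actions: Record<GestureLabel, () => void> = {
  // "Mouth Open" -> scroll down one step
  mouth_open: () => window.scrollBy({ top: 300, behavior: "smooth" }),
  // "Pucker" -> scroll back up
  pucker: () => window.scrollBy({ top: -300, behavior: "smooth" }),
  // "Smile" -> activate the currently focused element
  smile: () => (document.activeElement as HTMLElement | null)?.click(),
  // "Neutral" -> do nothing
  neutral: () => {},
};

export function dispatchGesture(label: GestureLabel): void {
  actions[label]();
}
```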

⚙️ How we built it

We built SilentLink using a modern web stack designed for real-time computer vision.

The Stack

  • Frontend: React (Vite) for the UI, styled with Tailwind CSS for a "Cyberpunk HUD" aesthetic.
  • Computer Vision: Google's MediaPipe Face Mesh.
  • Machine Learning: TensorFlow.js with a K-Nearest Neighbors (KNN) classifier.

The Engineering (The "Secret Sauce")

The biggest challenge was making the detection robust against the user moving closer or further from the camera. We couldn't just use raw $x, y$ coordinates.

Instead, we engineered a Normalized Feature Vector. We extract 468 facial landmarks, but we focus on the lip geometry. We calculate the Euclidean Distance between key landmarks and normalize them against the face's vertical height.

For example, to detect the "Mouth Openness Ratio" ($R_{open}$), we use:

$$ R_{open} = \frac{|| \vec{L}_{top} - \vec{L}_{bottom} ||}{|| \vec{F}_{top} - \vec{F}_{bottom} ||} $$

Where:

  • $\vec{L}_{top}$ and $\vec{L}_{bottom}$ are the coordinates of the upper and lower lip landmarks.
  • $\vec{F}_{top}$ and $\vec{F}_{bottom}$ are the coordinates of the forehead and chin (to normalize for depth).
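
Here is a minimal sketch of that calculation in TypeScript. The landmark indices (13/14 for the inner lips, 10 for the forehead, 152 for the chin) are assumptions based on the MediaPipe Face Mesh topology; the exact points SilentLink samples may differ:

```typescript
// Scale-invariant "Mouth Openness Ratio" from the formula above.
// Landmark indices are assumptions (MediaPipe Face Mesh topology):
// 13 ≈ upper inner lip, 14 ≈ lower inner lip, 10 ≈ forehead, 152 ≈ chin.
interface Landmark { x: number; y: number; z: number; }

const UPPER_LIP = 13;
const LOWER_LIP = 14;
const FOREHEAD = 10;
const CHIN = 152;

// Euclidean distance between two landmarks.
function dist(a: Landmark, b: Landmark): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// R_open = ||L_top - L_bottom|| / ||F_top - F_bottom||
export function mouthOpennessRatio(landmarks: Landmark[]): number {
  const lipGap = dist(landmarks[UPPER_LIP], landmarks[LOWER_LIP]);
  const faceHeight = dist(landmarks[FOREHEAD], landmarks[CHIN]);
  return faceHeight > 0 ? lipGap / faceHeight : 0;
}
```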

We feed these normalized vectors into a KNN Classifier running entirely in the browser. This allows the model to "learn" a new gesture with just 20-30 frames of training data.
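
A sketch of that train/predict loop with the off-the-shelf `@tensorflow-models/knn-classifier` package is shown below; the helper names are ours, but `create()`, `addExample()`, and `predictClass()` are the package's real API:

```typescript
import * as tf from "@tensorflow/tfjs";
import * as knnClassifier from "@tensorflow-models/knn-classifier";

const classifier = knnClassifier.create();

// Training: call this ~20-30 times per gesture while the user holds the pose.
export function addTrainingFrame(features: number[], label: string): void {
  const example = tf.tensor1d(features);
  classifier.addExample(example, label);
  example.dispose(); // the classifier keeps its own normalized copy
}

// Inference: classify the current frame's feature vector.
export async function classifyFrame(features: number[]): Promise<string> {
  const input = tf.tensor1d(features);
  const { label } = await classifier.predictClass(input, 10); // k = 10 nearest neighbors
  input.dispose();
  return label;
}
```

Because KNN simply stores normalized examples instead of running gradient descent, "training" is effectively instant, which is what makes learning a new gesture from a handful of frames feasible.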

🚧 Challenges we ran into

  • The "Depth" Problem: Initially, if a user leaned closer to the camera, the model thought their mouth was "bigger" (i.e., open). We solved this by implementing the normalization formula described above, making the system scale-invariant.
  • Asynchronous Hell: Managing the React state alongside the TensorFlow tensors and the MediaPipe animation loop was tricky. We had to carefully manage requestAnimationFrame to ensure we didn't cause memory leaks or freeze the browser UI.
  • Lighting Noise: Low light made the landmarks jittery. We implemented a smoothing function that averages the landmark positions over the last 5 frames to reduce noise.
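
As an illustration of that last fix, here is a self-contained sketch of a 5-frame rolling average (our naming, not the project's actual code):

```typescript
// Rolling average over the last N feature vectors to damp landmark jitter.
class LandmarkSmoother {
  private history: number[][] = [];

  constructor(private windowSize = 5) {}

  smooth(features: number[]): number[] {
    this.history.push(features);
    if (this.history.length > this.windowSize) this.history.shift();
    // Element-wise mean across the buffered frames.
    return features.map(
      (_, i) =>
        this.history.reduce((sum, frame) => sum + frame[i], 0) /
        this.history.length
    );
  }
}

// Usage: pass each frame's feature vector through the smoother before classifying.
const smoother = new LandmarkSmoother(5);
// const stableFeatures = smoother.smooth(rawFeatures); // rawFeatures: hypothetical per-frame vector
```

For the memory-leak side of the problem, TensorFlow.js's `tf.tidy()` is the standard way to dispose intermediate tensors created inside each animation frame.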

🏆 Accomplishments that we're proud of

  • Near-Zero Latency: Inference runs in roughly 15 ms per frame (over 60 FPS) because everything stays client-side with WebGL acceleration.
  • In-Browser Training: We didn't just deploy a model; we built a trainer. The fact that a user can define their own custom gestures in 10 seconds is a massive usability win.
  • Accessibility: Hands-free, voice-free computer control with nothing more than a standard webcam, instead of prohibitively expensive dedicated hardware like eye-trackers.

Built With

  • React (Vite)
  • Tailwind CSS
  • MediaPipe Face Mesh
  • TensorFlow.js