Inspiration

We've all been there:

  • Hands covered in flour mid-recipe, needing to scroll down
  • Standing at a whiteboard mid-presentation, fumbling for a mouse
  • Lying in bed with your laptop just out of reach

AFK started as a joke on the acronym — what if "Away From Keyboard" wasn't a problem to solve, but a mode to embrace? We are here to re-define hands-free browsing feel natural, not futuristic.


What it does

AFK is a Chrome extension that lets you control your browser entirely through hand gestures and voice commands — no keyboard, no mouse, no contact required.

Gesture controls:

  • Cursor tracking — index finger moves the cursor in real-time via webcam
  • Click — hold a pinch gesture for 700ms
  • Scroll — index + middle finger raised, move hand up/down
  • Tab switch — pinch index + middle, drag left/right
  • Back / Forward — wide left or right swipe

Voice commands (20+):

  • Navigation: "scroll down", "go back", "next tab", "new tab"
  • Playback: "play video", "pause", "mute", "next video"
  • Page control: "zoom in", "refresh", "enter fullscreen"
  • Input: "start writing" activates voice-to-text in any focused field
  • Fully customizable — remap any command to your own phrases in any language. Not a native English speaker? Set your commands in French, Tagalog, Bahasa, or whatever feels natural. AFK works the way you talk, not the other way around.

Eyes & attention detection:

  • Video pauses when you look away — using real-time eye and face landmark detection, AFK automatically pauses any playing video the moment you turn away from the screen, and resumes when you look back. No more missing content because you glanced away.

Text-to-speech (ElevenLabs):

  • AFK uses ElevenLabs to read back page content, confirmations, and command feedback in natural-sounding voice — closing the loop on a fully hands-free, eyes-free experience.

Privacy & control:

  • Live camera indicator always visible when active
  • Tracking pauses automatically when you look away from the screen
  • One-key kill switch disables everything instantly

How we built it

Layer Technology Role
Hand tracking MediaPipe Hands 21-landmark model, runs via WASM
Eye/face detection MediaPipe Face Mesh Gaze and attention tracking for look-away pause
Voice input Web Speech API Customizable phrase matching for 30+ commands
Voice output ElevenLabs TTS Natural-sounding spoken feedback and readback
Extension Chrome MV3 Content script + offscreen document
Gesture engine Custom JS Debounce, hold-timer, cooldown windows
HUD overlay Shadow DOM Isolated UI injected into every page

Key architectural decisions:

  • Offscreen document for camera access — MV3 disallows getUserMedia in content scripts
  • Web Worker for MediaPipe — keeps the main thread free, throttled to 24fps
  • Shadow DOM for the HUD — fully isolated from host page styles
  • Rolling-window phrase matcher on top of Web Speech API for stable, language-agnostic voice detection
  • ElevenLabs integration for TTS responses — so AFK can talk back, not just listen

Challenges we ran into

  • MV3 camera restrictions — routing the webcam through an offscreen document added latency we had to aggressively optimize away
  • Gesture false positives — raw landmarks are noisy; we built a multi-stage filter using velocity thresholding, hold-duration gating, and per-gesture cooldown windows
  • Voice stability — Web Speech API returns partial results in real-time; we had to wait for stable output while still feeling instant to the user
  • Multilingual voice commands — making the phrase matcher language-agnostic required decoupling the action logic entirely from hardcoded English strings, so any phrase in any language can map to any action
  • Gaze detection reliability — distinguishing "user looked away" from "user blinked" or "user shifted slightly" required tuning eye landmark thresholds carefully to avoid false pauses
  • Performance — running hand tracking, face mesh, and TTS simultaneously inside a live webpage is expensive; we stay under 8% CPU on a mid-range laptop by combining Web Worker offloading, 24fps throttling, and requestAnimationFrame gating
  • Shadow DOM isolation — injecting UI into arbitrary pages without breaking their layout required wrapping the entire HUD in a shadow root

Accomplishments that we're proud of

  • 30+ voice commands responding to natural phrasing, not rigid syntax — and fully remappable in any language
  • Look-away video pause — face and eye detection that automatically pauses video content the moment you stop watching, and resumes when you return
  • ElevenLabs TTS feedback — AFK doesn't just accept commands, it talks back in a natural voice, making the experience even more accessible
  • Privacy-first UX: live camera thumbnail + activity indicator + one-key kill switch
  • Everything runs client-side — zero data leaves the device for gesture and voice processing, no backend required

What we learned

  • Gesture UX is a feedback problem first. Adding a live camera thumbnail and gesture label to the HUD was the single biggest UX improvement — users stopped questioning whether it worked and just used it
  • Gestures and voice are complementary, not competing. Gestures handle spatial tasks naturally (cursor, scroll, drag); voice handles discrete commands cleanly. Together they cover nearly every browsing interaction
  • Language shouldn't be a barrier to accessibility. Building customizable voice commands taught us that defaulting to English is a design choice, not a technical constraint — and one worth challenging
  • MV3 is genuinely hard. A lot of MV2 patterns are gone and the documentation hasn't caught up — the offscreen document pattern for camera access took significant reverse-engineering

What's next for AFK

  • Two-hand gestures — pinch-to-zoom, two-hand swipe for fullscreen
  • Expanded multilingual support — pre-built command packs for major languages out of the box
  • Accessibility partnerships — optimize with occupational therapists for mobility impairment use cases
  • Firefox + Edge support — core architecture is already browser-agnostic
  • On-device wake-word detection for voice activation without always-on listening

Built With

  • accessibility
  • chrome
  • computervision
  • css
  • elevenlabs
  • face-mesh
  • gesture-control
  • gesturecontrol
  • hand-tracking
  • html
  • javascript
  • manifest-v3
  • mediapipe
  • shadow-dom
  • text-to-speech
  • voice-recognition
  • web-speech-api
  • web-workers
  • webassembly
Share this project:

Updates