## Inspiration

We kept running into the same annoying workflow — see something confusing on screen, take a screenshot, open a new tab, drag it into ChatGPT or Claude, type the question, wait for the answer, then switch back. Half the time we'd forget to delete the screenshots and they'd pile up on the desktop. The other half we'd lose context switching between tabs.

It felt broken. The information is right there on the screen — why can't we just ask about it directly?

That's what ScreenSense does. No screenshots cluttering your desktop. No tab switching. No copy-pasting. Just hold a key, ask your question, and get the answer right there on the page you're looking at.

## What it does

ScreenSense Voice is a Chrome extension that lets you hold a shortcut key, speak a question about what's on your screen, and get an instant AI-powered answer overlaid directly on the page. The AI automatically captures what you're seeing, transcribes your voice, and responds with streaming text and a short spoken summary — all without leaving the page.

Three display modes (text + audio, audio only, text only) and five explanation levels (Kid, Student, College, PhD, Executive) let you customize how you get answers. Conversation memory tracks up to 20 turns per tab so you can ask follow-ups naturally.
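The per-tab memory cap can be sketched as a simple bounded history. This is a hypothetical illustration, not the actual extension code — the names `ConversationTurn` and `MemoryStore` are ours:

```typescript
// Hypothetical sketch of per-tab conversation memory capped at 20 turns.
interface ConversationTurn {
  question: string;
  answer: string;
  timestamp: number;
}

const MAX_TURNS = 20;

class MemoryStore {
  // Histories keyed by tab ID, so follow-ups stay scoped to one page.
  private turns = new Map<number, ConversationTurn[]>();

  addTurn(tabId: number, turn: ConversationTurn): void {
    const history = this.turns.get(tabId) ?? [];
    history.push(turn);
    // Drop the oldest turn once the cap is exceeded.
    if (history.length > MAX_TURNS) history.shift();
    this.turns.set(tabId, history);
  }

  getHistory(tabId: number): ConversationTurn[] {
    return this.turns.get(tabId) ?? [];
  }
}
```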

## How we built it

ScreenSense is a Chrome extension built on Manifest V3, written entirely in TypeScript. Three execution contexts work together: a content script renders the overlay UI inside Shadow DOM, a service worker handles orchestration and API calls, and an offscreen document captures microphone audio via MediaRecorder.
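Coordinating three contexts comes down to a message protocol. The message names and fields below are illustrative assumptions about what such a protocol might look like, shown with a small type-safe dispatcher:

```typescript
// Hypothetical messages the three contexts might exchange (not the real protocol).
type ExtensionMessage =
  | { type: "CAPTURE_SCREEN"; tabId: number }     // content script -> service worker
  | { type: "START_RECORDING" }                   // service worker -> offscreen doc
  | { type: "AUDIO_READY"; audioBase64: string }  // offscreen doc -> service worker
  | { type: "ANSWER_CHUNK"; text: string };       // service worker -> content script

// Each handler only sees the payload for its own message type.
type Handlers = {
  [M in ExtensionMessage as M["type"]]?: (msg: M) => void;
};

function dispatch(msg: ExtensionMessage, handlers: Handlers): boolean {
  const handler = handlers[msg.type] as ((m: ExtensionMessage) => void) | undefined;
  if (!handler) return false; // unrecognized or unhandled message
  handler(msg);
  return true;
}
```

In the real extension these messages would travel over `chrome.runtime.sendMessage`; the dispatcher pattern keeps each context's handlers narrow and typed.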

AI stack: Groq Whisper Large V3 Turbo for speech-to-text, Llama 4 Scout (vision) for screen-aware answers streamed via SSE, and ElevenLabs for natural text-to-speech with Web Speech API fallback.
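Streaming the answer means parsing SSE frames as they arrive. A minimal sketch, assuming the OpenAI-compatible event format Groq's API uses (`data:` lines carrying `choices[0].delta.content`, terminated by a `[DONE]` sentinel):

```typescript
// Extract text deltas from a raw chunk of an OpenAI-style SSE stream.
// The payload shape is an assumption based on the common delta convention.
function parseSSEChunk(raw: string): string[] {
  const chunks: string[] = [];
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip comments and blank lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break;          // end-of-stream sentinel
    try {
      const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content;
      if (typeof delta === "string") chunks.push(delta);
    } catch {
      // Partial JSON can arrive mid-chunk; a real parser would buffer it.
    }
  }
  return chunks;
}
```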

## Challenges we ran into

  • Chrome MV3 service workers can't touch the microphone — had to use the offscreen document pattern to bridge that gap
  • Race conditions between TTS generation and quick follow-ups caused fetch failures — solved with fire-and-forget async
  • Overlay needed to work on every website without breaking — Shadow DOM isolation was the answer
  • Async settings loading caused blank overlays on fresh Chrome profiles — fixed by making DOM creation synchronous
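The fire-and-forget fix above can be sketched as follows. `synthesizeSpeech` is a stand-in for the real ElevenLabs call, and this is an assumption about the shape of the fix rather than the actual code:

```typescript
// Stand-in for the network call to the TTS provider.
async function synthesizeSpeech(text: string): Promise<void> {
  await new Promise((r) => setTimeout(r, 10));
  if (!text) throw new Error("empty text");
}

// Fire-and-forget: the TTS request is launched but never awaited, so a quick
// follow-up question starts immediately, and a failed TTS fetch can no longer
// reject (and abort) an unrelated in-flight request.
function speakInBackground(text: string): void {
  void synthesizeSpeech(text).catch((err) => {
    console.warn("TTS failed (non-fatal):", err);
  });
}
```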

## What we learned

  • Voice-first UX needs to feel instant — every 100ms of latency matters
  • The offscreen document API is powerful but coordination with service workers is tricky
  • Shadow DOM is essential for extension UIs that need to survive any host page's CSS
  • Simple UX wins — hold a key, talk, get an answer. No menus, no clicks, no friction

## What's next

  • On-device models for fully offline use
  • Multi-monitor and cross-tab awareness
  • Chrome Web Store publication
  • Accessibility features: screen reader support, high-contrast mode

## Built With

  • chrome-extension-manifest-v3
  • elevenlabs-api
  • groq
  • llama-4-scout
  • mediarecorder-api
  • react
  • shadow-dom
  • typescript
  • webpack
  • whisper-large-v3-turbo