Inspiration

2.2 billion people worldwide live with vision impairment, yet 96% of websites fail basic accessibility standards. Screen readers haven't fundamentally changed in decades — they force users to tab through elements one by one, reading out raw HTML structure. We asked: what if blind and low-vision users could just talk to their computer and have it understand what's on screen, take action, and speak back?

Claude AI can now understand web pages, reason about content, and act on it. We built Thea to put that power directly in the hands of the people who need it most.

What It Does

Thea is a voice-first desktop assistant that lets blind and low-vision users converse with their computer to navigate any website — no visual layout knowledge required.

Instead of memorizing rigid commands, users simply speak their intent:

  • "What's on this page?"
  • "Read my latest email"
  • "Find me a train to London tomorrow morning"
  • "Fill in this form for me"

Thea understands the page, executes actions, and speaks the results back. It works on any website, even poorly built ones.

Voice pipeline:

$$\text{User speech} \xrightarrow{\text{Whisper}} \text{Text} \xrightarrow{\text{Claude}} \text{Actions} \xrightarrow{\text{Playwright}} \text{Result} \xrightarrow{\text{ElevenLabs}} \text{Spoken response}$$

The OpenClaw sidecar process manages browser automation sessions, maintaining state between commands so conversations feel natural and continuous.
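The pipeline above can be sketched as a loop that keeps conversation state between turns. This is a minimal illustrative sketch, not Thea's actual code: the stub functions stand in for Whisper, Claude, Playwright, and ElevenLabs, and the `Session` class plays the role of the sidecar's per-conversation state.

```python
from dataclasses import dataclass, field

# Stubs standing in for the real services (hypothetical signatures):
def transcribe(audio: bytes) -> str:                  # Whisper STT
    return audio.decode()

def plan_actions(text: str, history: list) -> list:   # Claude reasoning
    return [f"navigate:{text}"]

def execute(actions: list) -> str:                    # Playwright automation
    return f"executed {len(actions)} action(s)"

def speak(result: str) -> str:                        # ElevenLabs TTS
    return f"[spoken] {result}"

@dataclass
class Session:
    """Sidecar-style session: state persists across voice commands,
    so follow-up requests can refer back to earlier turns."""
    history: list = field(default_factory=list)

    def handle(self, audio: bytes) -> str:
        text = transcribe(audio)
        actions = plan_actions(text, self.history)    # history gives context
        result = execute(actions)
        self.history.append((text, result))           # remember this turn
        return speak(result)

session = Session()
print(session.handle(b"read my latest email"))
```

Because the session object outlives each command, a follow-up like "open the second one" can be planned against the stored history rather than starting from scratch.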

Challenges

  • Safety guardrails — Thea must never take irreversible actions (payments, form submissions, account changes) on its own. We built a confirmation layer so the user always stays in control.
  • Unreliable web pages — Most sites have broken or missing accessibility markup. We had to make Claude robust enough to understand pages even without proper labels or ARIA attributes.
  • Audio pipeline latency — Chaining STT → AI reasoning → browser automation → TTS introduced noticeable delays. We optimized each hop and added real-time UI state feedback (listening, transcribing, working, speaking) so users always know what's happening.
  • Cross-platform hotkey detection — Capturing a system-wide push-to-talk key reliably across macOS, Windows, and Linux required low-level native hooks.
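The confirmation layer in the first bullet can be sketched as a gate in front of action execution. Everything here is illustrative, assumed for the example: the action names, the `IRREVERSIBLE` set, and the `confirm` callback are not Thea's actual schema, just one way to express "never act unilaterally on irreversible actions."

```python
# Actions that must never run without explicit user approval (assumed set).
IRREVERSIBLE = {"submit_payment", "submit_form", "change_account"}

def needs_confirmation(action: str) -> bool:
    """Irreversible actions require the user's go-ahead."""
    return action in IRREVERSIBLE

def run(action: str, confirm) -> str:
    """Execute `action`; for irreversible ones, call `confirm(action)`
    (e.g. a spoken yes/no prompt) and cancel unless the user agrees."""
    if needs_confirmation(action) and not confirm(action):
        return "cancelled"
    return "done"  # safe actions run immediately

# Read-only navigation runs straight through; a payment waits for a "yes".
print(run("read_page", lambda a: False))       # no prompt needed
print(run("submit_payment", lambda a: False))  # user declined
```

Keeping the gate as a single choke point means new action types are safe by default: anything added to the irreversible set automatically inherits the prompt.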

What We Learned

The biggest insight: accessibility isn't a feature — it's a fundamentally different interaction model. Sighted users navigate spatially; blind users navigate conversationally. Building for conversation-first, not screen-first, changed every design decision we made.
