Inspiration

2.2 billion people worldwide have some form of vision impairment. For many of them, finding a bottle of medication on a nightstand or keys on a kitchen counter is a daily friction point — not dangerous, not dramatic, just quietly exhausting. The tools that exist give them eyes. Nobody gives them hands.

I wanted to close that loop: from "I see it" all the way to "I'm touching it."

My starting question was simple: what happens after "your keys are to the left"?

Apps like Be My Eyes and Seeing AI can tell a visually impaired person what's in front of them. But they stop at description. The user hears "your keys are on the counter, to the left" — and then they're on their own. They sweep their hand across the surface, hoping to make contact, with zero feedback on whether they're getting warmer or colder. This is why I built IRIS.

What it does

IRIS turns any smartphone into a spatial guidance system with three phases:

Phase 1 — Voice. You tap Start and say "find my keys." Natural language — no menus, no buttons to navigate.

Phase 2 — Metal detector. You sweep your phone over the table. It vibrates faster as you get closer to the object (like a metal detector), and speaks directional hints: "try moving left." Gemini AI judges proximity by how large the object appears in the camera frame — no depth sensor needed.
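The proximity-to-haptics mapping can be sketched as one pure function. The name `proximityToPulseMs` and the exact interval range are illustrative assumptions, not IRIS's shipped values:

```typescript
// Hypothetical sketch: map Gemini's 0-10 proximity score to a haptic cadence.
// Far away = slow ticks, right on top = rapid buzz (the "metal detector" feel).
export function proximityToPulseMs(score: number): number {
  const clamped = Math.min(10, Math.max(0, score));
  // Linear map: score 0 -> one pulse per second, score 10 -> ten per second.
  return 1000 - clamped * 90;
}
```

In the browser, a timer can fire `navigator.vibrate(50)` every `proximityToPulseMs(score)` milliseconds, so the tick rate itself encodes distance.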

Phase 3 — Hand guidance. When the phone is directly above the target, you prop it up so the camera sees the surface. Stereo audio beeps guide your hand: pitch rises and tempo increases as you close in, and stereo panning tells you left vs. right. When your fingers reach the object: "Found it!"
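The Phase 3 mapping (pitch, tempo, pan from distance) can be sketched as a single pure function; the names and constants below are illustrative, not the actual sonifier:

```typescript
// Hypothetical sonifier sketch: hand-to-target geometry -> beep parameters.
export interface BeepParams {
  freqHz: number;      // pitch rises as the hand closes in
  intervalMs: number;  // tempo increases as the hand closes in
  pan: number;         // -1 = full left, +1 = full right
}

export function sonify(
  distPx: number,    // fingertip-to-target distance in pixels
  dxPx: number,      // signed horizontal offset (target minus fingertip)
  maxDistPx: number, // normalisation range, e.g. the frame diagonal
): BeepParams {
  const t = Math.min(1, Math.max(0, distPx / maxDistPx)); // 0 = on target
  return {
    freqHz: 1200 - 800 * t,    // 1200 Hz on target, down to 400 Hz far away
    intervalMs: 120 + 580 * t, // 120 ms between beeps on target, 700 ms far
    pan: Math.max(-1, Math.min(1, dxPx / (maxDistPx / 2))),
  };
}
```

Each `BeepParams` can then be realised with a Web Audio `OscillatorNode` feeding a `StereoPannerNode`, which is how the browser gets low-latency stereo synthesis.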

The entire flow — from speaking the target name to touching the object — takes about 30 seconds with your eyes closed.

How I built it

The Python prototype came first. I started with OpenCV + MediaPipe + sounddevice on my laptop, building each module in isolation with its own test suite: geometry engine (pure math), sonifier (maps distance to audio params), audio engine (stereo beeps via callback streaming), hand tracker (MediaPipe landmark 8, the index fingertip). 39 tests, all passing, before I ever connected them.

Then I hit reality. The original spec called for OpenCV's CSRT tracker with a one-shot Gemini bounding box. In practice, OpenCV's legacy trackers are deprecated, and a single bbox drifts within seconds. I scrapped the tracker entirely and switched to polling Gemini 2.5 Flash once per second — the object is sitting on a table, it doesn't need 30fps tracking.

The web port changed the product. Moving from Python to Next.js + Vercel meant the app runs in any phone browser — no install, no app store, no special hardware. MediaPipe has an official WASM build that runs client-side at 30fps. Web Audio API gives sub-10ms stereo synthesis. The only server round-trip is one JPEG per second to a Vercel API route that proxies to Gemini (keeping the API key secure).
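The server side of that round-trip can be sketched as a pure payload builder; the prompt wording and function name are assumptions, not the actual IRIS code. A thin Vercel route handler then forwards this payload to Gemini with the key read from `process.env.GEMINI_API_KEY`, so the key never reaches the browser:

```typescript
// Hypothetical sketch: build the Gemini generateContent payload server-side.
// The prompt text and function name are illustrative, not the shipped code.
export function buildDetectRequest(target: string, jpegBase64: string) {
  return {
    contents: [
      {
        parts: [
          // One JPEG per second, plus the object name the user asked for.
          { text: `Locate the ${target} and return its bounding box as JSON.` },
          { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } },
        ],
      },
    ],
  };
}
```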

The three-phase UX emerged from testing. The original design was just hand-tracking + beeps (Phase 3). But testing revealed a fundamental problem: you have to prop the phone to see both the object and your hand, but a blind user can't verify the camera angle. The "metal detector" sweep (Phase 2) solves this — the phone IS the sensor while you hold it, and haptic vibration guides you close enough that propping it becomes trivial.

Challenges I faced

The init race condition. MediaPipe's WASM model is ~10MB. I preloaded it during Phase 2, but initHandTracker() returned immediately when initialization was already in progress — leaving the hand tracker null when Phase 3 started. Beeps wouldn't fire for 3-6 seconds. Fix: store the init Promise so concurrent callers await the same completion.
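The fix is the classic memoised-Promise pattern; a minimal sketch under assumed names (`HandTracker`, the injected `load` callback):

```typescript
// Sketch of the race fix: every caller awaits the same in-flight init Promise
// instead of an early return that leaves the tracker null.
type HandTracker = { detect: (frame: unknown) => unknown };

let tracker: HandTracker | null = null;
let initPromise: Promise<HandTracker> | null = null;

export function initHandTracker(
  load: () => Promise<HandTracker>,
): Promise<HandTracker> {
  if (tracker) return Promise.resolve(tracker); // already initialised
  if (!initPromise) {
    // First caller kicks off the ~10MB WASM load...
    initPromise = load().then((t) => (tracker = t));
  }
  // ...and every later caller awaits that same Promise.
  return initPromise;
}
```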

Bbox jitter. Gemini returns slightly different bounding box coordinates every poll. The target would "jump" 100+ pixels between frames, making the audio guidance erratic. Fix: smooth the bbox center with a 3-sample rolling average. Fresh enough to track real movement, stable enough to guide a hand.
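The smoothing step is small enough to show whole; the class name is an assumption:

```typescript
// Sketch of the jitter fix: a 3-sample rolling average over the bbox centre.
export class BboxSmoother {
  private samples: Array<[number, number]> = [];
  constructor(private readonly windowSize = 3) {}

  /** Push one raw centre; returns the smoothed centre. */
  push(cx: number, cy: number): [number, number] {
    this.samples.push([cx, cy]);
    if (this.samples.length > this.windowSize) this.samples.shift();
    const n = this.samples.length;
    const sx = this.samples.reduce((s, [x]) => s + x, 0);
    const sy = this.samples.reduce((s, [, y]) => s + y, 0);
    return [sx / n, sy / n];
  }
}
```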

The 30px arrival problem. My original "you've arrived" threshold was 30 pixels — precise on paper, impossible in practice. MediaPipe's fingertip detection jitters by 10-20px per frame, and the bbox center is only approximate. The hand would hover right on the object but never trigger "found." Fix: resolution-independent threshold (10% of frame diagonal) with a rolling window — 8 of the last 15 frames within threshold = found. Tolerates jitter, no false positives.
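The arrival logic above (10% of the frame diagonal, 8 of the last 15 frames) can be sketched as a small class; the name and parameter defaults are assumptions matching the description:

```typescript
// Sketch of the arrival test: resolution-independent threshold plus a
// rolling vote that tolerates fingertip jitter.
export class ArrivalDetector {
  private hits: boolean[] = [];

  constructor(
    private readonly frameW: number,
    private readonly frameH: number,
    private readonly fraction = 0.10, // 10% of the frame diagonal
    private readonly windowSize = 15,
    private readonly required = 8,
  ) {}

  /** Feed one frame's fingertip-to-target distance; true once "found". */
  update(distPx: number): boolean {
    const diag = Math.hypot(this.frameW, this.frameH);
    this.hits.push(distPx <= diag * this.fraction);
    if (this.hits.length > this.windowSize) this.hits.shift();
    return this.hits.filter(Boolean).length >= this.required;
  }
}
```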

Speech blocking audio. Browser speechSynthesis and Web Audio compete for the audio output on Android Chrome. When the app said "Guiding now," it would suppress the beeps for 2-3 seconds. Fix: shortened all transition speech to under 1 second, and added speechSynthesis.cancel() before every new utterance so nothing queues up.

The 2D depth problem. A phone camera is 2D — it can't tell if the object is 6 inches or 6 feet away. I solved this without a depth sensor by asking Gemini to rate proximity 0-10 based on apparent object size in the frame. A coffee mug filling 40% of the frame means you're right on top of it. One at 5% means you're far. Gemini understands perspective intuitively — no custom training required.
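IRIS asks Gemini for the 0-10 score directly, but the intuition can be mirrored with a local heuristic; this sketch (name and the 40%-fill anchor taken from the example above) is an illustration, not the shipped logic:

```typescript
// Illustrative heuristic: apparent size as a stand-in for depth.
// A bbox filling ~40% of the frame scores 10 ("right on top of it");
// ~5% scores low ("far away"). IRIS gets this score from Gemini instead.
export function sizeToProximity(
  bboxW: number, bboxH: number, frameW: number, frameH: number,
): number {
  const fraction = (bboxW * bboxH) / (frameW * frameH);
  return Math.min(10, Math.round((fraction / 0.4) * 10));
}
```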

What I learned

Building for accessibility forces you to think about assumptions you never question. "Show a green box on the target" is useless if the user can't see the screen. "Press F to find" doesn't work on a phone. "Camera at this angle" can't be verified without sight.

Every design decision had to pass one test: can a person who cannot see the screen use this with zero sighted assistance? That constraint killed half my features and made the surviving half dramatically better.

I also learned that the gap between "AI can see" and "AI can guide" is enormous. Computer vision research is obsessed with detection accuracy. But for assistive tech, detection is table stakes — the real product is what happens in the seconds between "I found your keys" and "your hand is on them." That's the space IRIS lives in.

What's next

  • Gemini Live API integration for natural voice narration ("you're getting warmer... a little to the left... almost there") instead of robotic TTS
  • Bone conduction audio support — doesn't block ambient sound, critical for safety
  • Multi-object scene memory — "what's on this table?" followed by "guide me to the phone"
  • Indoor navigation — extend Phase 2 beyond tabletop to room-scale with phone IMU sensors
