A spatial audio prosthetic that turns any smartphone into an object-finding guide for visually impaired users.
No special hardware. No wearables. No app install. Just open a URL on the phone you already own.
2.2 billion people worldwide have some form of vision impairment. For many, the simple act of finding everyday objects — keys on a counter, a phone on a table, medication on a nightstand — requires asking someone for help or painstakingly sweeping their hands across a surface and hoping for contact.
Existing assistive apps like Be My Eyes and Seeing AI can describe what a camera sees: "Your keys are to the left." But description is where they stop. The user is left to translate a verbal hint into physical action, with no feedback on whether they're getting closer or drifting further away.
The gap: No existing solution provides continuous, real-time physical guidance from detection to touch.
IRIS bridges that gap with a three-phase closed-loop guidance system:
Phase 1 — Voice. The user taps Start and speaks naturally: "Find my keys." The app understands — no rigid commands, no button navigation.
Phase 2 — Sweep. The user holds their phone and slowly sweeps it across the table. IRIS sends camera frames to Gemini AI once per second, which judges how close the phone is to the target object. The phone vibrates faster as it gets closer — like a metal detector for everyday objects. Gemini also provides directional hints ("try moving left") spoken aloud.
Phase 3 — Touch. When the phone is directly over the target, IRIS says "Guiding now." The user props the phone and moves their hand into the camera view. Stereo audio beeps guide the hand: pitch and tempo increase with proximity, stereo panning indicates direction. When the hand reaches the object, IRIS confirms: "Found it!"
The entire interaction — from "find my keys" to fingers on the keys — takes about 30 seconds, eyes closed.
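The proximity-to-feedback mapping behind Phases 2 and 3 can be sketched as a pure function. This is an illustrative sketch only — the function name `mapProximity`, the frequency range, and the pulse timings are assumptions, not the actual IRIS source:

```typescript
// Map a normalized proximity score (0 = far, 1 = on target) and a horizontal
// offset (-1 = far left, +1 = far right) to the three feedback channels.
// All names and numeric ranges here are illustrative assumptions.
interface Feedback {
  beepHz: number;         // Web Audio oscillator frequency (pitch rises when closer)
  beepIntervalMs: number; // time between beeps (tempo rises when closer)
  pan: number;            // StereoPannerNode pan value, -1..1
  vibratePauseMs: number; // pause between haptic pulses (shorter = closer)
}

function mapProximity(score: number, offsetX: number): Feedback {
  const s = Math.min(1, Math.max(0, score));
  return {
    beepHz: 300 + s * 900,                     // 300 Hz far -> 1200 Hz on target
    beepIntervalMs: 600 - s * 480,             // 600 ms far -> 120 ms on target
    pan: Math.min(1, Math.max(-1, offsetX)),   // left/right steering cue
    vibratePauseMs: Math.round(800 - s * 700), // metal-detector-style pulse rate
  };
}
```

In the browser, `beepHz` would drive an `OscillatorNode.frequency`, `pan` a `StereoPannerNode.pan`, and `vibratePauseMs` the gap passed to `navigator.vibrate()`; only the mapping itself is shown here.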
| Capability | Be My Eyes | Seeing AI | IRIS |
|---|---|---|---|
| Identifies objects | Yes (human + AI) | Yes (AI) | Yes (Gemini AI) |
| Describes location | Yes ("to your left") | Limited | Yes |
| Guides you there | No | No | Yes — continuous feedback until touch |
| Confirms you found it | No | No | Yes — audio + Gemini visual confirmation |
| Works on any object | Yes | Fixed classes only | Yes — describe anything in natural language |
| Requires install | App download | App download | No — runs in browser |
| Requires hardware | Phone | Phone | Phone (same one you have) |
| Feedback modality | Voice description | Voice + some haptic | Haptic + spatial audio + voice |
Be My Eyes is a pair of remote eyes. IRIS is a pair of remote hands.
The key technical differentiator: IRIS uses Gemini 2.5 Flash as a zero-shot semantic object detector. Unlike YOLO or MobileNet (trained on fixed object classes), Gemini can find anything you can describe in words — "the small white pill bottle behind the mug" — with no retraining, which makes IRIS flexible enough for open-ended real-world use.
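Since `generateContent` returns plain text, a detection round-trip amounts to prompting for strict JSON and parsing it defensively. The response schema below (`found`/`box` in normalized coordinates) is an assumed shape for illustration, not the actual IRIS protocol:

```typescript
// Parse a bounding-box reply from a Gemini generateContent call.
// The prompt (not shown) would ask the model to answer ONLY with JSON like
// {"found": true, "box": {"x": 0.41, "y": 0.32, "w": 0.10, "h": 0.08}}
// in normalized frame coordinates. This schema is an illustrative assumption.
interface BoxReply {
  found: boolean;
  box?: { x: number; y: number; w: number; h: number };
}

function parseBoxReply(modelText: string): BoxReply {
  // Models sometimes wrap JSON in markdown code fences; strip them first.
  const cleaned = modelText.replace(/```(?:json)?/g, "").trim();
  try {
    const obj = JSON.parse(cleaned);
    if (typeof obj.found === "boolean") return obj as BoxReply;
  } catch {
    /* fall through to "not found" */
  }
  return { found: false };
}
```

Treating any malformed reply as "not found" keeps the guidance loop safe: a bad poll simply produces one silent second rather than a wrong direction cue.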
┌─────────────────────────────────────────────────────────┐
│ Phone Browser │
│ │
│ Camera (getUserMedia) ──→ Frame capture (JPEG) │
│ │ │ │
│ ▼ ▼ (1 req/sec) │
│ MediaPipe WASM ◄──┐ Vercel API Routes │
│ Hand tracking │ ┌────────────────────┐ │
│ ~30fps client │ │ /api/proximity │──→ Gemini│
│ │ │ │ /api/detect │ 2.5 │
│ ▼ │ │ /api/confirm │ Flash │
│ Geometry Engine │ └────────────────────┘ │
│ (distance + pan) │ │ │
│ │ │ ▼ │
│ ▼ │ Bounding box / proximity │
│ Sonifier ─────────┘ │
│ │ │
│ ▼ │
│ Web Audio API navigator.vibrate() │
│ (stereo beeps) (haptic pulses) │
│ │ │ │
│ ▼ ▼ │
│ Earbuds Phone motor │
└─────────────────────────────────────────────────────────┘
Everything except Gemini runs client-side. Hand tracking, geometry, audio synthesis, haptics, and speech all execute on the phone. The only server round-trip is one JPEG frame per second to the Gemini API through a Vercel edge function (which also keeps the API key server-side and secure).
- Framework: Next.js 14 (App Router), TypeScript, Tailwind CSS
- Deployment: Vercel
- Object Detection: Google Gemini 2.5 Flash (zero-shot, via REST API)
- Hand Tracking: MediaPipe HandLandmarker (WASM, client-side, ~30fps)
- Audio: Web Audio API (OscillatorNode + StereoPannerNode)
- Haptics: Vibration API (`navigator.vibrate()`, Android Chrome)
- Speech: Browser SpeechRecognition + SpeechSynthesis (free, no API)
Bbox smoothing. Gemini returns slightly different bounding boxes each poll. Raw coordinates cause the target to "jump," confusing the audio guidance. IRIS maintains a rolling average of the last 3 bbox centers, giving stable guidance while still tracking movement.
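The smoothing step can be sketched as a small rolling-average helper (the class name `SmoothedCenter` is illustrative; the window of 3 comes from the text):

```typescript
// Rolling average of the last N bounding-box centers, in normalized
// coordinates (0..1). N = 3 matches the description above.
class SmoothedCenter {
  private history: { x: number; y: number }[] = [];
  constructor(private readonly window = 3) {}

  // Feed the latest raw center from Gemini; get back the smoothed center.
  push(x: number, y: number): { x: number; y: number } {
    this.history.push({ x, y });
    if (this.history.length > this.window) this.history.shift();
    const n = this.history.length;
    return {
      x: this.history.reduce((s, c) => s + c.x, 0) / n,
      y: this.history.reduce((s, c) => s + c.y, 0) / n,
    };
  }
}
```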
Resolution-independent arrival detection. The "arrived" threshold is 10% of the frame diagonal, not a fixed pixel count. This ensures consistent behavior whether the camera provides 640×480 or 1920×1080.
Rolling window arrival. Instead of requiring N consecutive frames where the hand is "close enough" (which fails due to MediaPipe jitter), IRIS uses a rolling window: if 8 of the last 15 frames register arrival, it declares found. This tolerates natural hand tremor without false positives.
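Together, the two arrival rules above reduce to one geometric check plus a small amount of state. The 10% fraction and the 8-of-15 window come from the text; the function and class names are illustrative:

```typescript
// "Close enough" is 10% of the frame diagonal, so behavior is identical
// at 640x480 and 1920x1080; "found" means 8 of the last 15 frames qualified.
const ARRIVE_FRACTION = 0.10;
const WINDOW = 15;
const NEEDED = 8;

function closeEnough(
  dxPx: number, dyPx: number, frameW: number, frameH: number
): boolean {
  const diagonal = Math.hypot(frameW, frameH);
  return Math.hypot(dxPx, dyPx) < ARRIVE_FRACTION * diagonal;
}

class ArrivalDetector {
  private hits: boolean[] = [];

  // Call once per tracked frame; returns true once arrival is declared.
  update(isClose: boolean): boolean {
    this.hits.push(isClose);
    if (this.hits.length > WINDOW) this.hits.shift();
    return this.hits.filter(Boolean).length >= NEEDED;
  }
}
```

The rolling window means a few jittery MediaPipe frames neither reset progress (as a consecutive-frames rule would) nor trigger a premature "Found it!".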
Phase-gated MediaPipe loading. The ~10MB MediaPipe WASM model is preloaded during Phase 2 (while the user is sweeping) so Phase 3 starts instantly with no loading delay.
```
git clone https://github.com/Karthikgaur8/IRIS.git
cd IRIS
npm install
```

Create `.env.local`:

```
GOOGLE_API_KEY=your-gemini-api-key
```

```
npm run dev
```

Open http://localhost:3000 on your laptop, or your dev machine's LAN address from a phone on the same WiFi network (note: `getUserMedia` requires HTTPS on non-localhost origins, so phone testing is easiest against the deployed URL).
```
vercel
vercel env add GOOGLE_API_KEY   # paste your key
vercel --prod
```

Open the Vercel URL on any Android phone with Chrome. Grant camera + microphone permissions. Plug in wired earbuds. Tap Start.
- Tap START → grant camera + mic permissions
- Say what you're looking for: "red earbuds case"
- Sweep your phone slowly over the table — feel vibrations intensify as you get closer
- When IRIS says "Guiding now" — prop the phone, move your hand into frame
- Follow the stereo beeps to the object
- "Found it!"
- All interactive elements have `aria-label` attributes
- Status updates use `aria-live="polite"` regions for screen reader compatibility
- Works with VoiceOver (iOS) and TalkBack (Android) for initial navigation
- Zero visual dependency during use — the entire UX is audio + haptic
- Voice input with text fallback if speech recognition is unavailable
- Haptics: `navigator.vibrate()` works on Android Chrome only. iOS Safari does not support it — haptic feedback is gracefully disabled.
- 2D camera: A single phone camera cannot perceive true depth. IRIS compensates by using Gemini's understanding of apparent object size as a proxy for distance during Phase 2.
- Latency: Each Gemini API call takes 0.5–1.5 seconds. Phase 2 (proximity) and Phase 3 (bbox) poll once per second — fast enough for a tabletop scenario, not for navigation.
- Bluetooth earbuds: Add 100–300ms audio latency, which desynchronizes the beeps from hand movement. Wired earbuds are recommended.
Built for a hackathon. Started as a Python prototype with OpenCV + MediaPipe + sounddevice, then ported to the web for universal phone access. The original spec called for the Gemini Live API, but standard generateContent proved more reliable for structured JSON responses (bounding boxes, proximity scores).
The name "Ariadne" references the Greek myth — the thread that guided Theseus through the labyrinth. IRIS (Intelligent Reach & Interaction System) is the deployment name.
MIT