
IRIS — Intelligent Reach & Interaction System

A spatial audio prosthetic that turns any smartphone into an object-finding guide for visually impaired users.

No special hardware. No wearables. No app install. Just open a URL on the phone you already own.


The Problem

2.2 billion people worldwide have some form of vision impairment. For many, the simple act of finding everyday objects — keys on a counter, a phone on a table, medication on a nightstand — requires asking someone for help or painstakingly sweeping their hands across a surface and hoping for contact.

Existing assistive apps like Be My Eyes and Seeing AI can describe what a camera sees: "Your keys are to the left." But description is where they stop. The user is left to translate a verbal hint into physical action, with no feedback on whether they're getting closer or drifting further away.

The gap: No existing solution provides continuous, real-time physical guidance from detection to touch.

What IRIS Does

IRIS bridges that gap with a three-phase closed-loop guidance system:

Phase 1 — Voice. The user taps Start and speaks naturally: "Find my keys." The app understands — no rigid commands, no button navigation.
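The "no rigid commands" behavior can be sketched as a small utterance normalizer. The filler patterns and the `parseTarget` name are illustrative assumptions, not the app's actual code: the real system may hand the raw utterance to Gemini instead.

```typescript
// Hypothetical sketch: normalize a free-form Web Speech API utterance into a
// target description. Common filler phrases ("find my", "where are", "look
// for") are stripped; whatever remains is treated as the object description.
const FILLERS = [
  /^(please\s+)?(find|locate|look for|search for|where (is|are))\s+/i,
  /^(my|the|a|an)\s+/i,
];

function parseTarget(utterance: string): string {
  let target = utterance.trim().toLowerCase();
  for (const filler of FILLERS) {
    target = target.replace(filler, "");
  }
  return target.trim();
}
```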

Phase 2 — Sweep. The user holds their phone and slowly sweeps it across the table. IRIS sends one camera frame per second to Gemini, which judges how close the phone is to the target object. The phone vibrates faster as it gets closer — like a metal detector for everyday objects. Gemini also provides directional hints ("try moving left"), which are spoken aloud.
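The metal-detector mapping can be sketched as a pure function (the score range, pulse timings, and function name are assumptions; the README does not specify them). On Android Chrome the pulse itself would be fired with `navigator.vibrate(PULSE_MS)`:

```typescript
const PULSE_MS = 50;          // duration of each haptic pulse
const MIN_INTERVAL_MS = 100;  // near the target: rapid buzzing
const MAX_INTERVAL_MS = 1200; // far away: slow ticks

// Map an assumed 0-1 proximity score from Gemini to a pulse interval:
// closer means faster pulses.
function pulseInterval(proximity: number): number {
  const p = Math.min(1, Math.max(0, proximity)); // clamp to [0, 1]
  // Linear interpolation: proximity 0 -> MAX, proximity 1 -> MIN.
  return Math.round(MAX_INTERVAL_MS - p * (MAX_INTERVAL_MS - MIN_INTERVAL_MS));
}
```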

Phase 3 — Touch. When the phone is directly over the target, IRIS says "Guiding now." The user props the phone and moves their hand into the camera view. Stereo audio beeps guide the hand: pitch and tempo increase with proximity, stereo panning indicates direction. When the hand reaches the object, IRIS confirms: "Found it!"
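As a sketch of the Phase 3 audio mapping (the frequency range and function shape are assumptions): the hand-to-target offset drives the value written to a StereoPannerNode's pan, and normalized distance drives the OscillatorNode's pitch.

```typescript
const MIN_HZ = 300;  // far from target: low tone
const MAX_HZ = 1200; // on target: high tone

// dx, dy: pixel offset from hand to target; frameDiag: frame diagonal in px.
// Returns values that would be written to OscillatorNode.frequency and
// StereoPannerNode.pan on each animation frame.
function beepParams(dx: number, dy: number, frameDiag: number) {
  const dist = Math.min(1, Math.hypot(dx, dy) / frameDiag); // 0 = on target
  return {
    frequency: MAX_HZ - dist * (MAX_HZ - MIN_HZ),           // closer -> higher
    pan: Math.max(-1, Math.min(1, (dx / frameDiag) * 2)),   // left/right steer
  };
}
```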

The entire interaction — from "find my keys" to fingers on the keys — takes about 30 seconds, eyes closed.

Why This Is Different

                        Be My Eyes             Seeing AI            IRIS
Identifies objects      Yes (human + AI)       Yes (AI)             Yes (Gemini AI)
Describes location      Yes ("to your left")   Limited              Yes
Guides you there        No                     No                   Yes — continuous feedback until touch
Confirms you found it   No                     No                   Yes — audio + Gemini visual confirmation
Works on any object     Yes                    Fixed classes only   Yes — describe anything in natural language
Requires install        App download           App download         No — runs in browser
Requires hardware       Phone                  Phone                Phone (same one you have)
Feedback modality       Voice description      Voice + some haptic  Haptic + spatial audio + voice

Be My Eyes is a pair of remote eyes. IRIS is a pair of remote hands.

The key technical differentiator: IRIS uses Gemini 2.5 Flash as a zero-shot semantic object detector. Unlike YOLO or MobileNet, which are trained on fixed object classes, Gemini can find anything you can describe in words — "the small white pill bottle behind the mug" — with no retraining. This makes IRIS flexible enough to handle essentially any object a user can name.
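A minimal sketch of the detection request, assuming the standard Gemini generateContent payload shape (a text part plus an inline JPEG part). The prompt wording and response schema here are assumptions about how IRIS phrases the task, not the repository's actual prompt:

```typescript
// Build a generateContent request body asking Gemini to locate an arbitrary,
// natural-language target and reply with a normalized bounding box.
function detectionRequest(target: string, jpegBase64: string) {
  return {
    contents: [{
      parts: [
        {
          text:
            `Find "${target}" in this image. Respond with JSON only: ` +
            `{"found": boolean, "box": [ymin, xmin, ymax, xmax]} ` +
            `with coordinates normalized to 0-1000.`,
        },
        { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } },
      ],
    }],
  };
}
```

The server-side Vercel route would POST this body to the Gemini REST endpoint with the API key attached, so the key never reaches the browser.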

How It Works (Technical)

Architecture

┌─────────────────────────────────────────────────────────┐
│                    Phone Browser                         │
│                                                          │
│  Camera (getUserMedia) ──→ Frame capture (JPEG)          │
│          │                        │                      │
│          ▼                        ▼ (1 req/sec)          │
│  MediaPipe WASM ◄──┐     Vercel API Routes               │
│  Hand tracking     │     ┌────────────────────┐          │
│  ~30fps client     │     │ /api/proximity      │──→ Gemini│
│          │         │     │ /api/detect         │   2.5    │
│          ▼         │     │ /api/confirm        │   Flash  │
│  Geometry Engine   │     └────────────────────┘          │
│  (distance + pan)  │              │                      │
│          │         │              ▼                      │
│          ▼         │     Bounding box / proximity        │
│  Sonifier ─────────┘                                     │
│          │                                               │
│          ▼                                               │
│  Web Audio API          navigator.vibrate()              │
│  (stereo beeps)         (haptic pulses)                  │
│          │                     │                         │
│          ▼                     ▼                         │
│      Earbuds              Phone motor                    │
└─────────────────────────────────────────────────────────┘

Everything except Gemini runs client-side. Hand tracking, geometry, audio synthesis, haptics, and speech all execute on the phone. The only server round-trip is one JPEG frame per second to the Gemini API through a Vercel edge function (which also keeps the API key server-side and secure).

Tech Stack

  • Framework: Next.js 14 (App Router), TypeScript, Tailwind CSS
  • Deployment: Vercel
  • Object Detection: Google Gemini 2.5 Flash (zero-shot, via REST API)
  • Hand Tracking: MediaPipe HandLandmarker (WASM, client-side, ~30fps)
  • Audio: Web Audio API (OscillatorNode + StereoPannerNode)
  • Haptics: Vibration API (navigator.vibrate(), Android Chrome)
  • Speech: Browser SpeechRecognition + SpeechSynthesis (free, no API)

Key Design Decisions

Bbox smoothing. Gemini returns slightly different bounding boxes each poll. Raw coordinates cause the target to "jump," confusing the audio guidance. IRIS maintains a rolling average of the last 3 bbox centers, giving stable guidance while still tracking movement.
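The smoother described above can be sketched in a few lines (the class name is illustrative): keep the last 3 bbox centers and guide toward their mean.

```typescript
const CENTER_WINDOW = 3; // number of recent bbox centers to average

class BboxSmoother {
  private centers: Array<[number, number]> = [];

  // Record the latest bbox center and return the rolling average.
  push(cx: number, cy: number): [number, number] {
    this.centers.push([cx, cy]);
    if (this.centers.length > CENTER_WINDOW) this.centers.shift();
    const n = this.centers.length;
    const sx = this.centers.reduce((s, [x]) => s + x, 0);
    const sy = this.centers.reduce((s, [, y]) => s + y, 0);
    return [sx / n, sy / n];
  }
}
```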

Resolution-independent arrival detection. The "arrived" threshold is 10% of the frame diagonal, not a fixed pixel count. This ensures consistent behavior whether the camera provides 640×480 or 1920×1080.
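A sketch of that check (parameter names are assumptions): the hand counts as "arrived" when its distance to the target is under 10% of the frame diagonal, whatever the resolution.

```typescript
const ARRIVAL_FRACTION = 0.1; // 10% of the frame diagonal

// All coordinates are in pixels of the same camera frame.
function hasArrived(
  handX: number, handY: number,
  targetX: number, targetY: number,
  frameW: number, frameH: number,
): boolean {
  const diag = Math.hypot(frameW, frameH);
  const dist = Math.hypot(handX - targetX, handY - targetY);
  return dist < ARRIVAL_FRACTION * diag;
}
```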

Rolling window arrival. Instead of requiring N consecutive frames where the hand is "close enough" (which fails due to MediaPipe jitter), IRIS uses a rolling window: if 8 of the last 15 frames register arrival, it declares found. This tolerates natural hand tremor without false positives.
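The rolling-window logic is simple enough to sketch directly (class name is illustrative): declare "found" once 8 of the last 15 per-frame arrival flags are true.

```typescript
const FRAME_WINDOW = 15; // frames of history to keep
const REQUIRED = 8;      // close frames needed within the window

class ArrivalDetector {
  private flags: boolean[] = [];

  // Feed one per-frame "close enough" flag; returns true once arrival
  // is confirmed despite jittery individual frames.
  update(closeEnough: boolean): boolean {
    this.flags.push(closeEnough);
    if (this.flags.length > FRAME_WINDOW) this.flags.shift();
    return this.flags.filter(Boolean).length >= REQUIRED;
  }
}
```

A consecutive-frame rule would reset on every jittery miss; the windowed count only requires a majority of recent frames, which is why tremor does not block detection.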

Phase-gated MediaPipe loading. The ~10MB MediaPipe WASM model is preloaded during Phase 2 (while the user is sweeping) so Phase 3 starts instantly with no loading delay.
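The preload pattern can be sketched as a memoized promise: kick off the slow download during Phase 2, then await the same promise when Phase 3 needs the model. The `getHandLandmarker` wrapper below is a hypothetical example, not the repository's code.

```typescript
// Wrap an async loader so the underlying work runs at most once; every
// caller shares the same in-flight (or resolved) promise.
function memoizePreload<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => (pending ??= load());
}

// Hypothetical usage with MediaPipe's HandLandmarker:
// const getHandLandmarker = memoizePreload(() =>
//   HandLandmarker.createFromOptions(vision, options));
// Phase 2: void getHandLandmarker();              // warm the cache while sweeping
// Phase 3: const model = await getHandLandmarker(); // instant if already loaded
```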

Running Locally

git clone https://github.com/Karthikgaur8/IRIS.git
cd IRIS
npm install

Create .env.local with your Gemini API key:

GOOGLE_API_KEY=your-gemini-api-key

Then start the dev server:

npm run dev

Open http://localhost:3000 on your laptop. To test on a phone, use your computer's address on the same WiFi network — but note that browsers only allow camera and microphone access over HTTPS (or on localhost), so the Vercel deployment is the easiest path for phone testing.

Deploying to Vercel

vercel
vercel env add GOOGLE_API_KEY    # paste your key
vercel --prod

Open the Vercel URL on any Android phone with Chrome. Grant camera + microphone permissions. Plug in wired earbuds. Tap Start.

Usage

  1. Tap START → grant camera + mic permissions
  2. Say what you're looking for: "red earbuds case"
  3. Sweep your phone slowly over the table — feel vibrations intensify as you get closer
  4. When IRIS says "Guiding now" — prop the phone, move your hand into frame
  5. Follow the stereo beeps to the object
  6. "Found it!"

Accessibility

  • All interactive elements have aria-label attributes
  • Status updates use aria-live="polite" regions for screen reader compatibility
  • Works with VoiceOver (iOS) and TalkBack (Android) for initial navigation
  • Zero visual dependency during use — the entire UX is audio + haptic
  • Voice input with text fallback if speech recognition is unavailable

Limitations

  • Haptics: navigator.vibrate() works on Android Chrome only. iOS Safari does not support it — haptic feedback is gracefully disabled.
  • 2D camera: A single phone camera cannot perceive true depth. IRIS compensates by using Gemini's understanding of apparent object size as a proxy for distance during Phase 2.
  • Latency: Each Gemini API call takes 0.5–1.5 seconds. Phase 2 (proximity) and Phase 3 (bbox) poll once per second — fast enough for a tabletop scenario, not for navigation.
  • Bluetooth earbuds: Add 100–300ms audio latency, which desynchronizes the beeps from hand movement. Wired earbuds are recommended.

Project Origin

Built for a hackathon. Started as a Python prototype with OpenCV + MediaPipe + sounddevice, then ported to the web for universal phone access. The original spec called for the Gemini Live API, but standard generateContent proved more reliable for structured JSON responses (bounding boxes, proximity scores).

The original codename "Ariadne" references the Greek myth — the thread that guided Theseus out of the labyrinth. IRIS (Intelligent Reach & Interaction System) is the deployment name.

License

MIT
