Eyeris — AI Visual Assistant

Real-time AI-powered visual assistance for blind and low-vision users. Runs entirely in the browser — no install, no backend, just open a link and go.

Built with Railtracks | For GenAI Genesis 2026 | By Jacob Mobin

What is Eyeris?

Eyeris is a browser-based AI companion that helps blind and low-vision users understand their surroundings in real time. Point your phone's camera at the world, talk naturally, and get instant spoken descriptions of what's around you.

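It combines Gemini 2.5 Flash for scene understanding, Depth Anything V2 for on-device spatial awareness, OpenAI Whisper for speech recognition, and ElevenLabs Flash v2.5 for natural text-to-speech. Depth estimation runs on the device itself; the Gemini, Whisper, and ElevenLabs calls go directly from the browser to each provider's API, so the project needs no server infrastructure of its own.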

Key Features

Real-time object detection — Bounding boxes with labels track objects in the camera feed at ~2.5s intervals
On-device depth estimation — Depth Anything V2 Small runs via WebGPU/WASM to sense how close objects are, with depth-colored overlays on detected objects
Natural voice conversation — Speak naturally and get instant spoken responses; supports multi-turn dialogue with conversation history
Three integrated modes — SCAN (continuous scene analysis), READ (full scene + text description), FIND (locate specific objects)
Voice Activity Detection — RMS-based VAD with Whisper transcription means hands-free, always-listening interaction
Barge-in support — Interrupt Eyeris mid-sentence and it stops to listen
Obstacle alerts — Automatic haptic + audio warnings when objects are dangerously close
Accessibility-first design — ARIA labels, screen reader support, high-contrast Bauhaus design system
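
The RMS-based voice activity detection listed above can be sketched roughly as follows. This is not the code in continuousListener.js: the threshold values and the names `rms`, `createVad`, and `hangFrames` are assumptions chosen for illustration.

```javascript
// Root-mean-square level of one audio frame (samples in [-1, 1]).
function rms(samples) {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return samples.length ? Math.sqrt(sumSquares / samples.length) : 0;
}

// Hypothetical defaults; real values would be tuned per device and room.
const VAD_DEFAULTS = { onThreshold: 0.02, offThreshold: 0.01, hangFrames: 3 };

// Small state machine: returns 'start' when speech begins, 'end' after
// `hangFrames` consecutive quiet frames, and null otherwise. The lower
// off-threshold (hysteresis) keeps brief dips from ending an utterance.
function createVad(options = {}) {
  const { onThreshold, offThreshold, hangFrames } = { ...VAD_DEFAULTS, ...options };
  let speaking = false;
  let quietFrames = 0;
  return function push(samples) {
    const level = rms(samples);
    if (!speaking) {
      if (level >= onThreshold) {
        speaking = true;
        quietFrames = 0;
        return 'start';
      }
      return null;
    }
    quietFrames = level < offThreshold ? quietFrames + 1 : 0;
    if (quietFrames >= hangFrames) {
      speaking = false;
      return 'end';
    }
    return null;
  };
}
```

In a browser, each frame could come from `AnalyserNode.getFloatTimeDomainData()`, with 'start' and 'end' marking when to begin and stop the recording that is sent for transcription.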

Demo

Live Demo: Run the dev server (see Setup below), then open http://localhost:5173

How It Works

Launch — Open the app, grant camera + mic permissions
Eyeris greets you — "Hey, how can I help?"
Talk naturally — Ask anything: "What's in front of me?", "Read that sign", "Where's the door?"
Get instant answers — Eyeris responds with natural speech while updating visual overlays in real time

Architecture

┌─────────────────────────────────────────────────────┐
│                    BROWSER (Client-Side)             │
│                                                      │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Camera   │  │ Gemini 2.5   │  │ Depth Anything│  │
│  │ Feed     │──│ Flash (REST) │  │ V2 (WebGPU)   │  │
│  └──────────┘  └──────┬───────┘  └───────┬───────┘  │
│       │               │                  │           │
│  ┌────▼────┐   ┌──────▼───────┐  ┌──────▼───────┐   │
│  │ Whisper │   │ Bounding Box │  │ Depth Overlay │   │
│  │ (STT)   │   │ + Captions   │  │ + Mini-Map    │   │
│  └────┬────┘   └──────────────┘  └──────────────┘   │
│       │                                              │
│  ┌────▼──────────────────────────────────────────┐   │
│  │         ElevenLabs Flash v2.5 (TTS)           │   │
│  │    Sentence-chunked streaming for low TTFB    │   │
│  └───────────────────────────────────────────────┘   │
│                                                      │
│  ┌───────────────────────────────────────────────┐   │
│  │  React 19 + Zustand + Framer Motion + Tailwind│   │
│  │  Bauhaus Design System · Vite 5 Dev Server    │   │
│  └───────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘

Tech Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Vision AI | Gemini 2.5 Flash (REST API) | Scene analysis, object detection, voice responses |
| Depth Sensing | Depth Anything V2 Small (`@huggingface/transformers`) | On-device depth estimation via WebGPU with WASM fallback |
| Speech-to-Text | OpenAI Whisper API (`whisper-1`) | Fast, accurate voice transcription |
| Text-to-Speech | ElevenLabs Flash v2.5 | Low-latency sentence-chunked streaming TTS |
| Frontend | React 19, Zustand, Framer Motion | UI state management and animations |
| Styling | Tailwind CSS 3 | Bauhaus-inspired high-contrast design system |
| Icons | Lucide React | Consistent iconography |
| Build | Vite 5 | Fast HMR dev server with ES module workers |

Project Structure

vision-companion/
├── public/
│   └── demo/                  # Landing page demo assets
├── src/
│   ├── components/
│   │   ├── LandingPage.jsx    # Hero + animated phone mockup
│   │   ├── MainView.jsx       # Camera view orchestrator
│   │   ├── CameraFeed.jsx     # getUserMedia + depth init
│   │   ├── OverlayCanvas.jsx  # Bounding boxes + depth masks
│   │   ├── DepthMiniMap.jsx   # Real-time depth heatmap
│   │   ├── CaptionBar.jsx     # Scene captions + thinking state
│   │   ├── ControlBar.jsx     # Mode switching + mic/speaker
│   │   ├── AvatarView.jsx     # Speaking/thinking avatar
│   │   ├── StatusIndicator.jsx # Connection + FPS badges
│   │   ├── SafetyBanner.jsx   # Obstacle warnings
│   │   └── OnboardingModal.jsx # First-run tutorial
│   ├── services/
│   │   ├── geminiService.js   # Gemini API (scan + voice streaming)
│   │   ├── agentLoop.js       # Always-on scan loop + voice handler
│   │   ├── continuousListener.js # Whisper VAD + transcription
│   │   ├── ttsService.js      # ElevenLabs streaming TTS
│   │   ├── depthService.js    # Depth Anything V2 pipeline
│   │   └── memoryService.js   # IndexedDB conversation memory
│   ├── utils/
│   │   ├── depthMask.js       # Depth-based object masks
│   │   ├── depthColorMap.js   # Depth value → RGB mapping
│   │   ├── bboxMapper.js      # Gemini bbox → screen coords
│   │   └── frameCapture.js    # Video frame → base64
│   ├── store/
│   │   └── useAppStore.js     # Zustand global state
│   ├── config.js              # API keys + tuning constants
│   └── main.jsx               # App entry point
├── index.html
├── vite.config.js
├── tailwind.config.js
└── package.json

Setup

Prerequisites

Node.js 18+
A modern browser with WebGPU support (Chrome 113+, Edge 113+) for optimal depth performance; falls back to WASM automatically
API keys for Gemini, OpenAI (Whisper), and ElevenLabs
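
The WebGPU-to-WASM fallback mentioned above might be decided with a small helper like the one below. It is a sketch, not the logic in depthService.js: `chooseDepthOptions` and its two input flags are hypothetical names, and the `q8` choice for WASM assumes the model repository ships quantized weights.

```javascript
// Picks inference options from detected browser capabilities.
// The returned object is shaped like the options accepted by the
// Transformers.js `pipeline()` factory (`device` and `dtype`).
function chooseDepthOptions({ hasWebGpu = false, hasShaderF16 = false } = {}) {
  if (hasWebGpu) {
    // fp16 on WebGPU needs the adapter's 'shader-f16' feature; otherwise use fp32.
    return { device: 'webgpu', dtype: hasShaderF16 ? 'fp16' : 'fp32' };
  }
  // Assumed fallback: quantized weights on the WASM backend.
  return { device: 'wasm', dtype: 'q8' };
}
```

In a browser, the flags could be filled in from `navigator.gpu.requestAdapter()`: a non-null adapter means WebGPU is usable, and `adapter.features.has('shader-f16')` indicates half-precision support.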

Installation

cd vision-companion
npm install

Configuration

Create or edit src/config.js with your API keys:

export const GEMINI_API_KEY = 'your-gemini-api-key';
export const GEMINI_MODEL = 'gemini-2.5-flash-preview-04-17';
export const OPENAI_API_KEY = 'your-openai-api-key';
export const ELEVENLABS_API_KEY = 'your-elevenlabs-api-key';
export const ELEVENLABS_VOICE_ID = 'your-voice-id';

Run

npm run dev

Open http://localhost:5173 in Chrome. Grant camera and microphone permissions when prompted.

Build for Production

npm run build
npm run preview

How the Modes Work

SCAN Mode (Default)

Continuously analyzes the camera feed every ~2.5 seconds. Detects objects, draws labeled bounding boxes with depth-colored overlays, and updates scene captions. The scan loop runs silently in the background across all modes to keep overlays fresh.
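
Drawing a box requires converting the model's coordinates to canvas pixels. The sketch below assumes the prompt asks Gemini for boxes in its documented `[ymin, xmin, ymax, xmax]` layout normalized to 0-1000; the project's bboxMapper.js may use a different format, and `mapBoxToCanvas` is a name invented here.

```javascript
// Converts a [ymin, xmin, ymax, xmax] box normalized to 0-1000 into
// pixel coordinates for a canvas of the given size.
function mapBoxToCanvas(box2d, canvasWidth, canvasHeight) {
  // Clamp first so a slightly out-of-range value cannot draw off-canvas.
  const [ymin, xmin, ymax, xmax] = box2d.map((v) => Math.min(1000, Math.max(0, v)));
  return {
    x: (xmin / 1000) * canvasWidth,
    y: (ymin / 1000) * canvasHeight,
    width: ((xmax - xmin) / 1000) * canvasWidth,
    height: ((ymax - ymin) / 1000) * canvasHeight,
  };
}
```

This version assumes the canvas shows the whole captured frame. If the video is cropped to fill the screen (for example with `object-fit: cover`), the crop offset and scale would need to be applied as well.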

READ Mode

Triggers a one-shot full scene description: spatial layout, object positions, and any visible text read word-for-word. Useful for getting a comprehensive understanding of an unfamiliar environment.

FIND Mode

Optimized for locating specific objects. Ask "Where's the coffee cup?" or "Find the exit sign" and Eyeris will identify and highlight the target with a pulsing bounding box while giving spatial directions.
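
One simple way to turn a detected box into spoken directions is to bucket its horizontal center, as sketched below. The function name, the five buckets, and their cut-offs are assumptions for illustration and are not taken from the project.

```javascript
// Describes where a box sits horizontally within the frame.
// `box` uses pixel units: { x, width }, measured from the left edge.
function describeDirection(box, frameWidth) {
  const center = (box.x + box.width / 2) / frameWidth; // 0 = left edge, 1 = right edge
  if (center < 0.2) return 'far left';
  if (center < 0.4) return 'slightly left';
  if (center <= 0.6) return 'straight ahead';
  if (center <= 0.8) return 'slightly right';
  return 'far right';
}
```

For example, `describeDirection({ x: 450, width: 100 }, 1000)` returns `'straight ahead'`. If the preview is mirrored, as is common with front-facing cameras, left and right would need to be swapped.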

Voice (Always-On)

All three modes support natural multi-turn conversation. Eyeris maintains a conversation history buffer (last 3 exchanges) so you can ask follow-up questions like "What color is it?" or "How far away?" without repeating context.
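
A bounded history like the one described here can be kept in a small buffer. The following is a minimal sketch: `createHistory` and the `{ role, text }` turn shape are hypothetical and would need adapting to whatever request format geminiService.js actually builds.

```javascript
// Keeps only the most recent `maxExchanges` user/assistant pairs.
function createHistory(maxExchanges = 3) {
  const exchanges = [];
  return {
    add(user, assistant) {
      exchanges.push({ user, assistant });
      if (exchanges.length > maxExchanges) exchanges.shift(); // drop the oldest pair
    },
    // Flattens the buffer into alternating turns, oldest first.
    toTurns() {
      return exchanges.flatMap(({ user, assistant }) => [
        { role: 'user', text: user },
        { role: 'model', text: assistant },
      ]);
    },
  };
}
```

Prepending `toTurns()` to each new request is what lets a follow-up such as "What color is it?" resolve "it" against the previous answer.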

Technical Highlights

Zero-backend architecture — Everything runs client-side. API calls go directly from the browser to Gemini, Whisper, and ElevenLabs. No proxy server needed.
On-device depth estimation — Depth Anything V2 Small runs via @huggingface/transformers with WebGPU (fp16) primary and WASM fallback. Cross-origin isolation headers (COOP + COEP: credentialless) enable SharedArrayBuffer for ONNX runtime threading.
Sentence-chunked TTS streaming — Responses are split into sentences and streamed to ElevenLabs in parallel, achieving sub-second time-to-first-audio.
Voice Activity Detection — Custom RMS-based VAD using AudioContext AnalyserNode detects speech onset/offset with configurable thresholds, avoiding false triggers from background noise.
Echo suppression — TTS playback sets a ttsActive flag that suppresses recording, with a 600 ms cooldown after TTS ends so the microphone does not pick up Eyeris's own speech.
Depth-aware object overlays — Objects are masked using median depth sampling within the bounding box center, creating silhouette overlays that match the object's actual shape rather than a simple rectangle fill.
Thinking state UX — When the model is processing, an animated edge glow and pulsing dots provide clear visual feedback that Eyeris is working.
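
The sentence-chunking step can be illustrated with a small buffer that emits complete sentences as streamed text arrives. This is a sketch under simple assumptions, not the code in ttsService.js: `createSentenceChunker` is an invented name, and the punctuation rule is deliberately naive (an abbreviation such as "Dr. Smith" would be split).

```javascript
// Accumulates streamed text and returns sentences as soon as they are complete.
function createSentenceChunker() {
  let buffer = '';
  return {
    push(text) {
      buffer += text;
      const sentences = [];
      let match;
      // A sentence counts as complete once its end punctuation is
      // followed by whitespace, so "3.5" is not split mid-number.
      while ((match = /^([\s\S]*?[.!?]+)\s+/.exec(buffer)) !== null) {
        sentences.push(match[1].trim());
        buffer = buffer.slice(match[0].length);
      }
      return sentences;
    },
    // Call when the stream ends to release whatever text remains.
    flush() {
      const rest = buffer.trim();
      buffer = '';
      return rest ? [rest] : [];
    },
  };
}
```

Each sentence returned by `push()` could be sent for synthesis right away, so audio for the first sentence can start while later ones are still being generated.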

Accessibility

Eyeris is built accessibility-first:

All interactive elements have ARIA labels and roles
High-contrast Bauhaus design system with bold typography (Outfit font)
aria-live regions for dynamic caption updates
Haptic feedback (navigator.vibrate) for obstacle proximity alerts
Fully operable via voice — no touch interaction required after launch
Screen reader compatible throughout
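
The obstacle alerts could be driven by a median depth sample plus a threshold, roughly as sketched below. Everything here is an assumption for illustration: the depth map is taken to be a row-major array normalized to 0-1 with larger values meaning closer, and the function names, thresholds, and vibration patterns are invented.

```javascript
// Median of the depth values in the central region of a bounding box.
// `depth` is a row-major array of size width * height; `box` is in pixels.
function medianCenterDepth(depth, width, height, box, fraction = 0.5) {
  const cx = box.x + box.width / 2;
  const cy = box.y + box.height / 2;
  const halfW = (box.width * fraction) / 2;
  const halfH = (box.height * fraction) / 2;
  const x0 = Math.max(0, Math.floor(cx - halfW));
  const x1 = Math.min(width, Math.ceil(cx + halfW));
  const y0 = Math.max(0, Math.floor(cy - halfH));
  const y1 = Math.min(height, Math.ceil(cy + halfH));
  const values = [];
  for (let y = y0; y < y1; y++) {
    for (let x = x0; x < x1; x++) values.push(depth[y * width + x]);
  }
  if (values.length === 0) return null;
  values.sort((a, b) => a - b);
  const mid = Math.floor(values.length / 2);
  return values.length % 2 ? values[mid] : (values[mid - 1] + values[mid]) / 2;
}

// Maps closeness to a vibration pattern in milliseconds, or null for no alert.
function alertPattern(closeness) {
  if (closeness === null || closeness < 0.7) return null; // hypothetical threshold
  return closeness >= 0.9 ? [200, 100, 200, 100, 200] : [200];
}
```

In the browser, a non-null pattern could be passed to `navigator.vibrate(pattern)` after checking that the method exists, since some browsers (Safari on iOS, for example) do not implement the Vibration API and would need the audio warning alone.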

Prize Categories

This project is submitted for:

Best Generative AI Hack — Core submission
(Google) Best AI for Community Impact — AI-powered accessibility tool enabling independence for blind/low-vision users

Development Notes

This project uses @vitejs/plugin-react (Oxc parser) for Fast Refresh. The alternative @vitejs/plugin-react-swc (SWC) is available if you prefer faster transforms.

The React Compiler is not enabled by default due to its impact on dev/build performance — opt in via the Vite config if needed.

For production use, consider adding TypeScript with type-aware lint rules via the TS template and typescript-eslint.

Acknowledgments

Gemini 2.5 Flash by Google DeepMind
Depth Anything V2 by TikTok/ByteDance
ElevenLabs for real-time TTS
OpenAI Whisper for speech recognition
Hugging Face Transformers.js for in-browser ML inference
Built with Railtracks

License

MIT

Built for GenAI Genesis 2026 by Jacob Mobin
