sssynk/h4h-ar

Prism

Assisted navigation for the visually impaired, powered by Meta Ray-Ban smart glasses, real-time person detection, and cloud-based vision-language guidance.

Prism streams video from Meta Ray-Ban glasses, runs on-device person detection to warn users of nearby pedestrians, and sends frames to a cloud VLM for contextual navigation guidance — all spoken aloud through ElevenLabs TTS with priority-based audio coordination. Haptic feedback pulses faster as users turn toward their next waypoint, giving them a non-visual sense of direction.

How It Works

The app connects to Meta Ray-Ban glasses over Bluetooth, pulls a live video stream, and processes each frame two ways:

  1. On-device YOLO inference detects people and triggers spoken alerts ("person on the left") or a stop sound when someone fills most of the frame.
  2. Cloud VLM requests send a snapshot plus navigation context (current instruction, heading offset, distance to next turn) to an AMD cloud endpoint. The model returns a short guidance sentence that gets spoken to the user.

Meanwhile, CoreLocation tracks the user's GPS position and compass heading. The app computes the bearing to the next turn, derives a heading offset, and drives a haptic engine that pulses at variable rates — rapid when facing the right direction, slow when off-course, a single firm tap when locked on.

Architecture

Meta Ray-Ban Glasses
        │ BLE video stream
        ▼
┌──────────────────────┐
│  StreamSessionVM     │──── frames ──┬──► YOLODetector (on-device)
│  (MWDATCamera SDK)   │              │       │
└──────────────────────┘              │       ├─► "Person on the left/right"
                                      │       └─► Stop sound (≥75% of frame)
                                      │
                                      └──► VLMService (cloud)
                                              │
                                              └─► Contextual guidance
                                                     │
                                                     ▼
                                            SpeechPriorityManager
                                                     │
                                              ┌──────┴──────┐
                                              │ High: VLM   │
                                              │ Low: Alerts │
                                              └──────┬──────┘
                                                     │
                                                     ▼
                                              ElevenLabs TTS
                                                     │
                                                     ▼
                                            Speaker / AirPods

CoreLocation ──► NavigationViewModel ──► HapticGuidanceEngine
  (GPS + heading)    (bearing calc)         (variable pulse rate)

Services

VLMService

Sends camera frames to the cloud VLM at h4h.excused.ai/navigate/json. Each request includes:

  • image_base64 — JPEG snapshot from the glasses
  • direction — current step instruction (e.g. "Turn right onto El Camino Real")
  • turn_degrees — heading offset from the user's current facing to the target bearing. Negative = turn left, positive = turn right.
  • distance — live distance to the next turn (e.g. "350 ft")

The server runs a vision-language model on AMD cloud GPUs, analyzes the scene in context, and returns a short guidance sentence. Requests fire every 8 seconds during navigation, plus an immediate snapshot when the user is within ~10 feet of a turn.
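
As a rough illustration of the request body described above, here is a minimal Swift sketch. The four JSON field names come from the list; the type name NavigateRequest, the helper makeNavigateBody, and the Swift types chosen for each field (for example, a Double for turn_degrees) are assumptions, not the app's actual code.

```swift
import Foundation

// Hypothetical request body. Field names follow the list above;
// the Swift types are assumptions.
struct NavigateRequest: Codable {
    let imageBase64: String   // JPEG snapshot, base64-encoded
    let direction: String     // current step instruction
    let turnDegrees: Double   // negative = turn left, positive = turn right
    let distance: String      // live distance to the next turn, e.g. "350 ft"

    // Map Swift property names to the snake_case keys the server expects.
    enum CodingKeys: String, CodingKey {
        case imageBase64 = "image_base64"
        case direction
        case turnDegrees = "turn_degrees"
        case distance
    }
}

/// Encodes a frame plus navigation context as a JSON request body.
func makeNavigateBody(jpeg: Data, direction: String,
                      turnDegrees: Double, distance: String) throws -> Data {
    let request = NavigateRequest(
        imageBase64: jpeg.base64EncodedString(),
        direction: direction,
        turnDegrees: turnDegrees,
        distance: distance
    )
    return try JSONEncoder().encode(request)
}
```

The resulting Data would be sent as the body of a POST to the endpoint above; the shape of the response is not documented here, so it is left out of the sketch.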

YOLODetector

Runs a custom-trained YOLOv8 model (yolov8_h4h_v2) on-device via CoreML and the Vision framework. Processes frames at ~5 FPS and keeps only person detections with ≥45% confidence and a bounding box covering at least 2% of the frame. Detection results drive both the bounding box overlay on the video feed and the person alert system in NavigationViewModel.
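
The filtering step can be sketched in plain Swift, independent of CoreML. The Detection type, the "person" label string, and the function name filterPeople are hypothetical; the sketch also assumes box sizes are normalized to 0...1 (the convention Vision uses for bounding boxes), so that area is a fraction of the frame.

```swift
import Foundation

// Hypothetical detection type. Width and height are normalized to 0...1,
// so width * height is the fraction of the frame the box covers.
struct Detection {
    let label: String
    let confidence: Double
    let width: Double
    let height: Double

    var area: Double { width * height }
}

/// Keeps only person detections that clear the confidence and size thresholds.
/// The defaults mirror the values quoted above: 45% confidence, 2% area.
func filterPeople(_ detections: [Detection],
                  minConfidence: Double = 0.45,
                  minArea: Double = 0.02) -> [Detection] {
    detections.filter {
        $0.label == "person"
            && $0.confidence >= minConfidence
            && $0.area >= minArea
    }
}
```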

HapticGuidanceEngine

Maps the user's heading offset to haptic pulse intervals using UIImpactFeedbackGenerator:

Offset      Interval           Feel
0-5°        Single heavy tap   Locked on target
5-15°       0.1s               Rapid buzzing
15-30°      0.2s               Fast
30-60°      0.4s               Moderate
60-90°      0.7s               Slow
90-150°     1.2s               Very slow
150-180°    2.0s               Behind

Uses a self-rescheduling timer that adjusts on every heading update. Outside the 0-5° locked-on band, pulses repeat continuously at every angle, so the user always knows the system is active.
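
The offset-to-interval table can be expressed as a small pure function. This is a sketch rather than the engine's real code: the name pulseInterval is hypothetical, and because adjacent bands in the table share their endpoints, the sketch assumes a boundary value belongs to the wider band (5° pulses at 0.1s rather than counting as locked on).

```swift
import Foundation

/// Returns the pause between haptic pulses for a heading offset in degrees,
/// or nil in the locked-on band, where a single heavy tap replaces the pulse train.
/// Assumption: each band includes its lower bound and excludes its upper bound.
func pulseInterval(forOffset offset: Double) -> TimeInterval? {
    // Left and right offsets feel the same, so only the magnitude matters.
    let magnitude = min(abs(offset), 180)
    switch magnitude {
    case ..<5:   return nil   // locked on target
    case ..<15:  return 0.1   // rapid buzzing
    case ..<30:  return 0.2   // fast
    case ..<60:  return 0.4   // moderate
    case ..<90:  return 0.7   // slow
    case ..<150: return 1.2   // very slow
    default:     return 2.0   // target is behind the user
    }
}
```

A self-rescheduling timer would call this on each heading update and use the result as the delay before its next pulse.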

ElevenLabsTTSService

Synthesizes speech through the ElevenLabs API using the "Sarah" voice on the eleven_flash_v2_5 model. Returns MP3 audio data that gets played through AVAudioPlayer. Falls back to iOS system TTS if the API call fails.
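
For orientation, a text-to-speech call against the public ElevenLabs REST API might be assembled like this. The function name makeTTSRequest is hypothetical, and voiceID and apiKey are placeholders supplied by the caller (the ID for the "Sarah" voice is not given in this README). The sketch only builds the request; sending it, playing the MP3, and falling back to system TTS are left out.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Builds (but does not send) a text-to-speech request.
/// `voiceID` and `apiKey` are placeholders supplied by the caller.
func makeTTSRequest(text: String, voiceID: String, apiKey: String) throws -> URLRequest {
    guard let url = URL(string: "https://api.elevenlabs.io/v1/text-to-speech/\(voiceID)") else {
        throw URLError(.badURL)
    }
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue(apiKey, forHTTPHeaderField: "xi-api-key")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("audio/mpeg", forHTTPHeaderField: "Accept")  // ask for MP3 audio
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "text": text,
        "model_id": "eleven_flash_v2_5",
    ])
    return request
}
```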

SpeechPriorityManager

Coordinates two audio streams that would otherwise talk over each other:

  • High priority — VLM navigation guidance
  • Low priority — person detection alerts

If TTS is idle, any request plays immediately. If it's busy with a lower-priority message, the higher-priority one queues and plays next. Same- or lower-priority messages get dropped. This prevents "person on the left" from interrupting "turn right at the next intersection."
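
The arbitration rules above fit in a small state machine. The sketch below is one way to model them, not the manager's actual implementation: the type names are hypothetical, and it assumes at most one queued message, with a newer higher-priority message replacing an older queued one.

```swift
import Foundation

enum SpeechPriority: Int, Comparable {
    case low = 0   // person detection alerts
    case high = 1  // VLM navigation guidance

    static func < (lhs: SpeechPriority, rhs: SpeechPriority) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

struct SpeechMessage: Equatable {
    let text: String
    let priority: SpeechPriority
}

enum SpeechDecision: Equatable { case playNow, queued, dropped }

/// Hypothetical arbiter: tracks what is speaking and at most one queued message.
struct SpeechArbiter {
    private(set) var current: SpeechMessage?
    private(set) var pending: SpeechMessage?

    mutating func submit(_ message: SpeechMessage) -> SpeechDecision {
        guard let speaking = current else {
            current = message            // idle: play immediately
            return .playNow
        }
        guard message.priority > speaking.priority else {
            return .dropped              // same or lower priority than what is playing
        }
        pending = message                // higher priority: plays next
        return .queued
    }

    /// Call when playback ends; returns the next message to speak, if any.
    mutating func finishCurrent() -> SpeechMessage? {
        current = pending
        pending = nil
        return current
    }
}
```

Note that the current message is never cut off: a higher-priority request waits for playback to finish, which matches the "queues and plays next" behavior described above.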

LocationManager

Wraps CLLocationManager with published properties for location and heading. Distance filter is 5 meters, heading filter is 5 degrees. Starts both location and heading updates on authorization.

ViewModels

NavigationViewModel

The orchestrator. Manages search, route calculation, step progression, and all the real-time loops:

  • Guidance loop — every 8s, grabs a frame and calls the VLM
  • Person alert loop — every 2s, checks YOLO detections for left/right alerts and large-person stop sounds
  • Step progression — on each location update, checks distance to current step target. Advances when within 30m, triggers a proximity VLM snapshot within 3m.
  • Heading offset — computes great-circle bearing to the next waypoint, normalizes the difference with the user's compass heading, feeds the haptic engine
  • Live distance — recalculates distance to the next turn on every GPS update

Also handles destination search (MKLocalSearch), route calculation (MKDirections), pin drops with reverse geocoding, and full cleanup on navigation end.
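
The heading-offset computation can be illustrated with the standard initial-bearing formula for a great circle. This is a sketch under assumptions: the function names are hypothetical, coordinates are plain degrees rather than CoreLocation types, and the sign convention follows the turn_degrees field (negative = turn left, positive = turn right).

```swift
import Foundation

/// Initial great-circle bearing from point 1 to point 2,
/// in degrees clockwise from north, in the range 0..<360.
func bearing(fromLat lat1: Double, lon lon1: Double,
             toLat lat2: Double, lon lon2: Double) -> Double {
    let phi1 = lat1 * .pi / 180
    let phi2 = lat2 * .pi / 180
    let deltaLambda = (lon2 - lon1) * .pi / 180
    let y = sin(deltaLambda) * cos(phi2)
    let x = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(deltaLambda)
    let degrees = atan2(y, x) * 180 / .pi          // -180...180
    return (degrees + 360).truncatingRemainder(dividingBy: 360)
}

/// Signed difference between the target bearing and the compass heading,
/// normalized to -180..<180. Negative means turn left, positive means turn right.
/// Assumes both inputs are already in 0..<360.
func headingOffset(bearing: Double, heading: Double) -> Double {
    let wrapped = (bearing - heading + 540).truncatingRemainder(dividingBy: 360)
    return wrapped - 180
}
```

The normalization matters near north: a target at 10° with the user facing 350° should read as a 20° right turn, not a 340° left turn.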

StreamSessionViewModel

Manages the video stream from Meta Ray-Ban glasses via the MWDATCamera SDK. Receives frames, stores the latest one for VLM requests, and runs YOLO detection on each frame. Handles camera permissions and stream lifecycle.

WearablesViewModel

Handles Bluetooth connection to Meta Ray-Ban glasses through the MetaDATCore Wearables SDK. Tracks registration state, connected devices, and signal strength (RSSI mapped to 0-3 bars).

TTSViewModel

Manages ElevenLabs TTS synthesis and playback. Handles AVAudioSession configuration, audio route detection (speaker vs. AirPods vs. Bluetooth), and fallback to system AVSpeechSynthesizer.

Cloud VLM Integration

The VLM runs on AMD Instinct MI300X GPUs. The app sends JPEG frames with navigation metadata to h4h.excused.ai, which hosts the model behind a JSON API. The server receives the image, the step instruction, the user's heading offset in degrees, and the live distance to the next turn. It returns a single guidance sentence that accounts for both what the camera sees (obstacles, crosswalks, doors) and where the user needs to go.

The heading offset lets the model give directional instructions like "turn about 45 degrees to your left" instead of generic "turn left." The live distance lets it say "your turn is coming up in 50 feet" rather than a static segment length.

The degree-to-direction mapping on the server:

Degrees     Phrase
< 10°       continue straight ahead
10-30°      slight left / slight right
30-60°      bear left / bear right
60-110°     turn left / turn right
110-160°    sharp left / sharp right
≥ 160°      turn around
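
The mapping above can be written as a single function. The server's language and code are not shown in this README, so the Swift below is only an illustration: the function name is hypothetical, and each band is assumed to include its lower bound and exclude its upper bound, which is consistent with the "< 10°" and "≥ 160°" rows.

```swift
import Foundation

/// Maps a signed heading offset in degrees to a spoken direction phrase.
/// Negative offsets are left turns, positive offsets are right turns.
func directionPhrase(forTurnDegrees degrees: Double) -> String {
    let magnitude = abs(degrees)
    let side = degrees < 0 ? "left" : "right"
    switch magnitude {
    case ..<10:  return "continue straight ahead"
    case ..<30:  return "slight \(side)"
    case ..<60:  return "bear \(side)"
    case ..<110: return "turn \(side)"
    case ..<160: return "sharp \(side)"
    default:     return "turn around"
    }
}
```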

Setup

  1. Open Prism/Prism.xcodeproj in Xcode
  2. Add yolov8_h4h_v2.mlmodelc to the target's Copy Bundle Resources
  3. Build and run on a physical iPhone (CoreML + Bluetooth require a device)
  4. On the Connect tab, pair your Meta Ray-Ban glasses
  5. Switch to Navigate, search for a destination, and start navigation

Requirements

  • iOS 18+
  • Meta Ray-Ban smart glasses (for live video)
  • ElevenLabs API key (for TTS)
  • Network access (for VLM cloud endpoint)
