Assisted navigation for the visually impaired, powered by Meta Ray-Ban smart glasses, real-time person detection, and cloud-based vision-language guidance.
Prism streams video from Meta Ray-Ban glasses, runs on-device person detection to warn users of nearby pedestrians, and sends frames to a cloud VLM for contextual navigation guidance — all spoken aloud through ElevenLabs TTS with priority-based audio coordination. Haptic feedback pulses faster as users turn toward their next waypoint, giving them a non-visual sense of direction.
The app connects to Meta Ray-Ban glasses over Bluetooth, pulls a live video stream, and processes each frame two ways:
- On-device YOLO inference detects people and triggers spoken alerts ("person on the left") or a stop sound when someone fills most of the frame.
- Cloud VLM requests send a snapshot plus navigation context (current instruction, heading offset, distance to next turn) to an AMD cloud endpoint. The model returns a short guidance sentence that gets spoken to the user.
Meanwhile, CoreLocation tracks the user's GPS position and compass heading. The app computes the bearing to the next turn, derives a heading offset, and drives a haptic engine that pulses at variable rates — rapid when facing the right direction, slow when off-course, a single firm tap when locked on.
    Meta Ray-Ban Glasses
               │ BLE video stream
               ▼
    ┌──────────────────────┐
    │ StreamSessionVM      │──── frames ──┬──► YOLODetector (on-device)
    │ (MWDATCamera SDK)    │              │        │
    └──────────────────────┘              │        ├─► "Person on the left/right"
                                          │        └─► Stop sound (≥75% of frame)
                                          │
                                          └──► VLMService (cloud)
                                                   │
                                                   └─► Contextual guidance
                                                                │
                                                                ▼
                                                      SpeechPriorityManager
                                                                │
                                                         ┌──────┴──────┐
                                                         │ High: VLM   │
                                                         │ Low: Alerts │
                                                         └──────┬──────┘
                                                                │
                                                                ▼
                                                         ElevenLabs TTS
                                                                │
                                                                ▼
                                                        Speaker / AirPods

    CoreLocation ──► NavigationViewModel ──► HapticGuidanceEngine
    (GPS + heading)  (bearing calc)          (variable pulse rate)
VLMService sends camera frames to the cloud VLM at h4h.excused.ai/navigate/json. Each request includes:
- `image_base64`: JPEG snapshot from the glasses
- `direction`: current step instruction (e.g. "Turn right onto El Camino Real")
- `turn_degrees`: heading offset from the user's current facing to the target bearing. Negative = turn left, positive = turn right.
- `distance`: live distance to the next turn (e.g. "350 ft")
The server runs a vision-language model on AMD cloud GPUs, analyzes the scene in context, and returns a short guidance sentence. Requests fire every 8 seconds during navigation, plus an immediate snapshot when the user is within ~10 feet of a turn.
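As a rough illustration of the request body, the sketch below encodes the four fields listed above with `Codable`. The struct name, the helper function, and the Swift types (a `Double` for `turn_degrees`, a `String` for `distance`) are assumptions made for this example; only the JSON field names come from this README.

```swift
import Foundation

// Hypothetical request body for the /navigate/json endpoint.
// Field names follow the README; the Swift types are assumptions.
struct NavigateRequest: Codable {
    let imageBase64: String   // JPEG snapshot, base64-encoded
    let direction: String     // current step instruction
    let turnDegrees: Double   // negative = turn left, positive = turn right
    let distance: String      // live distance to the next turn, e.g. "350 ft"

    enum CodingKeys: String, CodingKey {
        case imageBase64 = "image_base64"
        case direction
        case turnDegrees = "turn_degrees"
        case distance
    }
}

/// Builds the JSON body from raw JPEG bytes and the current navigation context.
func makeNavigateBody(jpeg: Data, direction: String,
                      turnDegrees: Double, distance: String) throws -> Data {
    let request = NavigateRequest(
        imageBase64: jpeg.base64EncodedString(),
        direction: direction,
        turnDegrees: turnDegrees,
        distance: distance
    )
    return try JSONEncoder().encode(request)
}
```

The resulting `Data` would typically be sent as the body of a POST request with a `Content-Type: application/json` header; the exact transport details are not specified here.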
YOLODetector runs a custom-trained YOLOv8 model (yolov8_h4h_v2) on-device via CoreML and the Vision framework. It processes frames at ~5 FPS, filtered to person detections with ≥45% confidence and a minimum bounding box area of 2%. Detection results drive both the bounding box overlay on the video feed and the person alert system in NavigationViewModel.
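The filtering rules above, together with the stop threshold from the architecture diagram, could be sketched as follows. The `Detection` type, its normalized coordinate layout, and the left/right split at the frame midline are assumptions for illustration, as is reading "≥75% of frame" as bounding box area; the real detector works with Vision observations.

```swift
import Foundation

// Hypothetical detection record. Coordinates are assumed to be normalized
// to 0...1 of the frame, with x measured from the left edge.
struct Detection {
    let label: String
    let confidence: Double
    let x: Double
    let y: Double
    let width: Double
    let height: Double

    var area: Double { width * height }
    var centerX: Double { x + width / 2 }
}

enum PersonAlert: Equatable {
    case stop
    case personLeft
    case personRight
}

/// Keeps person detections that meet the confidence and area thresholds.
func filterPeople(_ detections: [Detection],
                  minConfidence: Double = 0.45,
                  minArea: Double = 0.02) -> [Detection] {
    return detections.filter {
        $0.label == "person" && $0.confidence >= minConfidence && $0.area >= minArea
    }
}

/// Picks an alert for the largest qualifying person, or nil if nobody qualifies.
/// The midline split for left/right is an assumption, not taken from the app.
func alert(for detections: [Detection], stopArea: Double = 0.75) -> PersonAlert? {
    guard let largest = filterPeople(detections).max(by: { $0.area < $1.area }) else {
        return nil
    }
    if largest.area >= stopArea { return .stop }
    return largest.centerX < 0.5 ? .personLeft : .personRight
}
```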
HapticGuidanceEngine maps the user's heading offset to haptic pulse intervals using UIImpactFeedbackGenerator:
| Offset | Interval | Feel |
|---|---|---|
| 0-5° | Single heavy tap | Locked on target |
| 5-15° | 0.1s | Rapid buzzing |
| 15-30° | 0.2s | Fast |
| 30-60° | 0.4s | Moderate |
| 60-90° | 0.7s | Slow |
| 90-150° | 1.2s | Very slow |
| 150-180° | 2.0s | Behind |
Uses a self-rescheduling timer that adjusts on every heading update. Pulses are continuous at every angle so the user always knows the system is active.
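The table above translates directly into a lookup function. This is a minimal sketch, not the engine's actual code: the `HapticCue` type and function name are hypothetical, and treating each range as half-open (lower bound inclusive) is an assumption, since the table does not say which row owns the boundary values.

```swift
import Foundation

enum HapticCue: Equatable {
    case lockedOn                        // single heavy tap
    case pulse(interval: TimeInterval)   // repeating pulse, seconds between taps
}

/// Maps a signed heading offset in degrees to a haptic cue.
/// Only the magnitude matters here; left and right feel the same.
func hapticCue(forOffset offset: Double) -> HapticCue {
    let magnitude = min(abs(offset), 180)
    switch magnitude {
    case ..<5:   return .lockedOn
    case ..<15:  return .pulse(interval: 0.1)
    case ..<30:  return .pulse(interval: 0.2)
    case ..<60:  return .pulse(interval: 0.4)
    case ..<90:  return .pulse(interval: 0.7)
    case ..<150: return .pulse(interval: 1.2)
    default:     return .pulse(interval: 2.0)
    }
}
```

In the app, a timer would presumably re-read this interval on each heading update and fire the feedback generator when it elapses.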
Synthesizes speech through the ElevenLabs API using the "Sarah" voice on the eleven_flash_v2_5 model. Returns MP3 audio data that gets played through AVAudioPlayer. Falls back to iOS system TTS if the API call fails.
SpeechPriorityManager coordinates two audio streams that would otherwise talk over each other:
- High priority — VLM navigation guidance
- Low priority — person detection alerts
If TTS is idle, any request plays immediately. If it is busy with a lower-priority message, a higher-priority request queues and plays next. A request with the same or lower priority than the message being spoken is dropped. This prevents "person on the left" from interrupting "turn right at the next intersection."
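The arbitration rule can be expressed as a small pure function. The type and function names below are hypothetical and the sketch covers only the decision itself; the real manager also has to hold the queued message and start playback when the current one finishes.

```swift
enum SpeechPriority: Int, Comparable {
    case low = 0    // person detection alerts
    case high = 1   // VLM navigation guidance

    static func < (lhs: SpeechPriority, rhs: SpeechPriority) -> Bool {
        return lhs.rawValue < rhs.rawValue
    }
}

enum SpeechDecision: Equatable {
    case playNow     // TTS is idle
    case queueNext   // plays after the current message finishes
    case drop        // same or lower priority than the current message
}

/// Decides what to do with an incoming message.
/// `current` is the priority of the message being spoken, or nil when idle.
func decide(incoming: SpeechPriority, current: SpeechPriority?) -> SpeechDecision {
    guard let current = current else { return .playNow }
    return incoming > current ? .queueNext : .drop
}
```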
Wraps CLLocationManager with published properties for location and heading. Distance filter is 5 meters, heading filter is 5 degrees. Starts both location and heading updates on authorization.
NavigationViewModel is the orchestrator. It manages search, route calculation, step progression, and all the real-time loops:
- Guidance loop — every 8s, grabs a frame and calls the VLM
- Person alert loop — every 2s, checks YOLO detections for left/right alerts and large-person stop sounds
- Step progression — on each location update, checks distance to current step target. Advances when within 30m, triggers a proximity VLM snapshot within 3m.
- Heading offset — computes great-circle bearing to the next waypoint, normalizes the difference with the user's compass heading, feeds the haptic engine
- Live distance — recalculates distance to the next turn on every GPS update
Also handles destination search (MKLocalSearch), route calculation (MKDirections), pin drops with reverse geocoding, and full cleanup on navigation end.
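The heading-offset step in the list above combines a great-circle bearing with a wrap-around normalization. The sketch below uses the standard initial-bearing formula on plain latitude/longitude values in degrees; the function names are hypothetical, and the app itself would take these values from CoreLocation.

```swift
import Foundation

/// Initial great-circle bearing from point 1 to point 2,
/// in degrees clockwise from north, in the range 0..<360.
func bearing(fromLat lat1: Double, lon lon1: Double,
             toLat lat2: Double, lon lon2: Double) -> Double {
    let phi1 = lat1 * Double.pi / 180
    let phi2 = lat2 * Double.pi / 180
    let deltaLambda = (lon2 - lon1) * Double.pi / 180
    let y = sin(deltaLambda) * cos(phi2)
    let x = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(deltaLambda)
    let degrees = atan2(y, x) * 180 / Double.pi
    return (degrees + 360).truncatingRemainder(dividingBy: 360)
}

/// Signed offset from the user's compass heading to the target bearing,
/// normalized to -180..<180. Negative = turn left, positive = turn right.
func headingOffset(heading: Double, targetBearing: Double) -> Double {
    var diff = (targetBearing - heading).truncatingRemainder(dividingBy: 360)
    if diff >= 180 { diff -= 360 }
    if diff < -180 { diff += 360 }
    return diff
}
```

The normalization matters near north: a user facing 350° with a target bearing of 10° should be told to turn 20° right, not 340° left.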
StreamSessionVM manages the video stream from Meta Ray-Ban glasses via the MWDATCamera SDK. It receives frames, stores the latest one for VLM requests, and runs YOLO detection on each frame. It also handles camera permissions and stream lifecycle.
Handles Bluetooth connection to Meta Ray-Ban glasses through the MetaDATCore Wearables SDK. Tracks registration state, connected devices, and signal strength (RSSI mapped to 0-3 bars).
Manages ElevenLabs TTS synthesis and playback. Handles AVAudioSession configuration, audio route detection (speaker vs. AirPods vs. Bluetooth), and fallback to system AVSpeechSynthesizer.
The VLM runs on AMD Instinct MI300X GPUs. The app sends JPEG frames with navigation metadata to h4h.excused.ai, which hosts the model behind a JSON API. The server receives the image, the step instruction, the user's heading offset in degrees, and the live distance to the next turn. It returns a single guidance sentence that accounts for both what the camera sees (obstacles, crosswalks, doors) and where the user needs to go.
The heading offset lets the model give directional instructions like "turn about 45 degrees to your left" instead of a generic "turn left." The live distance lets it say "your turn is coming up in 50 feet" rather than quoting a static segment length.
The degree-to-direction mapping on the server:
| Degrees | Phrase |
|---|---|
| < 10° | continue straight ahead |
| 10-30° | slight left / slight right |
| 30-60° | bear left / bear right |
| 60-110° | turn left / turn right |
| 110-160° | sharp left / sharp right |
| ≥ 160° | turn around |
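For illustration, the mapping table can be written as a single function. The server's implementation language is not stated in this README, so the Swift below is only a sketch of the lookup; the function name is hypothetical, and the half-open ranges (lower bound inclusive) and the sign convention reused from `turn_degrees` are assumptions.

```swift
/// Maps a signed turn angle in degrees to a direction phrase.
/// Negative = left, positive = right.
func directionPhrase(forDegrees degrees: Double) -> String {
    let magnitude = abs(degrees)
    let side = degrees < 0 ? "left" : "right"
    switch magnitude {
    case ..<10:  return "continue straight ahead"
    case ..<30:  return "slight \(side)"
    case ..<60:  return "bear \(side)"
    case ..<110: return "turn \(side)"
    case ..<160: return "sharp \(side)"
    default:     return "turn around"
    }
}
```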
- Open `Prism/Prism.xcodeproj` in Xcode
- Add `yolov8_h4h_v2.mlmodelc` to the target's Copy Bundle Resources
- Build and run on a physical iPhone (CoreML + Bluetooth require a device)
- On the Connect tab, pair your Meta Ray-Ban glasses
- Switch to Navigate, search for a destination, and start navigation
- iOS 18+
- Meta Ray-Ban smart glasses (for live video)
- ElevenLabs API key (for TTS)
- Network access (for VLM cloud endpoint)