SoundScape UW
Inspiration
Navigating the world as a blind or low-vision individual presents daily challenges that most people take for granted. Existing navigation apps focus on turn-by-turn directions but fail to provide real-time awareness of immediate surroundings: crosswalks, obstacles, pedestrian signals, and dynamic environmental changes. I was inspired to create SoundScape, a voice-first mobile assistant that acts as a constant companion, providing continuous spatial awareness and safety guidance through intuitive audio cues.
It's directly inspired by the "buddy system" from elementary school - except smarter and more reliable.
What it does
SoundScape UW is a real-time pedestrian navigation assistant designed specifically for blind and low-vision users. The app continuously analyzes the user's surroundings using their phone's camera and provides:
- Real-time Guidance: Automatic audio alerts about crosswalks, obstacles, pedestrian signals, and navigation hazards
- Voice Commands: Hands-free control with natural language commands like "Where am I?", "Guide me", "What do you see?", and "Stop"
- Location Awareness: On-demand reverse geocoding that announces street names and intersections
- Scene Description: Detailed narration of the current environment when requested
- Adaptive Alerts: Smart debouncing and haptic feedback to avoid overwhelming the user with information
- Safe Mode: Conservative guidance settings for extra caution
The app requires nothing beyond the phone itself, with no additional hardware, and works seamlessly with voice-only interaction for true hands-free operation.
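The voice commands above can be sketched as a small intent matcher that maps a transcribed utterance to an app action. This is a minimal illustration, not the app's actual implementation; the intent names and matching patterns are invented for the example, and the command phrases come from the feature list.

```typescript
// Hypothetical mapping from transcribed speech to app intents.
type Intent = "WHERE_AM_I" | "GUIDE_ME" | "DESCRIBE_SCENE" | "STOP" | "UNKNOWN";

const PATTERNS: Array<[RegExp, Intent]> = [
  [/where am i/i, "WHERE_AM_I"],
  [/guide me/i, "GUIDE_ME"],
  [/what do you see/i, "DESCRIBE_SCENE"],
  [/\bstop\b/i, "STOP"],
];

// Return the first matching intent, or UNKNOWN so the app can
// re-prompt instead of acting on a misheard command.
function parseCommand(transcript: string): Intent {
  const text = transcript.trim();
  for (const [pattern, intent] of PATTERNS) {
    if (pattern.test(text)) return intent;
  }
  return "UNKNOWN";
}
```

Falling back to `UNKNOWN` rather than guessing matters here: for a safety-critical assistant, asking the user to repeat is better than misfiring a command.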
How we built it
Technology Stack:
- React Native + Expo: Cross-platform mobile development framework for rapid prototyping
- TypeScript: Type-safe development for robust code
- Google Gemini AI: Computer vision model for real-time scene analysis and object detection
- Google Cloud Speech-to-Text: Voice command recognition that works in Expo Go
- ElevenLabs API: High-quality, natural-sounding text-to-speech for audio guidance
- Google Maps Geocoding API: Reverse geocoding for location awareness
- expo-camera: Real-time camera frame capture at configurable intervals
- expo-av: Audio recording and playback management
- expo-location: GPS location services
- expo-haptics: Tactile feedback for different alert types
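The reverse-geocoding piece of the stack can be sketched as follows. The request shape follows the Google Maps Geocoding API's reverse-geocoding endpoint (`latlng` plus a `result_type` filter); the preference order and helper names are assumptions for illustration, not the app's actual code.

```typescript
// Minimal shape of a Geocoding API result relevant to announcements.
interface GeocodeResult {
  formatted_address: string;
  types: string[];
}

// Prefer announcing an intersection, then a street address,
// then whatever the API returned first.
function pickAnnouncement(results: GeocodeResult[]): string | null {
  const intersection = results.find((r) => r.types.includes("intersection"));
  const street = results.find((r) => r.types.includes("street_address"));
  return (intersection ?? street ?? results[0])?.formatted_address ?? null;
}

// Hypothetical on-demand "Where am I?" lookup (not executed here).
async function whereAmI(lat: number, lng: number, apiKey: string): Promise<string | null> {
  const url =
    "https://maps.googleapis.com/maps/api/geocode/json" +
    `?latlng=${lat},${lng}&result_type=intersection|street_address&key=${apiKey}`;
  const res = await fetch(url);
  const body = await res.json();
  return pickAnnouncement(body.results ?? []);
}
```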
Architecture:
- Modular service-based architecture separating concerns (camera, audio, location, AI)
- Custom guidance derivation engine that intelligently debounces and prioritizes alerts
- Accessible UI following WCAG guidelines with high contrast and large touch targets
- Minimalist Notion-inspired design with transparent controls for maximum camera visibility
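The core idea of the guidance derivation engine, suppressing repeated alerts within a per-priority cooldown window, can be illustrated with a small sketch. The class name, priority tiers, and cooldown values are invented for the example; the real engine also weighs safety criticality and prior guidance, as described below.

```typescript
type Priority = "critical" | "warning" | "info";

// Assumed cooldowns: safety-critical alerts may repeat sooner.
const COOLDOWN_MS: Record<Priority, number> = {
  critical: 2_000,
  warning: 6_000,
  info: 12_000,
};

class GuidanceDebouncer {
  private lastSpoken = new Map<string, number>();

  // True if this alert should be spoken now; records the timestamp if so.
  shouldSpeak(message: string, priority: Priority, now: number): boolean {
    const last = this.lastSpoken.get(message);
    if (last !== undefined && now - last < COOLDOWN_MS[priority]) {
      return false; // same message announced too recently
    }
    this.lastSpoken.set(message, now);
    return true;
  }
}
```

Keying the cooldown on the message text means a new hazard always gets through immediately, while a crosswalk that stays in frame for ten seconds is announced once rather than on every camera frame.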
Challenges we ran into
iOS Audio/Camera Conflicts: On iOS, our audio session configuration could not keep the camera and microphone active at the same time, causing the app to crash when switching between guidance mode and voice commands. We spent significant time debugging audio session management and implementing proper cleanup routines.
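The cleanup ordering that resolved this can be sketched as a tiny mode switcher: always tear down the old session before configuring the new one. This is a simplified, synchronous illustration; the real expo-av and expo-camera calls (noted in comments) are asynchronous, and the class and method names here are assumptions.

```typescript
type Mode = "idle" | "guidance" | "listening";

class SessionManager {
  mode: Mode = "idle";
  log: string[] = []; // records the order of (placeholder) native calls

  private teardown(): void {
    if (this.mode === "listening") {
      this.log.push("stopRecording");   // e.g. recording.stopAndUnloadAsync()
      this.log.push("resetAudioMode");  // e.g. Audio.setAudioModeAsync({ allowsRecordingIOS: false })
    } else if (this.mode === "guidance") {
      this.log.push("pauseCamera");     // stop the frame-capture loop
    }
    this.mode = "idle";
  }

  switchTo(next: Mode): void {
    if (next === this.mode) return;
    this.teardown(); // release the old session BEFORE configuring the new one
    if (next === "listening") {
      this.log.push("enableRecordingMode"); // Audio.setAudioModeAsync({ allowsRecordingIOS: true })
      this.log.push("startRecording");
    } else if (next === "guidance") {
      this.log.push("resumeCamera");
    }
    this.mode = next;
  }
}
```

Centralizing the transitions in one place, instead of letting each screen touch the audio session, is what made the crashes reproducible and fixable.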
Speech Recognition Accuracy: Google Cloud Speech-to-Text proved inconsistent, often returning empty transcripts even with valid audio. The API is very sensitive to audio levels and background noise, requiring users to speak loudly and clearly.
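Handling those empty transcripts defensively looks roughly like this. The response shape follows the Speech-to-Text v1 `recognize` REST response (which may omit `results` entirely when nothing was recognized); the confidence threshold is an invented example value.

```typescript
interface RecognizeResponse {
  results?: Array<{
    alternatives?: Array<{ transcript?: string; confidence?: number }>;
  }>;
}

// Return the best transcript, or "" so the app can prompt the user
// to repeat instead of acting on silence or low-confidence audio.
function extractTranscript(response: RecognizeResponse, minConfidence = 0.5): string {
  const best = response.results?.[0]?.alternatives?.[0];
  if (!best?.transcript) return ""; // silence or unintelligible audio
  if ((best.confidence ?? 1) < minConfidence) return "";
  return best.transcript.trim();
}
```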
Real-time Performance: Balancing frame capture frequency, AI processing time, and audio playback without overwhelming the user required careful tuning of debounce intervals and capture rates.
Guidance Logic Complexity: Determining when to speak, what to say, and how urgent the information is required building a sophisticated priority system that considers previous guidance, timing, and safety criticality.
Expo Go Limitations: Many native speech recognition libraries require custom development builds, forcing us to implement a cloud-based solution for voice commands to maintain compatibility with Expo Go for rapid testing.
Accomplishments that we're proud of
- Voice-First Design: Successfully implemented a fully hands-free interface that works entirely through voice commands and audio feedback. This is a huge form-factor win for visually impaired users, and I think it also makes the AI a much better "buddy" when it can listen and talk back :) Shoutout to ElevenLabs and Google Cloud Speech-to-Text
- Real-time AI Vision: Achieved smooth real-time scene analysis with Gemini AI providing contextual safety guidance grounded with geospatial data from Google Maps
- Accessible UX: Created a clean, minimalist interface that maximizes camera visibility while remaining fully accessible
- Intelligent Guidance: Built a smart debouncing system that balances information delivery with user experience, avoiding audio overload
- Cross-Platform: Developed a solution that works seamlessly on both iOS and Android with Expo Go
- Location Integration: Implemented precise intersection and street address detection using reverse geocoding
- Audio Quality: Integrated high-quality empathetic TTS using ElevenLabs with optimized voice settings for clarity and naturalness
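The TTS integration mentioned above amounts to turning each guidance phrase into a request against the ElevenLabs text-to-speech endpoint. The sketch below shows the request shape; the model choice and voice-settings values are placeholder assumptions, not our actual tuning.

```typescript
interface TtsRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build (but don't send) an ElevenLabs text-to-speech request.
function buildTtsRequest(text: string, voiceId: string, apiKey: string): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      text,
      model_id: "eleven_turbo_v2", // assumed: a low-latency model suits real-time alerts
      voice_settings: { stability: 0.5, similarity_boost: 0.75 }, // placeholder values
    }),
  };
}
```

For an app like this, latency matters as much as naturalness: a beautifully voiced "crosswalk ahead" is useless if it arrives after the crosswalk.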
What we learned
- The complexity of audio session management on iOS, especially when coordinating camera, microphone, and audio playback
- Computer vision models like Gemini can provide rich, contextual scene understanding beyond simple object detection
- Voice interfaces require careful consideration of timing, feedback, and error handling to feel natural
- Accessibility-first design benefits all users, not just those with disabilities
- Cloud-based speech recognition APIs have significant limitations around audio quality sensitivity
- Real-time mobile AI applications require balancing performance, battery life, and user experience
- The importance of debouncing and rate-limiting in preventing information overload
- TypeScript and modular architecture are essential for maintaining complex React Native applications
What's next for SoundScape UW
Short-term improvements:
- Implement streaming Speech-to-Text for better recognition accuracy and lower latency
- Add offline mode with on-device ML models for areas with poor connectivity
- Integrate with public transit APIs for bus/train arrival information
- Add customizable voice personas and speaking rates
- Implement audio beacons for point-of-interest navigation
Long-term vision:
- Indoor navigation using ARKit/ARCore for shopping malls and buildings
- Community-sourced accessibility data (ramps, elevators, accessible entrances)
- Integration with smart city infrastructure (traffic lights, crossing signals)
- Social features for sharing safe routes and accessibility reviews
- Wearable integration (AirPods spatial audio, Apple Watch haptics)
- Multi-language support for international travelers
- Machine learning personalization based on user preferences and walking patterns
Our ultimate goal is to make SoundScape UW the essential navigation companion for blind and low-vision individuals, providing the confidence and independence to explore the world safely.
Built With
- elevenlabs
- gemini
- google-maps
- google-speech-to-text
- react-native
- typescript
