SoundScape UW
Inspiration
Navigating the world as a blind or low-vision individual presents daily challenges that most people take for granted. Existing navigation apps focus on turn-by-turn directions but fail to provide real-time awareness of immediate surroundings: crosswalks, obstacles, pedestrian signals, and dynamic environmental changes. I was inspired to create SoundScape, a voice-first mobile assistant that acts as a constant companion, providing continuous spatial awareness and safety guidance through intuitive audio cues.
It's directly inspired by the "buddy system" from elementary school - except smarter and more reliable.
What it does
SoundScape UW is a real-time pedestrian navigation assistant designed specifically for blind and low-vision users. The app continuously analyzes the user's surroundings using their phone's camera and provides:
- Real-time Guidance: Automatic audio alerts about crosswalks, obstacles, pedestrian signals, and navigation hazards
- Voice Commands: Hands-free control with natural language commands like "Where am I?", "Guide me", "What do you see?", and "Stop"
- Location Awareness: On-demand reverse geocoding that announces street names and intersections
- Scene Description: Detailed narration of the current environment when requested
- Adaptive Alerts: Smart debouncing and haptic feedback to avoid overwhelming the user with information
- Safe Mode: Conservative guidance settings for extra caution
The app requires nothing beyond the phone itself, with no additional hardware, and works seamlessly with voice-only interaction for true hands-free operation.
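The voice commands above can be sketched as a small intent matcher that maps a transcribed utterance to an app action. This is a minimal illustration, not the app's actual implementation; the intent names and matching patterns are invented for the example, and the command phrases come from the feature list.

```typescript
// Hypothetical mapping from transcribed speech to app intents.
type Intent = "WHERE_AM_I" | "GUIDE_ME" | "DESCRIBE_SCENE" | "STOP" | "UNKNOWN";

const PATTERNS: Array<[RegExp, Intent]> = [
  [/where am i/i, "WHERE_AM_I"],
  [/guide me/i, "GUIDE_ME"],
  [/what do you see/i, "DESCRIBE_SCENE"],
  [/\bstop\b/i, "STOP"],
];

// Return the first matching intent, or UNKNOWN so the app can
// re-prompt instead of acting on a misheard command.
function parseCommand(transcript: string): Intent {
  const text = transcript.trim();
  for (const [pattern, intent] of PATTERNS) {
    if (pattern.test(text)) return intent;
  }
  return "UNKNOWN";
}
```

Falling back to `UNKNOWN` rather than guessing matters here: for a safety-critical assistant, asking the user to repeat is better than misfiring a command.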
How we built it
Technology Stack:
- React Native + Expo: Cross-platform mobile development framework for rapid prototyping
- TypeScript: Type-safe development for robust code
- Google Gemini AI: Computer vision model for real-time scene analysis and object detection
- Google Cloud Speech-to-Text: Voice command recognition that works in Expo Go
- ElevenLabs API: High-quality, natural-sounding text-to-speech for audio guidance
- Google Maps Geocoding API: Reverse geocoding for location awareness
- expo-camera: Real-time camera frame capture at configurable intervals
- expo-av: Audio recording and playback management
- expo-location: GPS location services
- expo-haptics: Tactile feedback for different alert types
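The reverse-geocoding piece of the stack can be sketched as follows. The request shape follows the Google Maps Geocoding API's reverse-geocoding endpoint (`latlng` plus a `result_type` filter); the preference order and helper names are assumptions for illustration, not the app's actual code.

```typescript
// Minimal shape of a Geocoding API result relevant to announcements.
interface GeocodeResult {
  formatted_address: string;
  types: string[];
}

// Prefer announcing an intersection, then a street address,
// then whatever the API returned first.
function pickAnnouncement(results: GeocodeResult[]): string | null {
  const intersection = results.find((r) => r.types.includes("intersection"));
  const street = results.find((r) => r.types.includes("street_address"));
  return (intersection ?? street ?? results[0])?.formatted_address ?? null;
}

// Hypothetical on-demand "Where am I?" lookup (not executed here).
async function whereAmI(lat: number, lng: number, apiKey: string): Promise<string | null> {
  const url =
    "https://maps.googleapis.com/maps/api/geocode/json" +
    `?latlng=${lat},${lng}&result_type=intersection|street_address&key=${apiKey}`;
  const res = await fetch(url);
  const body = await res.json();
  return pickAnnouncement(body.results ?? []);
}
```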
Architecture:
- Modular service-based architecture separating concerns (camera, audio, location, AI)
- Custom guidance derivation engine that intelligently debounces and prioritizes alerts
- Accessible UI following WCAG guidelines with high contrast and large touch targets
- Minimalist Notion-inspired design with transparent controls for maximum camera visibility
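The core idea of the guidance derivation engine, suppressing repeated alerts within a per-priority cooldown window, can be illustrated with a small sketch. The class name, priority tiers, and cooldown values are invented for the example; the real engine also weighs safety criticality and prior guidance, as described below.

```typescript
type Priority = "critical" | "warning" | "info";

// Assumed cooldowns: safety-critical alerts may repeat sooner.
const COOLDOWN_MS: Record<Priority, number> = {
  critical: 2_000,
  warning: 6_000,
  info: 12_000,
};

class GuidanceDebouncer {
  private lastSpoken = new Map<string, number>();

  // True if this alert should be spoken now; records the timestamp if so.
  shouldSpeak(message: string, priority: Priority, now: number): boolean {
    const last = this.lastSpoken.get(message);
    if (last !== undefined && now - last < COOLDOWN_MS[priority]) {
      return false; // same message announced too recently
    }
    this.lastSpoken.set(message, now);
    return true;
  }
}
```

Keying the cooldown on the message text means a new hazard always gets through immediately, while a crosswalk that stays in frame for ten seconds is announced once rather than on every camera frame.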
Challenges we ran into
iOS Audio/Camera Conflicts: On iOS, our audio session configuration could not keep the camera and microphone active at the same time, causing the app to crash when switching between guidance mode and voice commands. We spent significant time debugging audio session management and implementing proper cleanup routines.
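The cleanup ordering that resolved this can be sketched as a tiny mode switcher: always tear down the old session before configuring the new one. This is a simplified, synchronous illustration; the real expo-av and expo-camera calls (noted in comments) are asynchronous, and the class and method names here are assumptions.

```typescript
type Mode = "idle" | "guidance" | "listening";

class SessionManager {
  mode: Mode = "idle";
  log: string[] = []; // records the order of (placeholder) native calls

  private teardown(): void {
    if (this.mode === "listening") {
      this.log.push("stopRecording");   // e.g. recording.stopAndUnloadAsync()
      this.log.push("resetAudioMode");  // e.g. Audio.setAudioModeAsync({ allowsRecordingIOS: false })
    } else if (this.mode === "guidance") {
      this.log.push("pauseCamera");     // stop the frame-capture loop
    }
    this.mode = "idle";
  }

  switchTo(next: Mode): void {
    if (next === this.mode) return;
    this.teardown(); // release the old session BEFORE configuring the new one
    if (next === "listening") {
      this.log.push("enableRecordingMode"); // Audio.setAudioModeAsync({ allowsRecordingIOS: true })
      this.log.push("startRecording");
    } else if (next === "guidance") {
      this.log.push("resumeCamera");
    }
    this.mode = next;
  }
}
```

Centralizing the transitions in one place, instead of letting each screen touch the audio session, is what made the crashes reproducible and fixable.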
Speech Recognition Accuracy: Google Cloud Speech-to-Text proved inconsistent, often returning empty transcripts even with valid audio. The API is very sensitive to audio levels and background noise, requiring users to speak loudly and clearly.
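Handling those empty transcripts defensively looks roughly like this. The response shape follows the Speech-to-Text v1 `recognize` REST response (which may omit `results` entirely when nothing was recognized); the confidence threshold is an invented example value.

```typescript
interface RecognizeResponse {
  results?: Array<{
    alternatives?: Array<{ transcript?: string; confidence?: number }>;
  }>;
}

// Return the best transcript, or "" so the app can prompt the user
// to repeat instead of acting on silence or low-confidence audio.
function extractTranscript(response: RecognizeResponse, minConfidence = 0.5): string {
  const best = response.results?.[0]?.alternatives?.[0];
  if (!best?.transcript) return ""; // silence or unintelligible audio
  if ((best.confidence ?? 1) < minConfidence) return "";
  return best.transcript.trim();
}
```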
Real-time Performance: Balancing frame capture frequency, AI processing time, and audio playback without overwhelming the user required careful tuning of debounce intervals and capture rates.
Guidance Logic Complexity: Determining when to speak, what to say, and how urgent the information is required building a sophisticated priority system that considers previous guidance, timing, and safety criticality.
Expo Go Limitations: Many native speech recognition libraries require custom development builds, forcing us to implement a cloud-based solution for voice commands to maintain compatibility with Expo Go for rapid testing.
Accomplishments that we're proud of
- Voice-First Design: Successfully implemented a fully hands-free interface that works entirely through voice commands and audio feedback. This is a huge form-factor win for visually impaired users, and I think it also makes the AI a much better "buddy" when it can listen and talk back :) Shoutout to ElevenLabs and Google Cloud Speech-to-Text
- Real-time AI Vision: Achieved smooth real-time scene analysis with Gemini AI providing contextual safety guidance grounded with geospatial data from Google Maps
- Accessible UX: Created a clean, minimalist interface that maximizes camera visibility while remaining fully accessible
- Intelligent Guidance: Built a smart debouncing system that balances information delivery with user experience, avoiding audio overload
- Cross-Platform: Developed a solution that works seamlessly on both iOS and Android with Expo Go
- Location Integration: Implemented precise intersection and street address detection using reverse geocoding
- Audio Quality: Integrated high-quality empathetic TTS using ElevenLabs with optimized voice settings for clarity and naturalness
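The TTS integration mentioned above amounts to turning each guidance phrase into a request against the ElevenLabs text-to-speech endpoint. The sketch below shows the request shape; the model choice and voice-settings values are placeholder assumptions, not our actual tuning.

```typescript
interface TtsRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build (but don't send) an ElevenLabs text-to-speech request.
function buildTtsRequest(text: string, voiceId: string, apiKey: string): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      text,
      model_id: "eleven_turbo_v2", // assumed: a low-latency model suits real-time alerts
      voice_settings: { stability: 0.5, similarity_boost: 0.75 }, // placeholder values
    }),
  };
}
```

For an app like this, latency matters as much as naturalness: a beautifully voiced "crosswalk ahead" is useless if it arrives after the crosswalk.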
What we learned
- The complexity of audio session management on iOS, especially when coordinating camera, microphone, and audio playback
- Computer vision models like Gemini can provide rich, contextual scene understanding beyond simple object detection
- Voice interfaces require careful consideration of timing, feedback, and error handling to feel natural
- Accessibility-first design benefits all users, not just those with disabilities
- Cloud-based speech recognition APIs have significant limitations around audio quality sensitivity
- Real-time mobile AI applications require balancing performance, battery life, and user experience
- The importance of debouncing and rate-limiting in preventing information overload
- TypeScript and modular architecture are essential for maintaining complex React Native applications
What's next for SoundScape UW
Short-term improvements:
- Implement streaming Speech-to-Text for better recognition accuracy and lower latency
- Add offline mode with on-device ML models for areas with poor connectivity
- Integrate with public transit APIs for bus/train arrival information
- Add customizable voice personas and speaking rates
- Implement audio beacons for point-of-interest navigation
Long-term vision:
- Indoor navigation using ARKit/ARCore for shopping malls and buildings
- Community-sourced accessibility data (ramps, elevators, accessible entrances)
- Integration with smart city infrastructure (traffic lights, crossing signals)
- Social features for sharing safe routes and accessibility reviews
- Wearable integration (AirPods spatial audio, Apple Watch haptics)
- Multi-language support for international travelers
- Machine learning personalization based on user preferences and walking patterns
Our ultimate goal is to make SoundScape UW the essential navigation companion for blind and low-vision individuals, providing the confidence and independence to explore the world safely.
Built With
- elevenlabs
- gemini
- google-maps
- google-speech-to-text
- react-native
- typescript
