Inspiration
Aria was inspired by the desire to empower blind and visually impaired individuals by using the technology already in their hands: the smartphone. We aimed to eliminate reliance on expensive or bulky hardware and to build an intelligent assistant that extends human capability through real-time environment understanding, communication assistance, and navigation.
What it does
Aria transforms the phone/iPad camera and microphone into an extension of the user’s senses. It continuously scans surroundings to map obstacles and pathways, provides live spoken translation for conversations or signs, and offers voice-guided navigation, all controlled with simple hand gestures to switch modes hands-free.
How we built it
We leveraged Google's Gemini API for real-time scene understanding: it analyzes the camera feed, detects obstacles, and estimates how far each obstacle is from the user. The ElevenLabs API supplies natural, high-quality text-to-speech feedback. MediaPipe powers hand-gesture recognition, letting users switch easily between environment-mapping, translation, and navigation modes. The app itself is built for iOS in Swift using ARKit, which keeps latency low.
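As a rough illustration, the scene-understanding call can be sketched as a plain REST request to Gemini's `generateContent` endpoint. The client shape, model name, and prompt below are assumptions for illustration, not the app's actual implementation:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Hypothetical sketch: a camera frame is JPEG-encoded, base64'd, and
// sent to Gemini alongside a prompt asking for obstacles and rough
// distances. The apiKey is a placeholder.
struct GeminiClient {
    let apiKey: String
    let model = "gemini-1.5-flash"   // assumed model name

    func buildRequest(jpegBase64: String, prompt: String) -> URLRequest {
        var req = URLRequest(url: URL(string:
            "https://generativelanguage.googleapis.com/v1beta/models/\(model):generateContent?key=\(apiKey)")!)
        req.httpMethod = "POST"
        req.setValue("application/json", forHTTPHeaderField: "Content-Type")
        // Gemini's REST body: one content with a text part and an image part.
        let body: [String: Any] = [
            "contents": [[
                "parts": [
                    ["text": prompt],
                    ["inline_data": ["mime_type": "image/jpeg", "data": jpegBase64]]
                ]
            ]]
        ]
        req.httpBody = try? JSONSerialization.data(withJSONObject: body)
        return req
    }
}
```

The response's text part would then be parsed into obstacle descriptions and handed to the speech layer.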
Challenges we ran into
- I ran into major threading issues when the camera tried to update the UI from background threads. Swift's actor isolation was completely new to me, and I spent hours debugging race conditions before finally understanding @MainActor and nonisolated methods.
- ElevenLabs only gives 10,000 free characters per month, so I had to be smart about when to use premium voice. I built a tracking system and prioritized important content, but it was stressful worrying about hitting limits during the demo.
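The threading pattern that resolved the camera bug can be sketched roughly as follows (type and method names here are hypothetical, not the app's real ones): the UI model is isolated to the main actor, while the per-frame callback is `nonisolated` so it can run on the camera's background queue and hop back only to mutate state.

```swift
import Foundation

// Hypothetical sketch of the @MainActor / nonisolated pattern.
@MainActor
final class ObstacleOverlayModel {
    private(set) var statusText = "Scanning…"

    // Main-actor only: safe to drive UI from here.
    func update(status: String) {
        statusText = status
    }

    // Callable from the camera's background queue; hops to the
    // main actor only for the UI mutation.
    nonisolated func handleFrame(obstacleCount: Int) {
        Task { @MainActor in
            self.update(status: "Obstacles ahead: \(obstacleCount)")
        }
    }
}
```

Keeping the hop inside the model means no call site can accidentally touch `statusText` off the main thread.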
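The quota triage described above can be sketched as a small budget tracker (names and the 20% reserve heuristic are illustrative assumptions, not the actual implementation): premium ElevenLabs speech is only used while budget remains, with a slice held back for critical announcements.

```swift
import Foundation

// Hypothetical sketch of a monthly character-budget tracker for the
// ElevenLabs free tier (10,000 characters/month).
struct VoiceBudget {
    let monthlyLimit: Int
    private(set) var used = 0

    init(monthlyLimit: Int = 10_000) { self.monthlyLimit = monthlyLimit }

    var remaining: Int { max(monthlyLimit - used, 0) }

    /// Returns true if the phrase should use the premium voice;
    /// otherwise the caller falls back to the system voice.
    mutating func shouldUsePremium(for phrase: String, critical: Bool) -> Bool {
        let cost = phrase.count
        // Reserve the last 20% of the budget for critical content
        // (obstacle warnings, navigation turns).
        let reserve = monthlyLimit / 5
        let allowed = critical ? remaining >= cost
                               : remaining - reserve >= cost
        if allowed { used += cost }
        return allowed
    }
}
```

Routine narration burns through the unreserved budget first, so a demo can't be silenced mid-warning by a chatty translation session.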
Accomplishments that we're proud of
- I created a completely gesture-controlled interface that works without ever touching the screen. The mode-locking system means users can freely move around while staying in their chosen mode. That's accessibility done right.
- I successfully integrated a multimodal system (Gemini Vision, ElevenLabs, and Google Maps) into a cohesive app in under 20 hours.
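The mode-locking idea above can be sketched as a tiny state machine (all names and the lock duration are hypothetical): a recognized gesture switches modes, then the mode locks briefly so incidental hand movement while walking can't flip modes by accident.

```swift
import Foundation

// Hypothetical sketch of the gesture-driven mode lock.
enum AriaMode { case environment, translation, navigation }

struct ModeLock {
    private(set) var mode: AriaMode = .environment
    private(set) var lockedUntil: Date = .distantPast
    let lockDuration: TimeInterval

    init(lockDuration: TimeInterval = 3.0) { self.lockDuration = lockDuration }

    /// Attempts a gesture-driven switch; ignored while the lock is active.
    mutating func request(_ newMode: AriaMode, at now: Date = Date()) -> Bool {
        guard now >= lockedUntil, newMode != mode else { return false }
        mode = newMode
        lockedUntil = now.addingTimeInterval(lockDuration)
        return true
    }
}
```

Because rejected requests are silent no-ops, the user can gesture freely without the app announcing spurious mode changes.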
What we learned
- I had no idea Apple's built-in hand detection was this good. Running gesture recognition entirely on-device meant near-zero latency, and it works offline, way better than trying to build my own ML model.
- Swift's modern concurrency was a steep learning curve, but understanding actors, @MainActor, and proper async patterns made my code much cleaner and far less bug-prone.
- Building for blind users forced me to think completely differently. Every feature (haptics, voice announcements, gesture control) had to work without seeing the screen. This mindset shift was invaluable.
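To illustrate why no custom ML model was needed: on-device hand tracking already yields normalized landmark coordinates, so a gesture can be classified with plain geometry. The landmarks, thresholds, and gesture names below are illustrative assumptions, not the app's actual rules:

```swift
import Foundation

// Hypothetical sketch: classify a gesture from two fingertip
// landmarks in normalized (0–1) image coordinates.
struct Landmark { let x: Double; let y: Double }

enum Gesture { case pinch, open, unknown }

func classify(thumbTip: Landmark, indexTip: Landmark) -> Gesture {
    // Euclidean distance between thumb tip and index tip.
    let d = hypot(thumbTip.x - indexTip.x, thumbTip.y - indexTip.y)
    switch d {
    case ..<0.05: return .pinch   // fingertips touching
    case 0.25...: return .open    // hand spread wide
    default:      return .unknown
    }
}
```

A few threshold comparisons per frame is effectively free, which is what makes the zero-latency, offline behavior possible.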
What's next for Aria
- I want to combine Google Maps directions with continuous Gemini obstacle scanning, so users get both "turn left in 500 feet" AND "watch for stairs ahead."
- Add speech-to-text so users can say "Navigate to Starbucks" without touching anything. Completely hands-free, eyes-free experience.
- Let Aria "remember" frequently visited places like home or office, providing better guidance in familiar spaces even offline.
Built With
- ar
- elevenlabs
- gemini
- google-directions
- google-maps
- swift
- swiftui

