Inspiration:

The vision for Iris came from something close to home: the grandfather of one of our team members is visually impaired, and navigating everyday spaces has always been a challenge for him. He’s one of 50 million people facing these difficulties daily. Motivated to give the blind the chance to experience their surroundings, we set out to develop a product that makes navigation easier and safer for visually impaired individuals.

What it does:

Iris is an application paired with a wearable phone mount that helps visually impaired users navigate safely. It detects potential hazards and informational signs in real time through a live video stream and provides audio feedback to guide users through their environment. Users automatically receive notifications about hazards and informational signs, and they can also ask the application questions at any time to learn more about their surroundings.

How we built it:

The frontend of Iris is a Swift application that captures a live video stream and runs an on-device object-detection model to flag potential hazards. For efficient data transmission, the app streams frames to the backend over a WebSocket connection. The backend pairs Anthropic’s Claude model with vector embeddings and LangChain to keep its responses grounded and consistent. To achieve low end-to-end latency while maintaining contextual reasoning, we focused on two key optimizations. First, we improved response quality through prompt engineering and enforced a structured response schema to ensure accurate hazard detection and scene interpretation. Second, we improved responsiveness with asynchronous I/O and thread pooling to manage concurrent tasks efficiently.
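The asynchronous I/O and thread-pooling approach described above can be sketched roughly as follows. This is a minimal illustration, not our actual backend: `analyze_frame` is a hypothetical stand-in for the call that forwards a frame to Claude, and the frame handling is simplified to plain byte strings.

```python
import asyncio
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the vision/LLM call; in Iris this step would
# forward the frame to the model and return structured hazard info.
def analyze_frame(frame: bytes) -> dict:
    return {"hazard": len(frame) > 0, "detail": f"{len(frame)} bytes received"}

pool = ThreadPoolExecutor(max_workers=4)

async def handle_frames(frames):
    loop = asyncio.get_running_loop()
    # Offload blocking inference to the thread pool so the event loop
    # stays free to accept new WebSocket frames as they arrive.
    tasks = [loop.run_in_executor(pool, analyze_frame, f) for f in frames]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(handle_frames([b"frame-1", b"frame-2"]))
    print(json.dumps(results))
```

The key design point is that per-frame work runs off the event loop, so slow inference on one frame never blocks ingestion of the next.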

Challenges we ran into:

One of the initial challenges we faced was setting up the live video stream in our Swift application and ensuring communication with our backend server. To resolve this, we added several entries to the app's Information Property List (Info.plist) in Xcode, including:

  • Privacy - Camera Usage Description: the prompt shown when requesting permission to access the device's camera.
  • App Transport Security: configures the network-security rules under which the app may communicate with our server.
  • Application supports iTunes file sharing: lets users access images captured by the app through its Files section.
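In Info.plist source form, the three entries above correspond to keys along these lines. The usage-description string is illustrative, and `NSAllowsArbitraryLoads` is a broad exception suited to hackathon prototyping rather than production:

```xml
<key>NSCameraUsageDescription</key>
<string>Iris needs camera access to detect hazards in your surroundings.</string>
<key>NSAppTransportSecurity</key>
<dict>
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>
<key>UIFileSharingEnabled</key>
<true/>
```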

Another frontend challenge was integrating a custom YOLOv11 model into our Swift application. Initially, we used a MobileNetV2 model, but it struggled with object recognition. We then integrated a working YOLOv3 model, followed by a custom-trained YOLOv11 model, which significantly improved object detection and recognition performance.

On the backend, a key challenge was minimizing end-to-end latency so that users received timely audio guidance for navigation. We addressed this with asynchronous I/O and thread pooling, allowing us to manage concurrent tasks efficiently. We also struggled to craft prompts that reliably detected hazardous objects. We tackled this through prompt engineering, experimenting with various chain-of-thought prompting techniques, and refining our response schema to improve accuracy and reliability.
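Enforcing a response schema can be sketched as a small validation layer on the model's reply. This is an assumption-laden illustration: the field names below are placeholders, not the exact schema Iris enforces.

```python
import json

# Illustrative schema; these field names are assumptions for the sketch,
# not the exact fields Iris requires from the model.
REQUIRED_FIELDS = {"hazard_detected": bool, "object": str, "guidance": str}

def parse_response(raw: str) -> dict:
    """Validate a model reply against the response schema; raise on drift."""
    data = json.loads(raw)
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"schema violation on field {field!r}")
    return data

reply = '{"hazard_detected": true, "object": "stairs", "guidance": "Stop: stairs ahead."}'
print(parse_response(reply)["guidance"])
```

Rejecting malformed replies early means the audio layer only ever speaks well-formed guidance, which also makes latency more predictable.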

One minor challenge was creating a realistic wearable that uses a phone's camera. We used the 3D printer at the HIVE to design and produce a custom wearable that holds the phone horizontally (patent pending :) ). We insisted on a phone camera rather than a GoPro or smart glasses because phones are affordable and widely accessible.

Accomplishments that we're proud of:

  • Frontend Accomplishments: Effectively captured live video feed from an external device and integrated a custom-trained YOLOv11 model into the Swift application.
  • Backend Accomplishments: Leveraged the Claude Haiku model with vector embeddings to achieve an optimal balance between vision inference latency and reasoning depth. Additionally, we implemented asynchronous I/O to enhance responsiveness and ensure efficient processing for accurate hazard detection and response alignment.
  • Overall Accomplishment: Developed a cohesive product that seamlessly connects the frontend and backend over WebSocket, along with a realistic wearable that reflects real-world use.

What we learned:

We homed in on how to leverage the Claude model and LangChain effectively, particularly in optimizing inference latency while keeping our responses contextually coherent. Through this process, we recognized the importance of prompt engineering techniques, like response schema enforcement and chain-of-thought prompting, which helped us achieve high accuracy without the need for expensive fine-tuning. We also picked up valuable skills in Swift, Xcode, and WebSocket implementation, all of which played a key role in connecting the frontend and backend smoothly. Together, these experiences culminated in an end-to-end product that takes a live video stream as input and provides audio feedback with navigation guidance for the user.

What's next for Iris:

Moving forward, we plan to introduce personalized feedback to enhance each user’s experience. This will involve adjusting auto notifications and manual user queries according to individual preferences and ensuring that responses resonate with the user’s language style and word choices. Additionally, we aim to integrate haptic feedback to create a more immersive navigation experience. We also intend to boost response accuracy and reduce latency by optimizing token usage and implementing other improvements to the backend.
