Inspiration
Globally, over 285 million people are visually impaired, and 39 million of them are completely blind. These individuals face daily challenges that many of us take for granted: recognizing familiar faces, identifying objects, reading signs, and safely navigating public spaces. An estimated 62% of visually impaired people report difficulty performing basic tasks independently, making this a pressing problem in today's society.
From our own personal experiences, we have grandparents and great-grandparents who are slowly losing their eyesight. Seeing loved ones deal with these frustrations has been eye-opening. Simple activities, like recognizing a friend's face in a crowd or reading a street sign, become major obstacles in their daily lives. Constant reliance on others for help can deeply impact a person's sense of independence and dignity. We wanted to use the power of computing to help solve this huge issue. That is where Visionary comes into play.
What it does
Visionary is a powerful digital assistant designed specifically for visually impaired users. It combines real-time object detection, familiar-face recognition, and text-to-speech to provide essential feedback about the user's surroundings. Users can quickly identify nearby objects, recognize friends or family members from their contacts, and have signs or other text read aloud. The app also features an intuitive interface: shake the phone three times to sync contacts, and press and hold to start video detection, offering a seamless experience designed for ease of use.
How we built it
We built Visionary using a combination of machine learning and computer vision technologies. First, we built an intuitive mobile application in Swift, which gives us access to the user's camera so that we can run our machine learning models on the live feed.
For real-time object detection, we integrated YOLOv8 and TensorFlow, fine-tuning models to accurately recognize objects in the user's environment. We also used the iPhone's LiDAR sensor to measure distances, so the user knows not only which objects are nearby but also how much space separates them from each object. For face recognition, we used FaceNet and OpenCV, allowing the app to distinguish familiar faces from strangers by syncing with the user's contacts. Apple's OCR and natural language processing handle the text pipeline, extracting and interpreting text from the user's surroundings, while Apple's SpeakText technology gives the user smooth, natural-sounding audio feedback.
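The contact-matching step can be sketched in a few lines: FaceNet maps each face to an embedding vector, and a detected face is labeled with the nearest contact whose embedding is similar enough, otherwise treated as a stranger. This is a minimal illustration, not our production code; the threshold value and the `match_contact` helper name are assumptions for the example.

```python
import math

def cosine_similarity(a, b):
    # FaceNet embeddings are compared by angle: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_contact(embedding, contacts, threshold=0.6):
    """Return the best-matching contact name, or None for a stranger.

    `contacts` maps a name to a stored reference embedding
    (synced from the user's address book).
    """
    best_name, best_score = None, threshold
    for name, ref in contacts.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Keeping a similarity threshold (rather than always taking the nearest contact) is what lets the app say "stranger" instead of mislabeling an unknown face.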
Additionally, we fine-tuned our YOLOv8 model on a dataset of 40,000 images containing text, so that it reliably detects text in the first place and knows when to run OCR. Throughout development, we focused on creating a streamlined, intuitive interface to ensure the app was easy to use without visual cues.
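The gating idea above is simple but saves a lot of work: the OCR pass only runs when the fine-tuned detector reports a confident "text" box. A minimal sketch of that check (the label string and confidence cutoff are illustrative assumptions, not the app's actual values):

```python
def should_run_ocr(detections, min_confidence=0.5):
    """Gate the expensive OCR pass on the text detector's output.

    `detections` is a list of (label, confidence) pairs, as a YOLOv8
    postprocessing step might produce; OCR runs only when a
    sufficiently confident "text" box is present in the frame.
    """
    return any(label == "text" and conf >= min_confidence
               for label, conf in detections)
```

For example, a frame containing only a confident "person" box skips OCR entirely, which keeps the per-frame cost low on frames with nothing to read.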
Challenges we ran into
One of the biggest challenges we faced was optimizing the app for real-time performance. Object detection, face recognition, and text processing are all resource-intensive tasks, especially when running on mobile devices. Achieving low latency without sacrificing accuracy required careful fine-tuning of our models and efficient resource management. This led us to intuitive solutions, like adjusting how many frames to skip between runs of the identification models, and experimenting with the model we chose for each task to balance efficiency and accuracy.
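The frame-skipping trick can be sketched as a small scheduler: each heavy task runs only every Nth frame, with cheaper tasks run more often. The class name and the specific intervals below are illustrative assumptions, not the values the app ships with.

```python
class FrameScheduler:
    """Run heavy models only every Nth frame to keep latency low."""

    def __init__(self, intervals):
        self.intervals = intervals  # task name -> run every N frames
        self.frame = 0

    def tasks_for_next_frame(self):
        # Collect the tasks due on this frame, then advance the counter.
        due = [task for task, n in self.intervals.items()
               if self.frame % n == 0]
        self.frame += 1
        return due

# Example: detection every 3rd frame, face recognition every 10th.
scheduler = FrameScheduler({"detect": 3, "face": 10})
```

Tuning these intervals is exactly the accuracy-versus-latency trade-off described above: larger intervals free up compute but make feedback feel less live.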
Another problem we had was distance detection. Research is very scarce in this area, and not many models exist that accurately estimate the distance of an object. This was important, however, since knowing that an object is there is not enough for visually impaired people; they also need to know how far away it is. We ended up using the iPhone's LiDAR sensor for distance detection: LiDAR measures distances by timing reflected laser pulses, giving us per-pixel depth for the scene.
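Combining the two signals is straightforward once both are available: take the detector's bounding box and read the LiDAR depth values inside it. A minimal sketch, assuming the depth map is exposed as a 2D grid of distances in meters (the actual ARKit buffer layout differs); the median makes the estimate robust to stray background pixels inside the box.

```python
import statistics

def object_distance(depth_map, box):
    """Estimate an object's distance from a LiDAR depth map.

    depth_map: 2D list of per-pixel distances in meters.
    box: (x0, y0, x1, y1) bounding box from the object detector.
    """
    x0, y0, x1, y1 = box
    samples = [depth_map[y][x]
               for y in range(y0, y1)
               for x in range(x0, x1)]
    # Median, not mean: one far-away background pixel inside the
    # box should not drag the estimate away from the object.
    return statistics.median(samples)
```

The returned distance is what gets folded into the spoken feedback, e.g. "chair, two meters ahead".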
Additionally, we needed to ensure that the app was user-friendly for visually impaired individuals, which meant rethinking traditional interfaces and creating gesture-based controls that were simple yet powerful. However, this pushed us out of our comfort zone and made us step in the shoes of visually impaired people to see which tasks would be easiest for them. This led us to design a simplistic but accessible UI that is easy to use for anyone with vision disabilities.
Accomplishments that we're proud of
- We’re proud to have developed a product that can truly make a difference in people’s lives. One of our key accomplishments was successfully integrating real-time object detection, face recognition, and text-to-speech into a single, cohesive application. This is incredibly difficult because these are very intensive tasks, but with some creative solutions, we managed to create a product that is accurate while also being efficient.
- We also managed to create an intuitive gesture-based interface, allowing users to control the app effortlessly without the need for visual feedback.
- One of our major accomplishments was fine-tuning the YOLO model for highly accurate text detection. We optimized the model to handle different fonts, angles, and lighting conditions, ensuring that the app could read signs, labels, and documents reliably in real-world scenarios.
- We also integrated LiDAR technology into the app, allowing it to calculate the distance of objects from the user. This feature adds an extra layer of safety, enabling users to navigate more confidently by knowing not only what objects are around them but also how far away they are.
- We also put a lot of effort into designing a sleek, modern website that reflects the functionality and accessibility of Visionary. The site is not only visually appealing but also optimized for ease of navigation, ensuring that users, including those with visual impairments, can quickly understand the product and its benefits.
What we learned
Throughout the development of Visionary, we learned just how crucial it is to build products with accessibility in mind. Creating an app for visually impaired users forced us to challenge conventional design principles and think outside the box. We had actually been planning the app around a really cool-looking UI with complex functionality for different modes and customizations, but we soon realized that our product was for visually impaired people. We ended up erasing our whole whiteboard and started planning with our target audience in mind :). This experience taught us to always consider accessibility and audience.
We also gained valuable experience in optimizing machine learning models for real-time performance and managing the complexities of processing multiple data streams at once: objects, faces, and text. We got the chance to fine-tune our models, implement features like text-to-speech driven by the ML model output, and experiment with ways to optimize our code.
We also learned how to design a sleek, modern website to showcase our product and its functionality. We are very proud of the website's frontend and the UI that we built for it.
What's next for Visionary
Geolocation Integration: We're planning to integrate geolocation services to transform Visionary into a true digital walking stick. With indoor and outdoor navigation features alongside the existing assistance, users will be able to move through unfamiliar spaces with confidence.
Improved Machine Learning Models: Our focus will be on better training and fine-tuning all of our machine learning models. From enhancing the accuracy of object detection to improving familiar face recognition, we’ll leverage more data and refined algorithms to make Visionary even smarter and more reliable.
Using LLMs for Text-to-Speech Filtering: We're also exploring the use of Large Language Models (LLMs) to improve the relevance and clarity of our text-to-speech output. This would let us provide more meaningful, context-aware feedback to users without overwhelming them with unnecessary details.
We plan to use this idea to create our very own startup one day. While we only had 36 hours to implement the digital walking stick, with more time dedicated to implementing new features, we believe Visionary has the power to make life better for every visually impaired person.
Built With
- facenet
- keras
- lidar
- natural-language-processing
- ocr
- opencv
- swift
- tensorflow
- yolov8