Inspiration

There has been a recent rise in multi-modal Large Language Models (LLMs), such as GPT-4, Google Gemini, and Mistral. These tools enable interesting possibilities for general assistance for visually impaired individuals, allowing them to enhance their spatial awareness through human-agent interaction. Given these advancements in Natural Language Processing (NLP) and Computer Vision (CV) and their industrial applications in the tech field, we set out to deliver a verbal assistant that offers users a novel way of understanding their surroundings.

What it does

This project prototypes the design of a potentially wearable device for visually impaired individuals. Users can start a conversation with the assistant and ask it about their surroundings, which the assistant perceives through a camera. The assistant describes the overall scene or specific parts of it, depending on the user's question.

The user can keep asking questions throughout an ongoing conversation, such as requesting descriptions of specific objects or nearby obstacles, since the assistant keeps the context of the conversation.
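One way to keep conversational context, sketched below, is to accumulate every user question and assistant answer in a running message history that is sent with each new model request. This is a minimal illustration; the class and prompt wording are assumptions, not our exact implementation.

```python
# Illustrative sketch: conversation context as a running message history.
# The system prompt and class name are hypothetical.
SYSTEM_PROMPT = (
    "You are a verbal assistant describing the camera view "
    "for a visually impaired user."
)

class Conversation:
    def __init__(self):
        # Every turn is appended here and resent with each model request,
        # so follow-up questions can refer back to earlier answers.
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    def add_user_question(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant_answer(self, text):
        self.messages.append({"role": "assistant", "content": text})

convo = Conversation()
convo.add_user_question("What is in front of me?")
convo.add_assistant_answer("A hallway with a door a few meters ahead.")
convo.add_user_question("Is the door open?")  # follow-up relies on earlier turns
```

Because the full history travels with every request, a question like "Is the door open?" is answerable even though it never mentions the door explicitly.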

How we built it

The frontend is a control panel, built with TypeScript and React, for demonstration purposes. Computation in the backend is powered by Python and Flask, and uses external OpenAI models for speech recognition (Whisper) and multi-modal conversation (GPT-4).
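The backend request flow can be sketched as a small pipeline: transcribe the recorded question, then answer it against the current camera frame. In the sketch below the two external OpenAI calls are stubbed out, and all function names are illustrative assumptions rather than our exact code.

```python
# Minimal sketch of the backend request flow; the Whisper and GPT-4 calls
# are stubbed, and function names are hypothetical.

def transcribe_audio(audio_bytes: bytes) -> str:
    """Real backend: send the recorded audio to Whisper for transcription."""
    return "What is in front of me?"  # stubbed transcript

def describe_scene(question: str, image_b64: str) -> str:
    """Real backend: GPT-4 chat completion with the question and the
    base64-encoded camera frame attached."""
    return f"Scene description for: {question}"  # stubbed answer

def handle_request(audio_bytes: bytes, image_b64: str) -> str:
    # Body of the Flask route: audio in, answer text out.
    question = transcribe_audio(audio_bytes)
    return describe_scene(question, image_b64)

answer = handle_request(b"<audio>", "<frame>")
```

In the real system, `handle_request` sits behind a Flask route and the returned text is spoken back to the user.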

Challenges we ran into

Working with external services in a hackathon is always challenging: we need to understand and familiarize ourselves with each service's usage in a short period of time, without running so many iterations that the API costs drive us into bankruptcy.

Prompt-instructing GPT-4 to output the desired description was inherently challenging, as current understanding of LLMs mostly treats them as black boxes, where improvement comes through trial and error.
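For illustration, the kind of system prompt we iterated on looked along these lines; the exact wording below is an assumption, not our final prompt.

```python
# Illustrative system prompt (hypothetical wording) for steering GPT-4
# toward short, spatially grounded descriptions.
SYSTEM_PROMPT = (
    "You act as the eyes of a visually impaired user. "
    "Describe the attached camera frame concisely, mention obstacles first, "
    "and give distances and directions relative to the user."
)
```

Small wording changes here (for example, asking for obstacles first) noticeably changed the style of the descriptions, which is what made the tuning so trial-and-error driven.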

Another difficult decision was choosing a tool for speech recognition. We had to weigh many aspects, such as accuracy, cost, robustness to noise, and ease of implementation; this ended up being the most time-consuming task of the entire project.

Lastly, given the time constraint, putting all our ideas together was tough and required dynamic adjustments to the project plan and tasks, since we had to evaluate whether each feature was feasible and useful.

Accomplishments that we're proud of

We are proud to have carried our end product from start to finish, regardless of the obstacles we encountered.

What we learned

We learned that the possibilities with a pre-trained AI model, such as ChatGPT, are vast. The idea of building powerful tools to assist individuals requiring accessibility is definitely not new, but the growing inference power of machine learning models acts as a bridge between hardware, like cameras and microphones, and software, like a traditional web backend, enabling previously unseen tools.

If we were able to build something like this within 24 hours, institutions with more time and resources may be able to build even more powerful tools to help more people.

What's next for My A-EyE

There are many exciting avenues of improvement for small, individual systems like this. Hardware-wise, this project's ultimate goal is to become a wearable system, such as a pair of smart sunglasses. Software-wise, features could be expanded, for example keeping historical conversations or tracking object movement. Lastly, more exploration could be done on the machine learning side, such as using other alignment methods like fine-tuning, or trying out smaller multimodal models such as Mistral 7B or Gemini Nano.

Built With

TypeScript, React, Python, Flask, OpenAI Whisper, GPT-4
