Inspiration
Imagine a world where the visually impaired don't just recognize objects but also understand their real-time position and relevance. With Sight, we blend the power of OpenCV, the expressiveness of Cohere's large language models, and the speed of Pinecone's vector database. We're not just creating an assistant; we're bridging the visual gap, ensuring those without sight can interact seamlessly with their environment in real time!
What it does
1) Real-time Object Detection: Identify and point out objects in the camera's view instantly.
2) Descriptive Guidance: Leveraging the capabilities of large language models, Sight provides descriptive, contextually appropriate guidance.
3) Intuitive Feedback: Clear feedback helps users understand their environment and navigate safely.
4) Scalable and Efficient: Pinecone's vector database ensures smooth, efficient data retrieval for a better user experience.
How we built it
Sight was built by integrating an LLM with a camera. The user interface was built using Streamlit, which offers a seamless transition from Python to a front-end UI. Here's a step-by-step breakdown of our process:
1) Image Capturing & Object Detection: Every time a query is posed, the system captures a real-time image using the camera. The image is then processed by the YOLO model to identify the objects present.
2) Positioning Data: After detection, each object's position is determined and categorized as left, right, or center.
3) Interaction with the LLM: The positional data, along with the detected objects, is incorporated into an LLM prompt template. For our LLM we use Cohere's command-nightly model.
4) Integration with the Cohere API: Our system integrates with several Cohere APIs. In particular, we use the embed function and, when necessary, the rerank function, both of which live in our helper.py file.
5) Chat Session Storage: Every chat session is saved to Pinecone's vector database. This becomes an invaluable resource if we decide to fine-tune Sight on frequently posed queries.
6) Memory Management: We use Langchain's memory buffer to retain conversational context across turns, keeping the system responsive and efficient.
7) Vector Dataset Generation: We use Streamlit's session key to build a vector dataset, inserting Cohere embeddings via Pinecone's upsert API.
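Step 2 above (positioning) can be sketched in a few lines of pure Python. The detection format (a label plus an `(x1, y1, x2, y2)` bounding box) and the equal-thirds thresholds are illustrative assumptions, not our exact implementation:

```python
# Hypothetical sketch: classify detected objects as left / center / right
# by where the bounding box's horizontal midpoint falls in the frame.

def position_of(box, frame_width):
    """Classify a bounding box (x1, y1, x2, y2) by its horizontal center."""
    center_x = (box[0] + box[2]) / 2
    if center_x < frame_width / 3:
        return "left"
    if center_x < 2 * frame_width / 3:
        return "center"
    return "right"

def describe_detections(detections, frame_width):
    """Turn (label, box) detections into positional phrases for the LLM prompt."""
    return [f"{label}: {position_of(box, frame_width)}"
            for label, box in detections]

# Example with a 640-pixel-wide frame:
detections = [("person", (50, 100, 150, 400)),   # center x = 100 -> left
              ("chair", (300, 200, 380, 350)),   # center x = 340 -> center
              ("door", (500, 50, 630, 420))]     # center x = 565 -> right
print(describe_detections(detections, 640))      # ['person: left', 'chair: center', 'door: right']
```

In practice the boxes would come from the YOLO model's output rather than hard-coded tuples.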
In essence, Sight is a fusion of advanced machine learning, object detection, and intuitive UI, all working to provide an enriching user experience.
Flow of control: user types a message -> capture a real-time image -> object detection and positional data -> LLM (prompt and answer) -> storage in the vector database
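The prompt-assembly step in the flow above can be sketched as follows. The template wording and function name are hypothetical, and the actual Cohere command-nightly call is omitted:

```python
# Hypothetical sketch: fold positional scene data into the LLM prompt
# before sending it to the language model.

def build_prompt(question, positioned_objects):
    """Combine the user's question with positional scene descriptions."""
    scene = "; ".join(positioned_objects) or "no objects detected"
    return (
        "You are a guide for a visually impaired user.\n"
        f"Objects currently in view: {scene}.\n"
        f"User question: {question}\n"
        "Answer concisely, referencing object positions."
    )

prompt = build_prompt("Is there a chair nearby?",
                      ["person: left", "chair: center", "door: right"])
print(prompt)
```

The returned string would then be passed to the LLM, and the exchange embedded and stored in the vector database.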
Challenges we ran into
Our initial goal was to build Sight into wearable tech with full voice communication. However, we ran into several challenges over the weekend: a faulty SD card made our Jetson Nano unusable, and our team spent many hours trying to fix it with the help of hardware mentors, to no avail. Moreover, a shortage of microphones in the hardware area made it hard for us to use voice for communication.
While it was difficult to work around these challenges, we quickly pivoted to researching solutions, such as using Streamlit for the UI demo of Sight, and focused on making the LLM experience as useful as possible for the user.
Accomplishments that we're proud of
We are proud to have our product demo working and are excited by the potential applications of LLMs combined with computer vision. We are happy that we found workarounds for most of the challenges we faced in this fast-paced environment and learned so much about the new technology we worked with. Finding an interesting idea and the right tools for a product, studying them in the time we had, all while making sure we had fun, networked with sponsors, and enjoyed the workshops and events, made this a memorable weekend for us.
We also had limited resources for working with LLMs and computer vision models. To the best of our knowledge this is a novel approach, and we are glad to have completed the project with great learnings. We hope our product can make the world better, especially for the visually impaired.
What we learned
Over the last 36 hours, we spent significant time learning about the tools that brought Sight to life, in both spaces: computer vision and LLMs. Each team member focused on the tools they were passionate about. Our most significant learnings are listed below:
1) Data extraction and positioning from YOLO to feed to the LLM
2) Cohere embeddings and Cohere APIs for LLM control
3) LangChainAI for building the LLM pipeline and memory management
4) The Streamlit library for the chat front end and localhost server
5) Pinecone's vector database as a vector store for Cohere embeddings
And so much more!
What's next for Sight
Sight is just getting started. We aim to build a complete end-to-end tool that helps people without vision navigate on foot. One of our team members has already started working on tracking the user's path so Sight can help them retrace their steps. This is especially helpful if a person wanders into an unexplored area and needs guidance back to their original location.
Aside from helping the visually impaired, Sight can act as an umbrella for multiple other applications:
1) Finding your parked car when leaving the mall
2) Helping kids find their parents when lost at the market
3) Recognizing and naming familiar faces at a gathering or event
4) Reading aloud from books, magazines, or other printed materials