YOLOLLM Team

Inspiration

We decided to work on optical danger detection for people with visual impairments.

We believe that the combination of YOLO, LLMs, and good-quality text-to-speech technology can deliver incredible results and solutions for real-life problems.

What it does

The setup consists of an NVIDIA Jetson Nano with a USB webcam and headphones. A custom Python program runs on the device, doing the following:

  • Continuously captures image frames from the webcam
  • Processes each frame with YOLO
  • If an object is detected, sends the frame to OpenAI Vision to get a simple description of what it contains
  • Reads the returned text aloud using OpenAI Text-to-Speech
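The control flow of that loop can be sketched roughly as follows. This is an illustration, not the actual script: the function names and the dependency-injection structure are our own, chosen so the gating logic (only call the Vision API when YOLO finds something) can be shown without a camera or API keys.

```python
def run_pipeline(capture_frame, detect_objects, describe_frame, speak):
    """Capture -> YOLO -> OpenAI Vision -> TTS loop (illustrative sketch).

    The four callables are injected so the loop can run without hardware:
      capture_frame()        -> next frame, or None when the stream ends
      detect_objects(frame)  -> list of detections (YOLO in the real setup)
      describe_frame(frame)  -> short text description (OpenAI Vision)
      speak(text)            -> play the description (OpenAI TTS)
    Returns the number of frames processed.
    """
    processed = 0
    while True:
        frame = capture_frame()
        if frame is None:
            break
        processed += 1
        detections = detect_objects(frame)
        if not detections:
            # Nothing of interest: skip the expensive API round-trip.
            continue
        speak(describe_frame(frame))
    return processed
```

With stub callables, only the frame that actually contains a detection triggers a spoken description:

```python
frames = iter(["frame1", "frame2"])
spoken = []
run_pipeline(
    capture_frame=lambda: next(frames, None),
    detect_objects=lambda f: ["person"] if f == "frame2" else [],
    describe_frame=lambda f: "a person ahead on the sidewalk",
    speak=spoken.append,
)
```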

How we built it

  • We first discussed one specific use case: a blind pedestrian walking on the sidewalk and the potentially dangerous situations that could come up
  • Based on the provided example script, we added connections to our other services
  • A FastAPI service triggers the description and the TTS creation/playback

Challenges we ran into

  • Jetson Nano + setup issues: General dependency complications with different Python environments and libraries.
  • Latency vs. camera input quality: We probably don't use the board's full potential due to our lack of direct experience with it. The frames captured by the camera are heavily affected by typical factors like lighting, resolution, and compression, which can lead to less precise output in various cases.

Accomplishments that we're proud of

Getting our image-processing pipeline to analyse the visual input and give feedback & warnings in real time!

What we learned

Aside from having to handle all the complications that arose during implementation (and thereby learning how the device works), we learned a lot about how much we can achieve with existing technologies. It is already possible to quickly come up with solutions for everyday problems. Within the 36 hours we managed to create a working PoC of optical danger detection for people with visual impairments.

What's next for YoloLLM

Our little project can definitely be optimized to run more smoothly and react with more urgency to upcoming emergencies. Continuous improvement of the hardware (and optimization of the services used) will make the current setup much smaller and more accessible to everybody in the future. With the addition of dedicated AI processing units, a regular smartphone could soon be used instead of the Jetson Nano.

Built With

fastapi · jetson-nano · openai · python · yolo
