Seeing Eye Camera

Inspiration

Our team wanted to apply our skills in computer vision and AI to hardware that could make a meaningful improvement to someone's life.

What it does

This assistance toolkit offers object detection, face recognition, and movement guidance similar to a walking stick. The user interface consists of natural human language and gestures. The framework is modular, so additional functionality can easily be added for a variety of needs.

How we built it

This project uses several nodes, connected by ROS, that perform a variety of tasks. We have an audio interface powered by speech recognition and a TTS engine, with an optional local LLM for natural language responses. A vision node handles object and face recognition, plus hand tracking to help identify relevant objects. When used with an RGB-D camera like the RealSense, the depth information is combined with some geometric calculations to locate objects in 3D space and to turn hand tracking into a virtual "walking stick" with a range of up to 10 meters, or as little as 5 cm.
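The core of the geometric calculation is deprojection: converting a pixel coordinate plus a depth reading into a 3D point in the camera frame. A minimal sketch using the standard pinhole camera model follows; the intrinsic values (fx, fy, cx, cy) are illustrative placeholders, not the RealSense's actual calibration:

```python
# Sketch: deproject a pixel (u, v) with a depth reading into a
# camera-frame 3D point using pinhole intrinsics. The intrinsic
# defaults below are illustrative, not real calibration values.

def deproject(u, v, depth_m, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Return the (x, y, z) camera-frame position in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# An object detected at pixel (400, 300) with a 2.5 m depth reading:
point = deproject(400, 300, 2.5)
```

In practice an RGB-D SDK provides this conversion with the camera's true intrinsics; the sketch just shows where the depth value enters the math.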

Challenges we ran into

We decided early on to run the code bare metal on the Nvidia Jetson. Since the Jetson is small and light, and uses little enough power to run off a battery, it offered the best balance of performance, reliability, and portability. However, we ran into numerous dependency issues and hard incompatibilities (such as the IMU of the RealSense) that required unintuitive workarounds.

We also decided we wanted everything to run on-device, which limited our options for models, both in terms of local availability and of fitting within hardware constraints. The LLM integration proved the biggest challenge: the rest of the program left so few resources that we could only run the smallest model (Llama, 1B parameters), which produced unreliable output. In our testing, even a 3B-parameter model would have worked great, but we just barely didn't have enough memory to run it alongside the rest of the program.

A reliable and intuitive user interface also proved difficult. For audio processing, we had to balance speed and reliability while also ignoring background noise and speech. We settled on a familiar digital-assistant interface with a wakeup command, and a direct speech-to-intent processing model rather than relying on transcriptions.
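The wake-word gating plus intent matching described above can be sketched as below. The wake word, intent names, and keyword lists are all hypothetical stand-ins for whatever the recognizer actually emits:

```python
# Sketch: ignore everything until the wake word appears, then map the
# utterance to an intent by keyword matching instead of relying on a
# full transcription. All names and phrases here are illustrative.

WAKE_WORD = "camera"

INTENTS = {
    "describe": ("what", "see", "describe"),
    "find": ("find", "where", "locate"),
    "measure": ("far", "distance", "measure"),
}

def match_intent(utterance):
    """Return the matched intent, or None for background speech."""
    words = utterance.lower().split()
    if WAKE_WORD not in words:
        return None  # no wake word: treat as background noise
    for intent, keywords in INTENTS.items():
        if any(k in words for k in keywords):
            return intent
    return "unknown"

print(match_intent("camera how far is the door"))       # "measure"
print(match_intent("just chatting in the background"))  # None
```

Matching intents directly from a small fixed vocabulary keeps latency low and avoids transcription errors on words the system never needs to understand.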

The hand geometry was another challenge. While MediaPipe can produce an accurate hand skeleton, estimating that skeleton's position relative to the camera from a noisy RealSense depth reading was unreliable, and the ray cast by the finger was difficult to stabilize. However, some clever perspective geometry tricks turned this into a useful tool and a highlight of the project.

Accomplishments that we're proud of

Despite all the challenges, we created a product that could assist people today. The virtual "cane" node, using true depth data from an RGB-D camera and reliable MediaPipe hand tracking, is likely the most useful as-is. It not only can help visually impaired people, but can also serve as a general tool in industries that need quick distance and size estimates (such as construction). The object and face recognition and the audio interface also offer a glimpse of what assistive technology may look like in the future.

We are also proud that all processing happens locally on-device. No user data ever leaves the wearable computer, respecting privacy and autonomy. It also means the device does not depend on an internet connection, which would be an unacceptable reliability risk. The system is entirely self-contained and portable.

What we learned

This project taught us a lot about working with vision and language models in the real world, especially under resource constraints. Getting the most value out of them efficiently, rather than simply dumping all available information into a VLM, was both the most challenging and the most rewarding part. We also gained experience building wearable technology, and learned how to balance the comfort and practicality of the setup against camera positioning to get the most use out of it.

What's next for SeePlusPlus_SeeingEyeCamera

We would explore making the user interface more streamlined and intuitive, since this is something a user would interact with frequently and grow comfortable with. While the current interface is usable and great for a demo, it may feel overly verbose to an experienced user. A proper user study would reveal the best balance.

Due to the project's modular nature, continued development would primarily mean building more and better modules for assistive tools. Consulting with people with a variety of needs (even beyond impaired vision) would guide us on how best to assist them. We would also like to try some more powerful models (especially on the language side) to see what this technology is capable of in full.
