Inspiration

We took inspiration partly from self-driving cars, but especially from echolocation. AudiNavi works differently from echolocation, but the spirit is the same.

What it does

AudiNavi uses live video from a phone to give a blind person audible feedback about where walls and objects are around them. In essence, it acts as a white cane that is compact, discreet, and simple to use. With the camera held at waist level, AudiNavi provides live information as stereo/mono audio that contextualizes the locations of all obstacles in the camera's field of view.
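
The exact audio mapping isn't spelled out above, so the following is only one plausible sketch of the stereo idea: pan a short tone left or right according to an obstacle's horizontal position in the frame, and make it louder as the obstacle gets closer. The function name, sample rate, and tone frequency are all illustrative.

```python
import math

def obstacle_to_stereo(x_frac, nearness, freq=440.0, sr=8000, dur=0.1):
    """x_frac in [0, 1]: horizontal position in frame (0 = left edge).
    nearness in [0, 1]: 1 = closest obstacle.
    Returns (left, right) sample lists for a short sine beep."""
    pan = x_frac        # simple linear pan, illustrative only
    gain = nearness     # closer obstacles sound louder
    n = int(sr * dur)
    tone = [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]
    left = [gain * (1 - pan) * s for s in tone]
    right = [gain * pan * s for s in tone]
    return left, right
```

An obstacle on the left edge of the frame would then sound almost entirely in the left ear, which is the kind of spatial cue the stereo output provides.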

How we built it

There are a few main components to the system:

Frontend

  • Node.js:

    • Serves as the foundation for frontend and backend integration, providing the runtime environment for the server side of our web application. It handles server requests and routes data between the user interface and the AI-powered backend.
  • React:

    • A JavaScript library used to build the user interface. It allows users to upload images and receive real-time feedback regarding obstacles in the environment. React’s component-based architecture makes the UI highly modular and dynamic, updating the results on the page as new data is processed by the backend.

Backend

  • Intel AI MiDaS:

    • MiDaS is a deep learning model from Intel's AI Lab that performs monocular depth estimation: it predicts the relative depth of objects in an image from just a single frame. We used MiDaS to generate our depth maps, which we then use to identify the relative distances of objects from the user.
    • We used MiDaS’s DPT-Hybrid model, which balances speed and accuracy, making it an ideal fit for real-time applications.
  • CuPy K-Means Clustering:

    • To group pixels that are similar in both spatial position and depth, we employed K-means clustering, an unsupervised machine learning technique. By leveraging CuPy, which provides GPU-accelerated, NumPy-like array operations, we can perform the clustering efficiently on large inputs.
    • The clusters represent objects that are grouped based on their spatial proximity and depth value, making it easier to identify nearby obstacles that the user should be aware of.
  • Manually Tuned Manipulation of the Distance Map:

    • After depth estimation and clustering, we implemented a post-processing step where the distance map is manually tuned. This step involves several operations, including the following:
    • Centroid Scoring: Objects are weighted based on size, depth, and their position in the frame (e.g., objects lower in the image are given more importance).
    • Convolution Smoothing: To provide more consistent results, we applied convolutional smoothing over the depth map, ensuring that noise and sharp transitions are minimized.
    • Edge Protection: Custom logic ensures that important clusters (objects) near the edges of the screen aren’t excluded or incorrectly diminished.
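
To make the depth-estimation step concrete, here is a minimal sketch of running DPT-Hybrid through the `torch.hub` interface published by intel-isl/MiDaS. The `normalize_depth` helper and the overall wiring are our illustrative choices, not necessarily the exact pipeline AudiNavi uses.

```python
import numpy as np

def normalize_depth(depth):
    """Scale a raw MiDaS inverse-depth map to [0, 1] (1 = nearest)."""
    d = depth - depth.min()
    rng = d.max()
    return d / rng if rng > 0 else np.zeros_like(d)

def estimate_depth(frame_rgb):
    """frame_rgb: (H, W, 3) uint8 RGB frame -> (H, W) normalized depth."""
    import torch  # imported lazily so the helper above works without torch
    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    batch = transforms.dpt_transform(frame_rgb)  # resize + normalize
    with torch.no_grad():
        pred = midas(batch)
        # Upsample the prediction back to the original frame size
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=frame_rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return normalize_depth(pred.cpu().numpy())
```

Because MiDaS outputs relative (inverse) depth, normalizing per frame keeps the "1 = nearest" convention stable for the downstream clustering.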
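
The clustering step can be sketched with CuPy's NumPy-compatible array API (with a NumPy fallback so the sketch also runs on CPU). The feature construction, the choice of `k`, and the naive initialization here are illustrative, not our tuned setup.

```python
try:
    import cupy as xp   # GPU-accelerated, NumPy-like arrays
except ImportError:
    import numpy as xp  # CPU fallback; same API for these operations

def kmeans(features, k, iters=20):
    """Plain Lloyd's algorithm. features: (n, d) float array of
    (x, y, depth) pixel features; returns (centroids, labels)."""
    centroids = features[:k].copy()  # naive init; k-means++ would be better
    labels = xp.zeros(features.shape[0], dtype=int)
    for _ in range(iters):
        # Squared distance from every point to every centroid: (n, k)
        d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if members.shape[0] > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Each resulting cluster corresponds to a region that is coherent in both screen position and depth, which is what lets nearby obstacles be treated as single objects.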
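
The post-processing operations above can be illustrated as follows. This is a hedged sketch: the 3x3 box kernel and the 0.5/0.3/0.2 score weights are placeholders, not the manually tuned values AudiNavi uses.

```python
import numpy as np

def smooth(depth):
    """Convolution smoothing: 3x3 box blur via edge-padded slicing."""
    p = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    return sum(p[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0

def centroid_score(size_frac, nearness, y_frac):
    """Centroid scoring: weight a cluster by nearness (1 = closest),
    vertical position (y_frac, larger = lower in frame), and relative
    size (cluster area / frame area). Weights are placeholders."""
    return 0.5 * nearness + 0.3 * y_frac + 0.2 * size_frac
```

Edge protection would then be a clamp on top of this scoring, so that clusters touching the frame border are not down-weighted just because part of the object lies outside the field of view.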

Challenges we ran into

Getting the program to work consistently was hard in a few cases, especially when the camera was close to a wall. Git was also surprisingly painful: we mismanaged commits and merges at the start, and it came back to bite us (we had to reset the repo and port the files in directly, losing all of the commit history).

Accomplishments that we're proud of

We're proud of getting this project to work at the level we did! It's certainly not perfect, but it's a big step in the right direction. Thinking of different ways to encode spatial information into audio was a surprisingly creative process, and we're excited about the methods we came up with.

What we learned

We learned a lot about various vision libraries (their models and preprocessing functions), GET and POST requests, Git and GitHub, and general web development! We also put a lot of effort into optimizing our code, because the entire pipeline had to run quickly enough to provide timely information.

What's next for AudiNavi

We hope to develop this project further. Computing locally instead of on the cloud would be great. Another goal is to assign specific sounds to specific objects and to distinguish objects at different heights. There are a few cases (for example, being close to a wall) where the behavior is not yet optimal; covering those cases is another priority. We're excited to see where this project goes!
