Inspiration

The world of online meetings and communications has always been on a steep growth curve, and the possibilities and potential requirements seem endless.

We've all sat through meetings plagued by low bandwidth, lag, or coworkers with noisy backgrounds. How great would it be if something could fit like a glove and solve these problems?

How about the times we took music lessons over Zoom (COVID lockdown—remember that?) and our teacher would struggle to manage noise suppression so we could hear them mid-song? The times our teacher would give us a mock exam, speedrunning their written feedback while we bolted ahead on that Bach prelude?

A program that could transcribe lip movements in real time would be revolutionary for these frustrating cases, so we built just that.

What it does

Currently a web-app demo, and intended to eventually become a powerful plugin for any communication platform, Silent Voice reads lips from your camera feed and transcribes them into real text. It can speak back to you and transcribe your "audio" in real time!

Silent Voice also builds upon the Symphonic API by using a classification model to judge how plausible each phrase the API returns actually is. This way, it can account for potential user errors (not enunciating, standing too far back, etc.) as well as errors in the ML processing on the API's server side. This check also strengthens over time, reinforcing the semantic analysis, and its results can serve as training data for future models (both our software and other lip-reading models).

How we built it

We built Silent Voice using a ReactJS + MUI front-end and a Flask backend, with no real need for a database.

We utilized Symphonic's API to process MP4 files and extract text-based meaning from them (lip reading).
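On the backend, this step boils down to posting a recorded clip and reading back a phrase. The sketch below shows the general shape of that call; the endpoint URL, parameter names, and response fields are placeholders, not Symphonic's actual API, which you'd take from its real documentation.

```python
import requests

# Placeholder endpoint -- NOT Symphonic's real URL; substitute the
# actual one from the API docs.
LIPREAD_URL = "https://api.example.com/v1/lipread"

def transcribe_clip(mp4_path, session=requests):
    """POST an MP4 clip to the lip-reading endpoint and return the
    transcribed phrase (assumed response shape: {"text": "..."}).

    `session` is injectable so the call can be stubbed out in tests.
    """
    with open(mp4_path, "rb") as f:
        resp = session.post(LIPREAD_URL, files={"video": f}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("text")
```

The injectable `session` parameter let us test the Flask route without burning API quota.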

We used Cohere's API for the semantic analysis and model building. This reinforced Symphonic's output and provided a more nuanced, precise live transcription.
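The idea of that reinforcement layer can be sketched offline: score each candidate phrase for plausibility and discard garbled ones. In the real app the scoring comes from Cohere; here a toy word-frequency scorer stands in so the flow is runnable without an API key, and the threshold value is an illustrative choice, not a tuned one.

```python
# Toy stand-in for the semantic plausibility scoring we do with Cohere.
COMMON_WORDS = {"the", "a", "is", "can", "you", "hear", "me", "now", "hello"}

def plausibility(phrase):
    """Fraction of words that look like real, common English words."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS for w in words) / len(words)

def best_candidate(candidates, threshold=0.5):
    """Return the most plausible candidate phrase, or None if every
    candidate looks too garbled to show the user."""
    score, phrase = max((plausibility(c), c) for c in candidates)
    return phrase if score >= threshold else None
```

For example, given the candidates `["can you hear me", "kan yu heer mi"]`, the first is kept and the second is rejected as likely misread.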

We also had to use ffmpeg to convert WebM files to MP4 (since Symphonic's API required that format), along with other handy tricks to make the app run as smoothly as it does!
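The conversion step is roughly the following (a minimal sketch: the real app also manages temp files and cleanup, and requires `ffmpeg` to be on the PATH):

```python
import subprocess

def webm_to_mp4_cmd(src, dst):
    """Build the ffmpeg command that re-encodes a WebM recording
    (the format browsers' MediaRecorder produces) into MP4.
    -y overwrites any stale output left over from a previous chunk."""
    return ["ffmpeg", "-y", "-i", src, dst]

def convert(src, dst):
    # Raises CalledProcessError if ffmpeg exits non-zero.
    subprocess.run(webm_to_mp4_cmd(src, dst), check=True, capture_output=True)
```

Keeping the command construction separate from the `subprocess.run` call made it easy to log and debug the exact ffmpeg invocation.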

Challenges we ran into

We had trouble understanding the model Symphonic used and the functionality behind it. Once we figured it out, however, it made the rest of the development process much more straightforward.

Another challenge we ran into was the nuance of the API itself; as a model still under continuous development, it has some works in progress and rough spots. Working around these was a challenge (and it's actually what gave us the idea to layer semantic analysis on top of the API for reinforcement and training!)

Achievements that we're proud of

We are proud of bringing our vision to life as fully as possible! We hope that as this technology develops further, things will become even smoother and it will see wider adoption.

What we learned

We learned a lot about ML and deep learning models, along with the usage of semantic analysis and tokenization. 

We learned a lot about AI in general, too!

What's next for Silent Voice?

We hope to transform this project into a plugin for communication platforms (perhaps another valuable tool alongside noise-cancellation technology!)

Built With

flask · react · mui · cohere · ffmpeg
