Inspiration

AI automates many services throughout society. How about automating your meetings?

Recent advances in AI allow for realistic audio and video generation that can convincingly simulate a real person.

What it does

First, the program joins the meeting and listens to the meeting audio. Once it hears your name being called, it triggers the application to start the "standup".
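The trigger logic amounts to scanning transcribed audio for the user's name. A minimal sketch in pure Python (`TRIGGER_NAME` and the transcript source are placeholders; in our pipeline the text comes from streaming speech recognition):

```python
import re

# Hypothetical trigger name; in practice this is configured per user.
TRIGGER_NAME = "alex"

def heard_name(transcript: str, name: str = TRIGGER_NAME) -> bool:
    """Return True if the name appears as a whole word in the transcript."""
    return re.search(rf"\b{re.escape(name)}\b", transcript.lower()) is not None

def watch(transcripts):
    """Consume transcript snippets until the name is heard, then fire the standup."""
    for snippet in transcripts:
        if heard_name(snippet):
            return "start_standup"   # hand off to the video pipeline
    return "idle"
```

For example, `watch(["any updates?", "Alex, you're up"])` returns `"start_standup"`, while `watch(["nothing yet"])` stays `"idle"`.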

Given a voice sample and an image, the application generates a realistic video of you saying the given text. The speech itself is synthesized with the SV2TTS deep learning framework, which was developed by Google researchers.

Essentially, using a pretrained model, a 5 to 10 second sample of a human voice, and a 10 to 20 word sentence, the AI model generates a .wav file simulating the sampled voice speaking the given sentence.
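The model's output is an ordinary .wav file. As a stand-in for the synthesizer, the sketch below writes a sine wave with Python's wave module just to illustrate the output format; the 16 kHz rate and the tone are assumptions for the example, not SV2TTS internals:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed rate for this sketch

def write_wav(path: str, samples: list[float]) -> None:
    """Write floats in [-1, 1] as a mono 16-bit PCM .wav file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        wf.writeframes(frames)

# Placeholder "speech": one second of a 220 Hz tone standing in for model output.
tone = [0.3 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
write_wav("cloned_voice.wav", tone)
```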

Then, we feed this audio file, together with either an image or a video of the sampled person, into Wav2Lip, which lip-syncs the audio with the video and ultimately outputs a video file to use in the Zoom meeting.
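Wav2Lip is driven through its inference.py script, which takes a face image or video, an audio track, and a pretrained checkpoint. A minimal sketch of assembling that call (the file paths are placeholders, and here we only build the argument list rather than launching the repo):

```python
import subprocess  # the real pipeline would hand the list to subprocess.run

def wav2lip_command(face: str, audio: str, outfile: str,
                    checkpoint: str = "checkpoints/wav2lip_gan.pth") -> list[str]:
    """Build the Wav2Lip inference invocation (flags per the Wav2Lip README)."""
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,  # pretrained Wav2Lip weights
        "--face", face,                   # image or video of the person
        "--audio", audio,                 # the synthesized .wav from SV2TTS
        "--outfile", outfile,             # lip-synced video for the meeting
    ]

cmd = wav2lip_command("me.jpg", "cloned_voice.wav", "standup.mp4")
# In the real pipeline: subprocess.run(cmd, cwd="Wav2Lip", check=True)
```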

Challenges we ran into

Integrating many different machine learning models, audio frameworks, and Python libraries took a lot of time, and we ran into many issues along the way.

It also took a lot of trial and error to send video and audio from our program into the meeting software.

Accomplishments that we're proud of

Making everything fast enough to run in real time. We are happy that we got this to work; we might even use it in some of our meetings.

Google Cloud

Our project would not have been possible without the excellent Google Cloud APIs. Specifically, we used the Speech-to-Text API to listen for your name as the prompt that starts the fake standup video stream. Because this is a real-time application, the speed and reliability of this API were crucial for the product.
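For reference, the recognition settings amount to something like the following, shown as a plain dict mirroring the API's RecognitionConfig fields; the real call builds these objects with the google-cloud-speech client library, and the name is a placeholder:

```python
# Hedged sketch: the fields we'd set on a streaming recognition request.
recognition_config = {
    "encoding": "LINEAR16",        # raw 16-bit PCM from the meeting audio
    "sample_rate_hertz": 16000,
    "language_code": "en-US",
    # Phrase hints boost recognition of the trigger name (placeholder name).
    "speech_contexts": [{"phrases": ["Alex"]}],
}
streaming_config = {
    "config": recognition_config,
    "interim_results": True,       # react as soon as partial transcripts arrive
}
```

Enabling interim results matters here: waiting for finalized transcripts would add seconds of delay before the standup stream starts.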
