Inspiration

AI automates many services throughout society. How about automating your meetings?

Recent advances in AI allow for realistic audio and video generation that can convincingly simulate a real person.

What it does

First, the program joins the meeting and listens to the meeting audio. Once it hears your name being called, it triggers the application to start the "standup".
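The trigger logic amounts to scanning transcribed audio for the user's name. A minimal sketch in pure Python (`TRIGGER_NAME` and the transcript source are placeholders; in our pipeline the text comes from streaming speech recognition):

```python
import re

# Hypothetical trigger name; in practice this is configured per user.
TRIGGER_NAME = "alex"

def heard_name(transcript: str, name: str = TRIGGER_NAME) -> bool:
    """Return True if the name appears as a whole word in the transcript."""
    return re.search(rf"\b{re.escape(name)}\b", transcript.lower()) is not None

def watch(transcripts):
    """Consume transcript snippets until the name is heard, then fire the standup."""
    for snippet in transcripts:
        if heard_name(snippet):
            return "start_standup"   # hand off to the video pipeline
    return "idle"
```

For example, `watch(["any updates?", "Alex, you're up"])` returns `"start_standup"`, while `watch(["nothing yet"])` stays `"idle"`.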

Given a voice sample and an image, the application generates a realistic video of you saying the given text. The speech itself is synthesized with the SV2TTS deep learning framework, which was developed by Google researchers.

Essentially, using a pretrained model, a 5 to 10 second sample of a human voice, and a 10 to 20 word sentence, the AI model generates a .wav file simulating the sampled voice speaking the given sentence.
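The model's output is an ordinary .wav file. As a stand-in for the synthesizer, the sketch below writes a sine wave with Python's wave module just to illustrate the output format; the 16 kHz rate and the tone are assumptions for the example, not SV2TTS internals:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed rate for this sketch

def write_wav(path: str, samples: list[float]) -> None:
    """Write floats in [-1, 1] as a mono 16-bit PCM .wav file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        wf.writeframes(frames)

# Placeholder "speech": one second of a 220 Hz tone standing in for model output.
tone = [0.3 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
write_wav("cloned_voice.wav", tone)
```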

Then, we feed this audio file, together with either an image or a video of the sampled person, into Wav2Lip, which lip-syncs the audio with the video and ultimately outputs a video file to use in the Zoom meeting.
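Wav2Lip is driven through its inference.py script, which takes a face image or video, an audio track, and a pretrained checkpoint. A minimal sketch of assembling that call (the file paths are placeholders, and here we only build the argument list rather than launching the repo):

```python
import subprocess  # the real pipeline would hand the list to subprocess.run

def wav2lip_command(face: str, audio: str, outfile: str,
                    checkpoint: str = "checkpoints/wav2lip_gan.pth") -> list[str]:
    """Build the Wav2Lip inference invocation (flags per the Wav2Lip README)."""
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,  # pretrained Wav2Lip weights
        "--face", face,                   # image or video of the person
        "--audio", audio,                 # the synthesized .wav from SV2TTS
        "--outfile", outfile,             # lip-synced video for the meeting
    ]

cmd = wav2lip_command("me.jpg", "cloned_voice.wav", "standup.mp4")
# In the real pipeline: subprocess.run(cmd, cwd="Wav2Lip", check=True)
```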

Challenges we ran into

Integrating many different machine learning models, audio frameworks, and Python libraries took a lot of time, and we ran into many issues along the way.

It also took a lot of trial and error to send video and audio from our program into the meeting software.

Accomplishments that we're proud of

Making everything fast enough to run in real time. We are happy that we got this to work; we might even use it in some of our meetings.

Google Cloud

Our project would not have been possible without the excellent Google Cloud APIs. Specifically, we used the Speech-to-Text API to listen for your name as the prompt that starts the fake standup video stream. Because this is a real-time application, the speed and reliability of this API were crucial for the product.
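For reference, the recognition settings amount to something like the following, shown as a plain dict mirroring the API's RecognitionConfig fields; the real call builds these objects with the google-cloud-speech client library, and the name is a placeholder:

```python
# Hedged sketch: the fields we'd set on a streaming recognition request.
recognition_config = {
    "encoding": "LINEAR16",        # raw 16-bit PCM from the meeting audio
    "sample_rate_hertz": 16000,
    "language_code": "en-US",
    # Phrase hints boost recognition of the trigger name (placeholder name).
    "speech_contexts": [{"phrases": ["Alex"]}],
}
streaming_config = {
    "config": recognition_config,
    "interim_results": True,       # react as soon as partial transcripts arrive
}
```

Enabling interim results matters here: waiting for finalized transcripts would add seconds of delay before the standup stream starts.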
