Track: Touch Screen. Changing the way we interact with media
Inspiration
VidMorph is a platform providing a variety of AI-powered video customization tools, including face swapping, voice style transfer, speech synthesis, audio-to-audio translation, and lip re-syncing. We see many potential applications:
- Advertisements: creating personalized advertisements localized to a new demographic and its language and cultural references.
- Films and content creation: altering scenes with VidMorph's speech synthesis and improving the dubbing experience with its audio-to-audio translation and lip syncing. Entire scripts, actors, and languages can also be changed.
- Education: breaking the language barrier by making lectures accessible with audio-to-audio translation, and creating new educational material with different teachers' faces and voices that different student demographics will relate to more closely.
- Public service and news: altering announcements and news segments for different audiences.
VidMorph possesses a strong competitive advantage over other AI video tools, and presents many novel ideas not yet seen in commercial products:
- End-to-end video generation platforms like Sora or PikaLabs do not support the level of control VidMorph offers, nor do they allow creating content derived from existing media. For example, it would not be possible to repurpose an existing advertisement with text-to-video.
- Other AI video tools, such as ElevenLabs, are limited in scope and do not allow customization of existing content.
- VidMorph is the first of its kind to provide an easy-to-use platform for features like audio-to-audio translation, seamless lip syncing, and converting voices to any arbitrary target voice.
What it does
See our video!
How we built it
Our frontend was built with React, and our backend with Express, which sets up the servers; Google Cloud Platform provides the CUDA environment and GPUs used to train our models. After the user uploads the required video clips and photos, we establish a connection to a server spawned from a Virtual Machine hosted on Google Cloud Platform. This GPU-equipped VM is configured with the CUDA environment for training and inference. We run training and inference in real time, and the altered output video is downloaded back to the local machine to be displayed on the React frontend.
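As a rough illustration of that request flow, the sketch below shows what the VM-side service could look like, assuming a simple Flask endpoint (the route, field names, port, and the use of Flask are assumptions for illustration, not our exact setup): it receives the uploaded clip and photo, runs the CUDA-backed pipeline, and returns the altered video for the backend to download.

```python
# Illustrative sketch of the GPU worker on the GCP VM; the endpoint,
# field names, and Flask itself are assumptions, not our exact code.
import shutil
import tempfile
from pathlib import Path

from flask import Flask, request, send_file

app = Flask(__name__)


def run_pipeline(video_path: Path, photo_path: Path, workdir: Path) -> Path:
    # Placeholder for the CUDA-backed training/inference step that
    # produces the altered clip; here it just copies the input so the
    # sketch runs end to end.
    output_path = workdir / "output.mp4"
    shutil.copy(video_path, output_path)
    return output_path


@app.route("/process", methods=["POST"])
def process():
    # The Express backend uploads the source clip and the reference photo.
    workdir = Path(tempfile.mkdtemp())
    video_path = workdir / "input.mp4"
    photo_path = workdir / "face.jpg"
    request.files["video"].save(str(video_path))
    request.files["photo"].save(str(photo_path))

    # Run the models on the GPU and send the altered video back; the
    # backend downloads it and hands it to the React frontend for display.
    output_path = run_pipeline(video_path, photo_path, workdir)
    return send_file(output_path, mimetype="video/mp4")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```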
We created five models (face swap, voice swap, language translation, text-to-speech, and lip sync) to support four features. Each model took significant engineering and is built on transformer-based architectures.
To our knowledge, this is the only platform that offers a full suite of AI-powered video editing and synthesis tools operating on a starting template video. Most existing models tackle video synthesis from the ground up; AdSpace takes a very different approach. Rather than synthesizing video from scratch, it takes images as inputs and alters existing videos.
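As an illustration of that design, the sketch below chains placeholder stages over an existing template video; the function names, signatures, and stage ordering are assumptions for illustration, not our exact code.

```python
# Hypothetical sketch of how the models could be chained on an existing
# template video. Each placeholder stands in for one of the
# transformer-based models running on the GPU VM.
from dataclasses import dataclass
from pathlib import Path


def face_swap(video: Path, photo: Path) -> Path:
    # Placeholder: swap the actor's face with the target face photo.
    return video


def translate_audio(video: Path, target_language: str) -> Path:
    # Placeholder: audio-to-audio translation of the speech track.
    # For script changes, speech synthesis (text-to-speech) would be
    # used here instead.
    return video


def voice_swap(audio: Path, voice_sample: Path) -> Path:
    # Placeholder: convert the new speech into the reference voice.
    return audio


def lip_sync(video: Path, audio: Path) -> Path:
    # Placeholder: re-sync the lips in the video to the new audio track.
    return video


@dataclass
class EditRequest:
    template_video: Path   # existing ad, lecture, or scene to alter
    face_photo: Path       # target face for the face swap
    voice_sample: Path     # reference audio for voice conversion
    target_language: str   # e.g. "es" for audio-to-audio translation


def edit_video(req: EditRequest) -> Path:
    """Run the stages in order; each refines the previous output."""
    video = face_swap(req.template_video, req.face_photo)
    audio = translate_audio(video, req.target_language)
    audio = voice_swap(audio, req.voice_sample)
    return lip_sync(video, audio)
```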
Challenges we ran into
We ran into several issues running training and inference on our local machines. Despite having GPU access, most of us hit CUDA incompatibility issues because of our M1 and M2 processors. We also had difficulty integrating the individual models with one another.
Accomplishments that we're proud of
We are proud of being the first to develop a multi-modal video editing platform that operates on existing video data.
What we learned
We learned a lot about generative AI, LLMs, and video synthesis. We also gained experience in cloud computing, networking, sending requests between servers, and navigating the realm of CUDA.
What's next for AdSpace
There is great potential for this project in the AR/VR space. We envision a future where 3D advertisements fundamentally transform the landscape of digital marketing. We also see promise in turning 2D video into 3D assets in VR and tying them together with novel animation.