Inspiration💡

In the ever-evolving landscape of generative AI, our journey with Audiofy began as a quest to fill a gap in spatial audio generation. Witnessing the remarkable advances in generative audio and music, we envisioned a novel approach: empowering users to transform an image and a text prompt into an immersive Dolby 5.1 surround sound experience. This is our submission for the 'ML+Art' category.

What it does🔎

Audiofy is your creative companion. Provide an image of a scene paired with a text prompt, and watch as we craft a 10-second Dolby 5.1 surround sound clip that harmonizes seamlessly with your input. With a user-friendly interface, you can effortlessly download the final WAV file.

How we built it🔨

Audiofy Machine Learning Model

Our project unfolds in two distinct phases:

Traditional Image Processing

  • Initial preprocessing resizes every image to uniform dimensions so diffusion behaves consistently.
  • A ViT-H model performs image segmentation, and we select the three most prominent objects by area.
  • A ViT-L model then produces a detailed depth map.
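The area-ranking step above can be sketched independently of the segmentation model. Here is a minimal example that picks the three largest masks by pixel count; the toy rectangular masks stand in for the real ViT-H (e.g. Segment Anything-style) output, which is an assumption, not the actual model call:

```python
import numpy as np

def top_segments(masks, k=3):
    """Return the k largest segmentation masks by pixel area.

    `masks` is a list of boolean arrays of shape (H, W), e.g. the
    per-object masks emitted by a ViT-H segmentation model.
    """
    areas = [int(m.sum()) for m in masks]
    order = np.argsort(areas)[::-1][:k]  # indices of largest areas first
    return [masks[i] for i in order]

# Toy example: four square masks of different sizes on a 100x100 canvas.
h = w = 100
masks = []
for size in (10, 40, 25, 5):
    m = np.zeros((h, w), dtype=bool)
    m[:size, :size] = True
    masks.append(m)

largest = top_segments(masks)
print([int(m.sum()) for m in largest])  # → [1600, 625, 100]
```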

Diffusion Models

  • Three diffusion models shape our creative pipeline:
    1. An encoder-only model maps the inputs into latent space.
    2. A text diffusion model built on a transformer VAE handles the textual input.
    3. A CoDi-inspired audio diffuser generates the spatial audio.
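The hand-off between the three models can be illustrated as pure data flow. Everything below is a placeholder sketch: the random projections stand in for the trained networks, and the dimensions and denoising loop are illustrative assumptions, not the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 256  # placeholder latent width

def image_encoder(image):
    # Stand-in for the encoder-only model: project the image into latent space.
    return image.reshape(-1) @ rng.standard_normal((image.size, LATENT))

def text_encoder(tokens):
    # Stand-in for the transformer-VAE text branch.
    return tokens @ rng.standard_normal((tokens.shape[-1], LATENT))

def audio_diffuser(latent, steps=10):
    # Stand-in for the CoDi-style diffuser: start from noise and iteratively
    # pull the sample toward the conditioning latent. A real diffuser would
    # use a learned denoising network at each step.
    audio = rng.standard_normal(16000)  # 1 s at 16 kHz in this toy sketch
    for _ in range(steps):
        audio = 0.9 * audio + 0.1 * np.resize(latent, audio.shape)
    return audio

img = rng.standard_normal((32, 32, 3))  # preprocessed image (toy size)
txt = rng.standard_normal((8, 64))      # token embeddings (toy size)
cond = image_encoder(img) + text_encoder(txt).mean(axis=0)
clip = audio_diffuser(cond)
print(clip.shape)  # (16000,)
```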

Simulation and Output

  • A shoebox room model built with pyroomacoustics places the 3D sound sources at their corresponding coordinates and simulates the resulting sound field.
  • ffmpeg then assembles the simulated channels into a Dolby 5.1 surround sound WAV file, bringing our vision to life.

Audiofy Web App

The Audiofy web app, built with Slidesmart, delivers a visually appealing and interactive experience, using tabs and columns for seamless navigation.

Challenges we ran into⚠️

  • Keeping pace with the fast-moving generative AI landscape proved challenging. We reviewed roughly 20-30 research papers to inform our design decisions and keep the approach novel.
  • The diffusion models struggled with segmented images. To mitigate this, we cropped the segments and fed the original inputs alongside them to prevent erroneous outputs.
  • Managing the diverse input and output spaces across the models was a significant challenge, and making them work together required a range of adaptation techniques.
  • Python offers few spatial audio libraries, so we fell back on ffmpeg to produce the final output, deviating from our initially planned approach.

Accomplishments that we're proud of🥇

  • Successfully integrated machine learning models into a user-friendly UI, ensuring a seamless and intuitive user experience.
  • Fostered a collaborative learning environment within the team. Team members, regardless of their background, exchanged knowledge, with some mastering machine learning techniques and others excelling in UI design.
  • Pioneered a technique to generate spatio-temporal audio from both iconic and non-iconic images, expanding the scope of audio generation possibilities.
  • Introduced guidance to the music generation process using text prompts, adding a layer of creativity and personalization to the generated audio.
  • Successfully simulated and produced a 6-channel surround sound audio, achieving our goal of delivering an immersive auditory experience.

What we learned🧠

  • For most team members, this hackathon marked their first online hackathon experience. Navigating the challenges of remote collaboration and online execution enhanced our adaptability.
  • Despite being second-year students at the University of Toronto, the project provided an opportunity to delve into machine learning techniques and modern models. The diverse learning experiences enriched our skill set.
  • Collective experience and teamwork played a pivotal role in overcoming challenges. The belief in collaborative efforts empowered us to complete the project within the allocated time frame.

What's next for Audiofy💭

  • Conducting extensive experiments to identify the optimal combination of images and text prompts for superior audio generation.
  • Exploring the potential benefits of incorporating newer models, such as CoDi 2, to enhance the generative capabilities of Audiofy.
  • Subjecting the model to rigorous testing to address exceptional and edge cases, ensuring its readiness for potential research applications.
