Inspiration💡
In the ever-evolving landscape of generative AI, our journey with Audiofy began as a quest to bridge a gap in spatial audio generation. Witnessing the rapid advances in generative audio and music, we envisioned a novel approach: letting users transform an image and a text prompt into an immersive Dolby 5.1 surround sound experience. This is our submission for the 'ML+Art' category.
What it does🔎
Audiofy is your creative companion. Provide an image of a scene paired with a text prompt, and watch as we craft a 10-second Dolby 5.1 surround sound clip that harmonizes seamlessly with your input. With a user-friendly interface, you can effortlessly download the final WAV file.
How we built it🔨
Audiofy Machine Learning Model
Our project unfolds in two distinct phases:
Traditional Image Processing
- Initial preprocessing resizes images to uniform dimensions for effective diffusion.
- A ViT-H model performs image segmentation, and we keep the three most prominent objects by pixel area.
- A ViT-L model then produces a detailed depth map.
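The "three most prominent objects" step reduces to ranking the segmenter's binary masks by pixel area. A minimal sketch of that selection (the toy masks and the `top_k_masks` helper are illustrative stand-ins; the real masks come from the ViT-H segmentation model):

```python
import numpy as np

def top_k_masks(masks, k=3):
    """Rank binary segmentation masks by pixel area and keep the k largest."""
    areas = np.array([int(m.sum()) for m in masks])
    order = np.argsort(areas)[::-1][:k]
    return [masks[i] for i in order], areas[order].tolist()

# Toy square masks standing in for real segmentation output.
h, w = 64, 64
masks = []
for side in (10, 30, 20, 5):
    m = np.zeros((h, w), dtype=bool)
    m[:side, :side] = True
    masks.append(m)

top3, areas = top_k_masks(masks, k=3)
print(areas)  # [900, 400, 100]
```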
Diffusion Models
- Three key diffusion models shape our creative pipeline:
  - An encoder-only model transforms inputs into latent space.
  - A text diffusion model leverages a transformer VAE for textual input.
  - A CoDi-inspired audio diffuser generates the spatial audio.
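The models above are large pretrained systems, but the core operation they share, diffusion in a latent space, is compact. As a minimal illustration (standard DDPM forward noising in numpy, not the project's actual code), corrupting a latent vector over a noise schedule looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over T steps (standard DDPM forward process).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of a latent."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal(64)       # a toy 64-dim latent
x_early = q_sample(x0, 10, rng)    # still close to x0
x_late = q_sample(x0, T - 1, rng)  # nearly pure Gaussian noise
```

The reverse process, which the trained diffusers approximate, learns to undo these steps one at a time, conditioned here on the image and text latents.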
Simulation and Output
- A shoebox room simulation built with pyroomacoustics places 3D sound sources at coordinates derived from the scene.
- Using ffmpeg, we mix the simulated channels into a Dolby 5.1 surround sound WAV file, bringing our vision to life.
Audiofy Web App
The Audiofy web app, built with Streamlit, delivers a visually appealing and interactive user experience, complete with tabs and columns for seamless navigation.
Challenges we ran into⚠️
- In the dynamic landscape of generative AI, keeping pace with advancements proved challenging. Extensive research involved reviewing 20-30 research papers to inform our project decisions and maintain novelty.
- Diffusion models struggled with segmented images. To mitigate this, we cropped the images and fed the original inputs alongside them to prevent erroneous outputs.
- Managing diverse input and output spaces across various models posed a significant challenge. Overcoming this required implementing a range of techniques to ensure seamless collaboration.
- Limited options in Python for spatial audio libraries compelled us to utilize ffmpeg to produce the final output, deviating from the initially planned approach.
Accomplishments that we're proud of🥇
- Successfully integrated machine learning models into a user-friendly UI, ensuring a seamless and intuitive user experience.
- Fostered a collaborative learning environment within the team. Team members, regardless of their background, exchanged knowledge, with some mastering machine learning techniques and others excelling in UI design.
- Pioneered a technique to generate spatio-temporal audio from both iconic and non-iconic images, expanding the scope of audio generation possibilities.
- Introduced guidance to the music generation process using text prompts, adding a layer of creativity and personalization to the generated audio.
- Successfully simulated and produced a 6-channel surround sound audio, achieving our goal of delivering an immersive auditory experience.
What we learned🧠
- For most team members, this hackathon marked their first online hackathon experience. Navigating the challenges of remote collaboration and online execution enhanced our adaptability.
- Despite being second-year students at the University of Toronto, the project provided an opportunity to delve into machine learning techniques and modern models. The diverse learning experiences enriched our skill set.
- Collective experience and teamwork played a pivotal role in overcoming challenges. The belief in collaborative efforts empowered us to complete the project within the allocated time frame.
What's next for Audiofy💭
- Conducting extensive experiments to identify the optimal combination of images and text prompts for superior audio generation.
- Exploring the potential benefits of incorporating newer models, such as CoDi 2, to enhance the generative capabilities of Audiofy.
- Subjecting the model to rigorous testing to address exceptional and edge cases, ensuring its readiness for potential research applications.