Inspiration💡

In the ever-evolving landscape of generative AI, our journey with Audiofy began as a quest to fill a gap in spatial audio generation. Witnessing the remarkable advances in generative audio and music, we envisioned a novel approach: empowering users to transform an image and a text prompt into an immersive Dolby 5.1 surround sound experience. This is our submission for the 'ML+Art' category.

What it does🔎

Audiofy is your creative companion. Provide an image of a scene paired with a text prompt, and watch as we craft a 10-second Dolby 5.1 surround sound clip that harmonizes seamlessly with your input. With a user-friendly interface, you can effortlessly download the final WAV file.

How we built it🔨

Audiofy Machine Learning Model

Our project unfolds in two distinct phases:

Traditional Image Processing

  • Initial preprocessing resizes every image to uniform dimensions so diffusion behaves consistently.
  • A ViT-H model performs image segmentation, and we select the three most prominent objects by area.
  • A ViT-L model then produces a detailed depth map.
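The area-ranking step above can be sketched independently of the segmentation model. Here is a minimal example that picks the three largest masks by pixel count; the toy rectangular masks stand in for the real ViT-H (e.g. Segment Anything-style) output, which is an assumption, not the actual model call:

```python
import numpy as np

def top_segments(masks, k=3):
    """Return the k largest segmentation masks by pixel area.

    `masks` is a list of boolean arrays of shape (H, W), e.g. the
    per-object masks emitted by a ViT-H segmentation model.
    """
    areas = [int(m.sum()) for m in masks]
    order = np.argsort(areas)[::-1][:k]  # indices of largest areas first
    return [masks[i] for i in order]

# Toy example: four square masks of different sizes on a 100x100 canvas.
h = w = 100
masks = []
for size in (10, 40, 25, 5):
    m = np.zeros((h, w), dtype=bool)
    m[:size, :size] = True
    masks.append(m)

largest = top_segments(masks)
print([int(m.sum()) for m in largest])  # → [1600, 625, 100]
```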

Diffusion Models

  • Three diffusion models shape our creative pipeline:
    1. An encoder-only model maps the inputs into latent space.
    2. A text diffusion model built on a transformer VAE handles the textual input.
    3. A CoDi-inspired audio diffuser generates the spatial audio.
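The hand-off between the three models can be illustrated as pure data flow. Everything below is a placeholder sketch: the random projections stand in for the trained networks, and the dimensions and denoising loop are illustrative assumptions, not the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 256  # placeholder latent width

def image_encoder(image):
    # Stand-in for the encoder-only model: project the image into latent space.
    return image.reshape(-1) @ rng.standard_normal((image.size, LATENT))

def text_encoder(tokens):
    # Stand-in for the transformer-VAE text branch.
    return tokens @ rng.standard_normal((tokens.shape[-1], LATENT))

def audio_diffuser(latent, steps=10):
    # Stand-in for the CoDi-style diffuser: start from noise and iteratively
    # pull the sample toward the conditioning latent. A real diffuser would
    # use a learned denoising network at each step.
    audio = rng.standard_normal(16000)  # 1 s at 16 kHz in this toy sketch
    for _ in range(steps):
        audio = 0.9 * audio + 0.1 * np.resize(latent, audio.shape)
    return audio

img = rng.standard_normal((32, 32, 3))  # preprocessed image (toy size)
txt = rng.standard_normal((8, 64))      # token embeddings (toy size)
cond = image_encoder(img) + text_encoder(txt).mean(axis=0)
clip = audio_diffuser(cond)
print(clip.shape)  # (16000,)
```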

Simulation and Output

  • A shoebox room model built with pyroomacoustics places the 3D sound sources at their corresponding coordinates and simulates the resulting sound field.
  • ffmpeg then assembles the simulated channels into a Dolby 5.1 surround sound WAV file, bringing our vision to life.

Audiofy Web App

The Audiofy web app, built with Slidesmart, delivers a visually appealing and interactive experience, using tabs and columns for seamless navigation.

Challenges we ran into⚠️

  • Keeping pace with the fast-moving generative AI landscape proved challenging. We reviewed roughly 20-30 research papers to inform our design decisions and keep the approach novel.
  • The diffusion models struggled with segmented images. To mitigate this, we cropped the segments and fed the original inputs alongside them to prevent erroneous outputs.
  • Managing the diverse input and output spaces across the models was a significant challenge, and making them work together required a range of adaptation techniques.
  • Python offers few spatial audio libraries, so we fell back on ffmpeg to produce the final output, deviating from our initially planned approach.

Accomplishments that we're proud of🥇

  • Successfully integrated machine learning models into a user-friendly UI, ensuring a seamless and intuitive user experience.
  • Fostered a collaborative learning environment within the team. Team members, regardless of their background, exchanged knowledge, with some mastering machine learning techniques and others excelling in UI design.
  • Pioneered a technique to generate spatio-temporal audio from both iconic and non-iconic images, expanding the scope of audio generation possibilities.
  • Introduced guidance to the music generation process using text prompts, adding a layer of creativity and personalization to the generated audio.
  • Successfully simulated and produced a 6-channel surround sound audio, achieving our goal of delivering an immersive auditory experience.

What we learned🧠

  • For most team members, this hackathon marked their first online hackathon experience. Navigating the challenges of remote collaboration and online execution enhanced our adaptability.
  • Despite being second-year students at the University of Toronto, the project provided an opportunity to delve into machine learning techniques and modern models. The diverse learning experiences enriched our skill set.
  • Collective experience and teamwork played a pivotal role in overcoming challenges. The belief in collaborative efforts empowered us to complete the project within the allocated time frame.

What's next for Audiofy💭

  • Conducting extensive experiments to identify the optimal combination of images and text prompts for superior audio generation.
  • Exploring the potential benefits of incorporating newer models, such as CoDi 2, to enhance the generative capabilities of Audiofy.
  • Subjecting the model to rigorous testing to address exceptional and edge cases, ensuring its readiness for potential research applications.
