Gallery:
- Landing page of reverie34
- Uploading an mp3 to reverie34
- Output image generated by Deforum Stable Diffusion for "Fly me to the moon, Let me play among the stars"
- Output image generated by Deforum Stable Diffusion for "Let me see what spring is like, On a, Jupiter and Mars"
- Output image generated by Deforum Stable Diffusion for "In other words, hold my hand"
- Output image generated by Deforum Stable Diffusion for "In other words, baby, kiss me"
Inspiration:
The Twisted Reality track caught our team’s eye from the start, and nothing seemed more “twisted reality” to us than AI-generated art. Systems like DALL-E 2 that produce artwork from user-given text prompts intrigued us, and we wanted to explore the potential of this kind of AI in our project. Since we are all music lovers, we decided to convert song lyrics into text and pass them into a text-to-image AI model to create art based on the lyrics. As we researched the subject, ambition took over: we decided to take the project a step further and create entire music videos by blending together the AI-generated artwork. The name reverie34 came about because our project essentially converts an mp3 to an mp4, using AI to create dream-like (hence “reverie”) visuals to accompany the audio.
What it does:
Our program uses natural language processing and image interpolation AI to convert spoken audio into a video that reflects what was said. Our program can create visuals to go with the lyrics of a song, graphics to accompany a speech, or even illustrations to bring an audiobook or poem to life.
How we built it:
The front end of our web app was built with React, and the back end was developed with FastAPI. In Python, we use Whisper, an OpenAI automatic speech recognition system, to convert spoken audio into segments of text split up by pauses and changes in the tone of the voice. We then pass these text segments into Deforum Stable Diffusion, a text-to-image diffusion model, which generates an image for each segment; image interpolation between these frames produces a video. Finally, we overlay the text on the video, combine it with the original audio, and deliver the result back to the front end, where the user can view and download the final product.
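The segmentation step above can be sketched in plain Python. This is a hypothetical illustration (the function name, the pause threshold, and the sample lyric timings are our own for demonstration); it assumes segments shaped like the dicts Whisper's `transcribe()` returns, with `start`, `end`, and `text` keys, and groups them into one prompt per pause in the audio.

```python
# Hypothetical sketch: group Whisper-style transcript segments into
# prompt chunks, starting a new chunk whenever the pause between
# consecutive segments exceeds a threshold. Each chunk becomes one
# text prompt for the text-to-image model.

def group_segments(segments, pause_threshold=1.0):
    """segments: list of dicts with 'start', 'end', 'text' keys.
    Returns a list of (start_time, prompt_text) pairs."""
    chunks = []
    current_text = []
    current_start = None
    prev_end = None
    for seg in segments:
        # A long silence ends the current chunk.
        if prev_end is not None and seg["start"] - prev_end > pause_threshold:
            chunks.append((current_start, " ".join(current_text)))
            current_text, current_start = [], None
        if current_start is None:
            current_start = seg["start"]
        current_text.append(seg["text"].strip())
        prev_end = seg["end"]
    if current_text:
        chunks.append((current_start, " ".join(current_text)))
    return chunks

# Example with made-up timings: a short gap keeps the first two lines
# together, and a 2.5-second pause starts a new prompt.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Fly me to the moon"},
    {"start": 2.7, "end": 5.0, "text": "let me play among the stars"},
    {"start": 7.5, "end": 10.0, "text": "Let me see what spring is like"},
]
prompts = group_segments(segments)
```

Each `(start_time, prompt_text)` pair then maps to a generated image and a timestamp in the final video, which is one way to keep the visuals in sync with the audio.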
Challenges:
Each of us on the team took on different tasks for this project, including building the front end of the web application, creating the back-end server, and using Python for speech-to-text conversion and image diffusion. Our individual challenges mainly consisted of figuring out how to use the various AI models to do what we needed, particularly combining the speech-to-text and text-to-image-to-video models into an effective speech-to-video generator. Since text-to-video conversion doesn’t quite exist publicly yet, our project delved into very new territory, which made things difficult since we didn’t have many examples or tutorials to guide us. We also ran into a hardware challenge: some of the AI packages we wanted to use for text-to-image generation required more VRAM than our laptops had. Setting up Stable Diffusion in particular was a struggle because of its VRAM requirements. Fortunately, one of our team members was able to make use of the extremely powerful GPU in his PC, so we could run Stable Diffusion and make progress on the project.
Accomplishments:
This project was not easy or simple by any means. When we first suggested the idea, it was so daunting that we almost dismissed it, thinking there was no way we could accomplish it in 36 hours. Nobody in our group had any prior experience working with natural language processing AI systems, let alone text-to-image/video conversion AI; we’d all be breaking new ground if we attempted this project. But in the end, curiosity and ambition won out. We are all extremely proud of how far we’ve come and what we’ve managed to put together in the last day and a half. Tackling this massive challenge and pulling through to create a working application has been one of the best learning experiences our team has ever had.
What we learned:
We’ve all learned quite a bit about the newly developing field of text-to-image/video conversion and some things about natural language processing as well. In addition, we all got to see firsthand how the hardware you use can greatly affect the extent of your coding power. But besides the technical skills we gained, we learned a lot about how to work well in a team and communicate our ideas. Without the high level of effort and communication that each team member put in during this project, we wouldn’t have been able to make what we did.
What’s next for reverie34:
Going into this project, we decided that even if we didn’t finish in 36 hours, we would try to continue developing it since it was a very interesting concept that would be cool to see through. As it turned out, we did end up finishing the project, but there are always improvements that can be made. Future work on reverie34 could include optimizing the processing flow from speech to video and perhaps finding ways to create an even more realistic and accurate visual interpretation of the audio passed into the program.