Inspiration
On YouTube, there are many videos in which people remix speeches of celebrities and politicians to create music. However, most of this is done manually, so we wanted to create an app that would allow anyone to do it.
What it does
- Have something on your mind you would like to say out loud? Enter it!
- Upload a reference tune or recording. Get creative.
- Convert (and maybe grab a coffee).
- Enjoy.
How we built it
Audio to Text
Using an audio file of a speech (or song), we build a dictionary of recordings, one for each unique word said in the speech, with Google's Web Speech API. The dictionary is then compared against the text the user entered. Essentially, the user can rearrange the words of the speech in whatever order they want, changing what the speaker appears to say.
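The dictionary lookup described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: it assumes the speech recognition step has already produced a list of (word, start, end) timings, and the names `build_word_clips` and `remix` are hypothetical. Audio is represented as a plain list of samples to keep the example self-contained.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical shape of the recogniser output: (word, start_sec, end_sec).
WordTiming = Tuple[str, float, float]


def build_word_clips(
    samples: Sequence[float], sample_rate: int, timings: Sequence[WordTiming]
) -> Dict[str, List[float]]:
    """Map each unique word to the first slice of audio in which it was spoken."""
    clips: Dict[str, List[float]] = {}
    for word, start, end in timings:
        key = word.lower()
        if key not in clips:  # keep only the first recording of each word
            first = int(start * sample_rate)
            last = int(end * sample_rate)
            clips[key] = list(samples[first:last])
    return clips


def remix(text: str, clips: Dict[str, List[float]]) -> Tuple[List[float], List[str]]:
    """Concatenate clips in the order of the user's text; report missing words."""
    output: List[float] = []
    missing: List[str] = []
    for word in text.lower().split():
        if word in clips:
            output.extend(clips[word])
        else:
            missing.append(word)
    return output, missing


# Toy data: 3 seconds of "audio" at 10 samples per second.
sample_rate = 10
samples = list(range(30))
timings = [("Hello", 0.0, 1.0), ("world", 1.0, 2.0), ("hello", 2.0, 3.0)]

clips = build_word_clips(samples, sample_rate, timings)
remixed, missing = remix("world hello", clips)
_, not_found = remix("hello there", clips)
```

Words that the speaker never said end up in the `missing` list, which is one way an app could tell the user that part of their text cannot be remixed from the uploaded recording.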
Pitch Manipulation
We used scipy.fft to perform a Fast Fourier Transform of the waveforms in order to identify their dominant frequencies. We then used torch_pitch_shift to shift the waveforms to the desired pitch.
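As a rough sketch of the frequency detection step, the snippet below finds the strongest frequency in a waveform and works out how many semitones it would need to move to reach a target pitch. The function names, the Hann window, and the 220 Hz test tone are assumptions made for illustration; the import falls back to numpy.fft (which offers the same `rfft` and `rfftfreq` functions) if SciPy is not installed. The actual pitch shift with torch_pitch_shift is not shown.

```python
import math

import numpy as np

try:
    from scipy.fft import rfft, rfftfreq
except ImportError:  # fallback with equivalent functions
    from numpy.fft import rfft, rfftfreq


def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) of the largest peak in the magnitude spectrum."""
    windowed = samples * np.hanning(len(samples))  # reduce spectral leakage
    spectrum = np.abs(rfft(windowed))
    freqs = rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[0] = 0.0  # ignore the DC component
    return float(freqs[np.argmax(spectrum)])


def semitone_shift(current_hz: float, target_hz: float) -> float:
    """Number of semitones needed to move from current_hz to target_hz."""
    return 12.0 * math.log2(target_hz / current_hz)


# Illustrative input: one second of a pure 220 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 220.0 * t)

detected = dominant_frequency(tone, sample_rate)
shift = semitone_shift(detected, 440.0)  # one octave up is 12 semitones
```

Real speech has many harmonics, so the largest spectral peak is only an approximation of the perceived pitch; the resulting semitone value is the kind of quantity a pitch-shifting library would take as input.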
Frontend
We utilised Django as well as HTML and CSS to create a simple web app for displaying our project.
Challenges we ran into
There were many challenges encountered throughout this project which reduced its scope. Initially we would have liked to include video with the remixing, but this proved to be too finicky given the time constraints. Apart from that, we had to remove the music tuning approach and turn it into simply a speech remixer as the music tuning approach was too inefficient to run.
Furthermore, relying heavily on Python certainly helped simplify some of our workflows, but connecting it with other interfaces, such as Django and HTML, was more tedious than expected.
Accomplishments that we're proud of
We're proud of using APIs, libraries and technologies that we had never or only rarely used before. Being able to segment a 10-minute audio file in 4 minutes is also a decent feat of engineering.
What we learned
Being decisive is an important skill, and one that we will keep developing over time. It matters for settling on a project and starting development early, for selecting features that are feasible yet intellectually satisfying to implement, and for choosing the right platforms and tools to build with.
What's next for Voice Scrabble
Here’s what’s on our road map:
- Conversion of videos in addition to audio: It’s always funny to see the singer’s or speaker’s voice distorted in action.
- Web interface: A more user-friendly drag and drop upload UI for files, as well as more intuitive loading animations for longer tasks, such as uploading and conversion.
- More efficient segmentation algorithms: Split the videos into smaller frames, so as to allow more syllables to be generated. This is important considering that there are nearly 15k syllables in the English language.