Discribe

Example of summarization and transcription

Inspiration

Discord is used by companies, students, and basically anyone who wants to connect with peers, friends, teachers, or coworkers. It gained massive traction during the pandemic and has become an essential part of our lives. It is so popular that even universities and schools are creating Discord servers to pass on important information to students. Many people use it for hosting podcasts, company meetings, online lectures, and more. As long-time users of Discord, we felt that we could enhance everyone's experience by allowing people to record, save, and get transcriptions of their voice calls. We wanted to leverage the power of AI to do all of this effortlessly. Because the Whisper model from OpenAI supports many languages, it is more tolerant of speakers of other languages and promotes diversity online.

What it does

Discribe is a Discord bot that records, transcribes, and summarizes your Discord voice calls and sends the transcript and summary to your chat whenever you wish.

How we built it

None of us had made a Discord bot before, but we knew of various technologies that could be used to code one. We eventually decided on discord.js since some of us had experience with JavaScript. Discord does not support recording audio in its API, but we found a solution involving streams that allowed us to record. Using ffmpeg, we were able to create .mp3 files from these streams. After that, we experimented with OpenAI's Whisper, an automatic speech recognition system. It was simplest for us to communicate with the AI using Python, so we used that. Once we could convert any .mp3 file into text, we used Co:here's API to create an NLP model that summarizes text. We then combined the two models to create a program that summarizes the .mp3 file. Lastly, we created a system for the JavaScript bot and the Python program to talk to each other while both are running.
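
The transcribe-then-summarize step described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names `summarize_recording` and `whisper_transcribe` are hypothetical, the "base" model size is an assumption, and the summarizer is passed in as a plain function because this write-up does not show which Co:here call was used.

```python
from typing import Callable, Dict


def summarize_recording(
    mp3_path: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Turn one recorded .mp3 into a transcript plus a summary.

    Both steps are injected so that either model can be swapped out
    (or replaced by a stub) without touching the pipeline itself.
    """
    transcript = transcribe(mp3_path).strip()
    if not transcript:
        # Nothing was said (or recognized), so skip the summarizer call.
        return {"transcript": "", "summary": ""}
    return {"transcript": transcript, "summary": summarize(transcript)}


def whisper_transcribe(mp3_path: str, model_name: str = "base") -> str:
    """Transcribe a file with the openai-whisper package.

    The import is lazy so the pipeline above can be used and tested
    on a machine that does not have Whisper installed.
    Assumption: the "base" model size; the write-up does not name one.
    """
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(mp3_path)["text"]
```

In use, something like `summarize_recording("call.mp3", whisper_transcribe, my_summarizer)` would return both pieces of text, where `my_summarizer` stands in for whatever function wraps the summarization API.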

Themes

Discovery: Normally we rely on video recorders and playback to store our calls, but this works poorly because a voice recording always contains filler that we don't want to sit through. If you've ever put a lecture recording on double speed to save yourself the boredom, you will know what we mean. New technologies such as Whisper are becoming so good at converting speech into text that it makes sense to apply them to a situation like this. Combined with Cohere's powerful language processing API, we can filter out anything you don't need to read, so a user's time is saved and a past discussion is easier to recall.

Diversity: Today, humans manually transcribe audio and video content on platforms like Discord, Instagram, and Reddit. These transcriptions are very useful for people who are deaf or hard of hearing. People in countries with limited internet access are also known to rely on transcriptions to consume media, because poor connection speeds cannot load the video content. The functionality our bot provides can improve the lives of these people. A human is not always there to transcribe media, but an AI can always be there.

Challenges we ran into

Getting the individual pieces of the program working took much less time than figuring out how to get them to communicate. We tried many libraries designed to let JavaScript talk to Python, but none worked perfectly. We ended up creating our own solution using functionality built into JavaScript. We also extensively tested the NLP text summarization models, as there were many parameters we had to tune to get summaries that fit our use case.
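
This write-up does not say which mechanism the custom bridge used. One common pattern that needs no extra libraries is for the JavaScript side to start the Python program as a child process and exchange one JSON message per line over its standard input and output. The Python half of such a bridge might look like the sketch below; the function name `serve` and the message fields are hypothetical, chosen only for illustration.

```python
import json
import sys
from typing import Callable, TextIO


def serve(
    handle: Callable[[dict], dict],
    stdin: TextIO = sys.stdin,
    stdout: TextIO = sys.stdout,
) -> None:
    """Answer newline-delimited JSON requests until the input closes.

    Each non-empty input line is parsed as one JSON request, passed to
    `handle`, and answered with exactly one JSON line on the output.
    """
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        try:
            reply = handle(json.loads(line))
        except Exception as exc:
            # Report the failure to the caller instead of crashing,
            # so the long-running worker stays alive.
            reply = {"ok": False, "error": str(exc)}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the parent process is waiting on this line
```

Flushing after every reply matters here: without it, output can sit in a buffer and the bot would appear to hang while waiting for a response.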

Accomplishments that we're proud of

We are proud that, in roughly 24 hours, we turned an idea into a usable tool that works exactly as we intended, with plenty of features waiting to be added in the near future. Despite the numerous errors we faced, we never gave up, and we are genuinely satisfied with our efforts as a team.

What we learned

Hack the Valley was the first hackathon any of us had participated in, and it taught us a lot, whether that was implementing new APIs and ML models for the first time or just handling failures. We went into this challenge doubtful that we would be able to write even a half-functioning program, but we surprised ourselves with how much we could get done when we put in serious effort. We learned to stick together and solve problems that we had never encountered before, and we are genuinely proud of the hard work all of us put in.

What's next for Discribe

Our main goal is to expand the use of our bot to other social media platforms, mainly Reddit. With some code refactoring, the bot could automatically transcribe videos as they are posted. We could achieve this scale using hosting services such as Microsoft Azure. To make the bot more useful, we will work on detecting and extracting critical information such as key topics, phone numbers, and addresses. We could even separate voices in the transcription to give clear context about who is speaking.