Inspiration

In business meetings it is often required, or at least helpful, to have a transcript of the meeting afterwards. Modern tools can transcribe meetings automatically using speech-to-text services, but they usually lack the information of who said what.

This issue arises in other scenarios as well. For example, in a D&D play group, a transcript of what happened could be helpful for the next session.

What it does

By identifying who is speaking, TranscribeAId automatically tags each spoken sentence with its speaker's name. Easy and simple. Well, that's what is planned. Right now we have an application for live speech-to-text transcription and a script to post-process the resulting txt file.
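The post-processing idea can be sketched roughly like this: match each transcribed sentence to the diarisation segment that covers its timestamp. This is only a minimal illustration; the function name and the segment/sentence formats are assumptions, not the actual script.

```python
def tag_transcript(sentences, segments):
    """Tag transcript sentences with speaker names.

    sentences: list of (start_sec, text) from the speech-to-text output.
    segments:  list of (start_sec, end_sec, speaker) from diarisation.
    Returns a list of "Speaker: text" lines.
    """
    tagged = []
    for start, text in sentences:
        # pick the diarisation segment that contains the sentence's start time
        speaker = next(
            (name for seg_start, seg_end, name in segments
             if seg_start <= start < seg_end),
            "Unknown",  # no segment covers this timestamp
        )
        tagged.append(f"{speaker}: {text}")
    return tagged
```

For example, `tag_transcript([(0.5, "Hello everyone."), (3.2, "Hi!")], [(0.0, 3.0, "Alice"), (3.0, 6.0, "Bob")])` would yield `["Alice: Hello everyone.", "Bob: Hi!"]`.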

How we built it

We tie everything together with a combination of JavaScript for a prototype web app and Python scripts that use SpeechBrain for speaker recognition and diarisation, along with the Google Speech-to-Text API.
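At its core, the speaker-recognition part compares voice embeddings: a pretrained model such as SpeechBrain's turns an audio clip into an embedding vector, and matching a segment to a known speaker is then a nearest-neighbour search by similarity. A stdlib-only sketch of that matching step, with made-up vectors and a hypothetical threshold (real embeddings come from the model, and the actual pipeline may differ):

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify_speaker(embedding, enrolled, threshold=0.7):
    """Return the enrolled speaker whose reference embedding is most
    similar to `embedding`, or "Unknown" below the (assumed) threshold.

    enrolled: dict mapping speaker name -> reference embedding.
    """
    name, score = max(
        ((n, cosine(embedding, e)) for n, e in enrolled.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else "Unknown"
```

In practice the enrolled reference embeddings would be computed once per known speaker from a short voice sample, then each diarised segment's embedding is matched against them.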

Challenges we ran into

Speaker recognition and diarisation are tough topics and still quite fresh in the fields of ML and AI. Resources are sparse, and the few libraries and scripts capable of doing it are hard to get running. Since there was not enough time for data preparation and model training, we had to find solutions built on already trained models. Existing, stable APIs turned out to be in beta or limited to business customers, and thus not available to us.

Accomplishments that we're proud of

After realizing how big the machine learning part of this project is, and after a lot of trial and error, research, and despair, we stuck with it, completed our first Hackathon, and built a great prototype.

What we learned

Machine learning in the field of voice recognition is hard and takes more time than anticipated. We expected a steep learning curve, especially with audio, but it surprised us how long processing even a 5-minute conversation takes. We had hoped to achieve more with the AI tools, yet we appreciate the learning experience.

What's next for TranscribeAId

TranscribeAId is just a proof of concept as of now. But integrating a solution like ours into commonly used meeting applications could make it much easier to train a robust model for this job. By consuming voice data associated with each user's name, the model could passively listen to meetings and train itself continuously while providing transcriptions at the same time. Once trained well enough, it could also be used in non-remote meeting scenarios.
