Inspiration

Public speaking is scary (we experienced it first-hand before our own presentations). For many people who struggle with social cues, and for non-native English speakers, it can be hard to know exactly how their vocal tone is perceived by others, and asking for feedback can be embarrassing.

Existing tools can tell you what you say, but rarely how you say it. Moreover, many existing speech-emotion AI models are trained on small, homogeneous datasets (often recordings of older male academics). The goal of this tool is to help create a baseline that can serve a diverse population.

What it does

This is a web application that lets users record their speech directly in the browser. The trained AI analyzes acoustic features of the voice to provide users with feedback consisting of:

  • An Overall Confidence Score: Generated by a Machine Learning model.

  • Actionable feedback: Telling the user when they are speaking too fast or too quietly.

  • Gamified elements: An XP system, achievements to collect, and visually appealing congratulatory effects.
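As a concrete illustration of the volume check, here is a minimal sketch (the function name and threshold are hypothetical, not the app's actual code) that flags quiet audio by its RMS energy:

```python
import numpy as np

def volume_feedback(samples: np.ndarray, rms_threshold: float = 0.02) -> str:
    """Hypothetical heuristic: flag audio whose RMS energy is too low."""
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return "turn the volume up" if rms < rms_threshold else "volume ok"

# One second of a 220 Hz tone at two amplitudes, standing in for recorded speech.
sr = 16000
t = np.arange(sr) / sr
quiet = 0.005 * np.sin(2 * np.pi * 220 * t)
loud = 0.5 * np.sin(2 * np.pi * 220 * t)

print(volume_feedback(quiet))  # turn the volume up
print(volume_feedback(loud))   # volume ok
```

A real speaking-rate check would additionally need a transcript or syllable detection, so it is left out of this sketch.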

How we built it

  • The Dataset: A dataset of 3000 clips extracted from the Mozilla Common Voice dataset, each labelled as high or low confidence, was used initially. This ensured the foundational model was trained on a wide distribution of global accents, ages, and genders. To improve accuracy further, about forty 5-second recordings were added to the dataset to fine-tune the model for my conditions.

  • The AI Architecture: I used scikit-learn to train a Random Forest classifier. For feature extraction, the Python library librosa computes the mean MFCCs of each clip, producing a fixed-length acoustic summary regardless of the audio's duration.

  • The Frontend: The UI was built with Streamlit. The design is simple and applies universal design principles: high contrast, clean typography, and clear visual indicators for all feedback.

  • Calibration: Because the dataset is still quite small, a custom calibration script for different users and microphones is essential. The script record_finetune.py oversamples a user's own recordings to fine-tune the model.
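Put together, the training pipeline above amounts to: summarize each clip as mean MFCCs, then fit a Random Forest on those fixed-length vectors. The sketch below substitutes synthetic feature vectors for real audio so it runs without the dataset; the librosa call it stands in for is shown in the comment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# In the real pipeline the 13-dim feature vector comes from librosa:
#   mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
#   features = mfcc.mean(axis=1)   # fixed length, whatever the clip duration
# Here, two well-separated synthetic clusters stand in for those features.
rng = np.random.default_rng(0)
n_mfcc = 13
low = rng.normal(loc=-2.0, size=(200, n_mfcc))    # "low confidence" clips
high = rng.normal(loc=+2.0, size=(200, n_mfcc))   # "high confidence" clips
X = np.vstack([low, high])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

Averaging the MFCC matrix over time is what lets clips of any length share one feature space, at the cost of discarding how the voice changes within a clip.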

Challenges we ran into

The biggest hurdle was getting the model to work outside the notebook. It reached 85% accuracy when evaluated against the clean, high-quality files from the original dataset, but accuracy dropped sharply on my own live recordings, so I had to create the fine-tuning script.
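The oversampling idea behind that fine-tuning script can be sketched like this (the array sizes and repeat factor are illustrative, not the actual record_finetune.py code): duplicate the user's few calibration clips so they carry meaningful weight next to the much larger base set, then retrain.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_mfcc = 13

# Stand-ins for the 3000 Common Voice feature vectors and their labels.
X_base = rng.normal(size=(3000, n_mfcc))
y_base = rng.integers(0, 2, size=3000)

# Stand-ins for the ~40 user calibration recordings.
X_user = rng.normal(loc=1.0, size=(40, n_mfcc))
y_user = rng.integers(0, 2, size=40)

# Oversample the user clips so they are not drowned out by the base data.
factor = 20  # hypothetical ratio; would be tuned per setup
X = np.vstack([X_base, np.tile(X_user, (factor, 1))])
y = np.concatenate([y_base, np.tile(y_user, factor)])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Simple duplication keeps the base model's breadth while biasing it toward the user's microphone and room conditions.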

Accomplishments that we're proud of

I'm just happy that I built an app that can genuinely be useful to many people (after some polishing, obviously).

What we learned

The brutal reality of actually deploying a Machine Learning model. An AI model can have fantastic accuracy in a Jupyter Notebook, but the moment you try it in the real world, that accuracy means almost nothing without proper calibration techniques. It was also great to learn that you can completely bypass complex Node.js stacks by deploying your model with Streamlit.

What's next for Confident Speaker

Next steps include building a full web app with a robust back-end and adding more persuasive-technology features to make the app more enjoyable to use. The fine-tuning flow also needs to be integrated directly into the app. Finally, expanding the dataset would improve the model's generalisability and reduce the need for fine-tuning.
