Inspiration

The post-COVID era has brought back in-person events and, with them, a greater need for public speaking. However, many individuals are anxious about articulating their ideas publicly, whether in a class presentation, a technical workshop, or their next interview. It is often difficult for audience members to catch the true intent of the presenter, and key factors including tone of voice, verbal excitement and engagement, and physical body language can make or break a presentation.

A few weeks ago, during our first project meeting, we were responsible for leading the meeting and were overwhelmed with anxiety. Despite knowing the content of the presentation and having worked on projects for a while, we understood the impact that a single below-par presentation could have. To the audience, you may look unprepared and unprofessional, even if you know the material and are simply nervous. Regardless of the presenter's intentions, this can leave a bad taste in the audience's mouths.

As a result, we wanted to create a judgment-free platform to help presenters understand how an audience might perceive their presentation. By creating Speech Master, we provide an opportunity for presenters to practice without facing a real audience while receiving real-time feedback.

Purpose

Speech Master aims to provide a platform for practice presentations with real-time feedback that captures details about your body language and verbal expression. In addition, presenters can invite real audience members to a practice session, where those audience members can provide real-time feedback the presenter can use to improve.

Presentations are recorded and saved so presenters can go back later and review the feedback from both the ML models and live audiences. A user-friendly dashboard cleanly organizes their presentations for review before upcoming events.

After each practice presentation, the data aggregated during the recording is processed to generate a final report. The final report includes the most common emotions expressed verbally as well as the moments when the presenter's physical body language could be improved. Timestamps are also saved so the presenter can see when alerts were raised and, with the video playback, what might have caused them in the first place.
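
To make the report concrete, here is a hypothetical shape for the aggregated data described above; the field names are illustrative, not our exact stored schema.

```typescript
// Hypothetical report structure; names are illustrative, not the real schema.
interface PresentationReport {
  presentationId: string;
  topEmotions: { emotion: string; averageScore: number }[];      // most common verbal emotions
  postureAlerts: { timestampSeconds: number; issue: string }[];  // moments flagged by pose detection
  videoUrl: string;                                              // playback used to review each alert
}
```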

Tech Stack

We built the web application using Next.js 14, a React-based framework that seamlessly integrates backend and frontend development, and deployed it on Vercel, the company behind Next.js. We designed the website in Figma and styled it with TailwindCSS, which streamlines styling by letting developers put utility classes directly into the markup without extra files. Code formatting and linting were maintained with Prettier and ESLint, and both tools were run on every commit via pre-commit hooks configured with Husky.

Hume AI provides the Speech Prosody model with a streaming API over native WebSockets, allowing us to deliver emotional analysis to the presenter in near real time. The analysis helps the presenter see the emotions conveyed by the tune, rhythm, and timbre of their speech.
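
As a rough sketch of how this streaming flow works: the browser opens a WebSocket to Hume, sends base64-encoded audio chunks tagged with the prosody model, and receives emotion scores back. The endpoint URL, auth query parameter, and message shape below are assumptions based on our usage; consult Hume's current docs for the exact contract.

```typescript
// Minimal sketch of streaming audio to Hume's Speech Prosody model.
// URL, query parameter, and payload fields are assumptions, not a verified contract.
const HUME_WS_URL = "wss://api.hume.ai/v0/stream/models";

export function connectToHume(apiKey: string, onEmotions: (predictions: unknown) => void) {
  const socket = new WebSocket(`${HUME_WS_URL}?apiKey=${apiKey}`);

  socket.onmessage = (event) => {
    const payload = JSON.parse(event.data);
    // The prosody model returns per-chunk emotion predictions.
    if (payload?.prosody?.predictions) {
      onEmotions(payload.prosody.predictions);
    }
  };

  // Call the returned function with each base64-encoded audio chunk from the mic.
  return (base64Audio: string) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify({ models: { prosody: {} }, data: base64Audio }));
    }
  };
}
```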

Google and TensorFlow provide the MoveNet model, a large improvement over the prior PoseNet model, which enables real-time pose detection. MoveNet is an ultra-fast and accurate model capable of detecting 17 body keypoints at 30+ FPS on modern devices.
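
A minimal sketch of MoveNet in the browser via the `@tensorflow-models/pose-detection` package, assuming a webcam `<video>` element is already playing; the detection loop and what you do with the keypoints are up to the application.

```typescript
// Sketch: real-time pose detection with MoveNet and TensorFlow.js.
import * as tf from "@tensorflow/tfjs-core";
import "@tensorflow/tfjs-backend-webgl"; // register the WebGL backend
import * as poseDetection from "@tensorflow-models/pose-detection";

export async function startPoseDetection(video: HTMLVideoElement) {
  await tf.ready();

  const detector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet,
    { modelType: poseDetection.movenet.modelType.SINGLEPOSE_LIGHTNING },
  );

  const loop = async () => {
    // Each pose contains 17 keypoints (nose, shoulders, elbows, wrists, ...).
    const poses = await detector.estimatePoses(video);
    if (poses[0]) {
      console.log(poses[0].keypoints); // e.g. flag slouching or off-camera joints here
    }
    requestAnimationFrame(loop);
  };
  loop();
}
```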

To handle authentication, we used NextAuth with Google sign-in, hooked up to a Prisma adapter that interfaces with CockroachDB, allowing us to maintain user sessions across the web app. Cloudinary, an image and video management service, was used to store and retrieve videos. Socket.io was used on top of WebSockets to power the messaging feature, letting audience members send feedback to the presenter while video and audio stream simultaneously. We used Git and GitHub to host our source code, run continuous integration via GitHub Actions, make pull requests, and keep track of issues and projects.
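
The auth wiring looks roughly like the sketch below: NextAuth with a Google provider and the Prisma adapter, while Prisma itself talks to CockroachDB through the connection string in `schema.prisma`. Exact package names vary by NextAuth version, so treat this as a sketch rather than our exact configuration.

```typescript
// Sketch of pages/api/auth/[...nextauth].ts: Google sign-in with a Prisma adapter.
// Adapter package name assumes NextAuth v4 (@next-auth/prisma-adapter).
import NextAuth from "next-auth";
import GoogleProvider from "next-auth/providers/google";
import { PrismaAdapter } from "@next-auth/prisma-adapter";
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient(); // connects to CockroachDB via DATABASE_URL

export default NextAuth({
  adapter: PrismaAdapter(prisma), // persists users, accounts, and sessions
  providers: [
    GoogleProvider({
      clientId: process.env.GOOGLE_CLIENT_ID!,
      clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
    }),
  ],
});
```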

Challenges

It was our first time working with Hume AI and a streaming API. We had experience with the traditional REST style used by Hume AI's batch API, but the streaming API was more advantageous for real-time analysis. Instead of an HTTP client such as Axios, it required creating our own WebSocket client and calling the API endpoint from there. It was also a hurdle to capture and save audio in the correct format for the API while keeping it in sync with the webcam input.
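
The capture side ended up looking something like the sketch below: record the microphone in short chunks with MediaRecorder, convert each chunk to base64, and hand it to the socket sender. The mime type and chunk interval are assumptions that happened to work for us, not requirements of the API.

```typescript
// Sketch: capture mic audio in chunks and forward them to the streaming socket.
// `sendChunk` is the sender returned by a helper like connectToHume above.
export async function streamMicrophone(sendChunk: (base64: string) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // Record only the audio tracks; the full stream still drives the webcam preview,
  // which keeps the audio analysis aligned with what the presenter sees.
  const audioOnly = new MediaStream(stream.getAudioTracks());
  const recorder = new MediaRecorder(audioOnly, { mimeType: "audio/webm" });

  recorder.ondataavailable = async (event) => {
    if (event.data.size === 0) return;
    const bytes = new Uint8Array(await event.data.arrayBuffer());
    let binary = "";
    for (const b of bytes) binary += String.fromCharCode(b);
    sendChunk(btoa(binary)); // base64-encode the chunk before sending
  };

  recorder.start(5000); // emit a chunk every 5 seconds
  return stream;
}
```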

We also worked with TensorFlow for the first time, an end-to-end machine learning platform. As a result, we faced many hurdles trying to set up TensorFlow and get it running in a React environment. Most of the documentation uses the Python SDK or vanilla HTML/CSS/JS, neither of which fit our stack. Converting the vanilla JS examples to React proved difficult because of execution-order complexities around React's useEffect and useState hooks. We eventually found a working solution, though it can still be improved for better performance and fewer bugs.
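
The core of the React issue was making sure the detector is created once and torn down on unmount so re-renders do not spawn duplicate detection loops. A sketch of that pattern, using a hypothetical `usePoseDetector` hook rather than our exact component code:

```typescript
// Sketch: wrapping a TensorFlow.js detector in a React hook with proper cleanup.
import { useEffect, useRef } from "react";
import "@tensorflow/tfjs-backend-webgl";
import * as poseDetection from "@tensorflow-models/pose-detection";

export function usePoseDetector(videoRef: React.RefObject<HTMLVideoElement>) {
  const detectorRef = useRef<poseDetection.PoseDetector | null>(null);

  useEffect(() => {
    let cancelled = false;

    (async () => {
      const detector = await poseDetection.createDetector(
        poseDetection.SupportedModels.MoveNet,
      );
      if (cancelled) {
        detector.dispose(); // component unmounted while the model was loading
        return;
      }
      detectorRef.current = detector;

      const loop = async () => {
        if (cancelled || !videoRef.current) return;
        await detector.estimatePoses(videoRef.current);
        requestAnimationFrame(loop);
      };
      loop();
    })();

    return () => {
      // Cleanup runs on unmount (and on Strict Mode's dev double-invoke).
      cancelled = true;
      detectorRef.current?.dispose();
    };
  }, [videoRef]);
}
```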

We originally wanted to use the YouTube API for video management, where users would be able to post and retrieve videos from their personal accounts. NextAuth and YouTube did not initially agree on available scopes and permissions, and once that was resolved, more issues arose. We were unable to find documentation for a Node.js SDK and eventually even hit our API quota. As a result, we dropped YouTube, as it was not a feasible solution, and found Cloudinary.
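
With Cloudinary, storing a recording boils down to one upload call from the server side. A sketch using Cloudinary's Node SDK; the folder and environment variable names are ours to illustrate the idea, not anything Cloudinary mandates.

```typescript
// Sketch: uploading a recorded presentation to Cloudinary from a Next.js API route.
import { v2 as cloudinary } from "cloudinary";

cloudinary.config({
  cloud_name: process.env.CLOUDINARY_CLOUD_NAME,
  api_key: process.env.CLOUDINARY_API_KEY,
  api_secret: process.env.CLOUDINARY_API_SECRET,
});

export async function uploadRecording(filePath: string) {
  // resource_type: "video" tells Cloudinary to treat the upload as video,
  // giving us playback URLs for the dashboard's review feature.
  const result = await cloudinary.uploader.upload(filePath, {
    resource_type: "video",
    folder: "speech-master/recordings",
  });
  return result.secure_url;
}
```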

Accomplishments

We are proud of incorporating machine learning into our application for a meaningful purpose. We did not want to reinvent the wheel by creating our own models, but rather use existing, incredibly powerful models to create new solutions. Although we did not hit all the milestones we were hoping to achieve, we are still proud of the application we built in such a short amount of time, and of being able to deploy the project as well.

Most notably, we are proud of our Hume AI and TensorFlow integrations, which took our application to the next level. Those two features took the most time, but they were also the most rewarding: in the end, we got to see real-time updates of our emotional and physical states. We are proud that the application runs and gives feedback in real time, offering small cues on what to improve without distracting the presenter completely.

What we learned

Each of the developers learned something valuable, as each of us worked with a technology we did not know previously. Notably, Prisma and its integration with CockroachDB made sessions and general database usage simple and user-friendly. Interfacing with CockroachDB caused barely any problems, and it was a powerful tool to work with.

We also expanded our knowledge of WebSockets, both native and via Socket.io. Our prior experience was rudimentary, but building on it showed us what WebSockets can do both inside the application and with external APIs, and how they enable real-time analysis.
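
The internal use is the audience feedback channel, which in Socket.io terms is just rooms and relayed events. A sketch of that server-side shape, with room and event names that are illustrative rather than our exact ones:

```typescript
// Sketch: Socket.io server relaying live audience feedback to the presenter.
import { Server } from "socket.io";

const io = new Server(3001, { cors: { origin: "*" } });

io.on("connection", (socket) => {
  // A presenter or audience member joins the room for one presentation.
  socket.on("join-presentation", (presentationId: string) => {
    socket.join(presentationId);
  });

  // Feedback is broadcast to everyone else in the room, so the presenter
  // sees it in real time alongside the model output.
  socket.on("feedback", (presentationId: string, message: string) => {
    socket.to(presentationId).emit("feedback", message);
  });
});
```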

Future of Speech Master

The first step for Speech Master will be to shrink the codebase. Currently, there is plenty of potential for components to be extracted and reused. Structuring the code to be stricter and more robust will ensure that the codebase stays readable, deployable, and functional as new features are added. The next priority will be responsiveness: due to the lack of time, many components render oddly on different devices, throwing off the UI and potentially making the application unusable.

Once the current codebase is restructured, we can focus on optimization, primarily around the machine learning models and audio/video handling. Currently, multiple audio and video streams are used to show webcam footage, stream footage to other viewers, and send data to Hume AI for analysis. By reducing the number of streams, we expect significant performance improvements, after which we can upgrade our audio/video streaming to something more appropriate and robust.
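
The intended optimization is roughly the sketch below: request the camera and microphone once and fan that single stream out to every consumer, instead of opening separate capture streams for the preview, the recorder, and the analysis pipeline.

```typescript
// Sketch of the planned single-capture approach: one getUserMedia call,
// with the same stream shared by the preview, the recorder, and analysis.
export async function createSharedStream(previewVideo: HTMLVideoElement) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  previewVideo.srcObject = stream;                                    // local webcam preview
  const recorder = new MediaRecorder(stream);                          // saved recording
  const analysisAudio = new MediaStream(stream.getAudioTracks());      // audio-only view for Hume

  return { stream, recorder, analysisAudio };
}
```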

In terms of new features, Speech Master would benefit greatly from additional forms of audio analysis, such as speaking speed and volume. Different presentations and environments call for different talking speeds and volumes, and given some initial parameters, Speech Master should be able to report on those measures. In addition, transcriptions that can be analyzed for vocabulary, ensuring that appropriate language is used for a given target audience, would drastically improve the way a presenter prepares for a presentation.
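
One possible way to measure volume, sketched below, is to compute the root-mean-square amplitude from a Web Audio AnalyserNode a few times per second; the thresholds for "too quiet" or "too loud" would come from the presenter's initial parameters. This is a sketch of one approach, not a committed design.

```typescript
// Sketch: RMS volume monitoring via the Web Audio API.
export function monitorVolume(stream: MediaStream, onLevel: (rms: number) => void) {
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  setInterval(() => {
    analyser.getFloatTimeDomainData(samples);
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    onLevel(rms); // ~0 for silence, growing with loudness
  }, 250);
}
```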
