Inspiration
Speech therapy, a necessity for many children with autism or other speech disorders, costs over $100 per hour. Over the course of the COVID-19 pandemic, inflation reached a 40-year high, with the Consumer Price Index (CPI) rising roughly 8%. This makes speech therapy prohibitively expensive for many families and can strain their finances. I have personally seen close family friends deeply concerned about their family's livelihood after paying for weekly speech therapy for their autistic child. Seeing this reality, we wanted to create a speech therapy/practice app that helps not only people with speech disorders but ANYONE with speech anxiety. While our app was inspired by children with speaking disorders, speech anxiety and a lack of speaking confidence affect far more people than those with identified disorders. The National Library of Medicine states that “Public speaking anxiety (PSA, also known as fear of public speaking, or the fear of speaking in public) is classified in the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders; American Psychiatric Association) as a social anxiety disorder. It is reported as prevalent in 15% to 30% of the general population.” Speech anxiety affects the daily lives of people with and without speech disorders. Speech Splendid addresses this by giving valuable semantic and expressional feedback that can be used to improve speech confidence and performance. Our goal is to give the gift of expression through speech, person by person.
What it does
SpeechSplendid utilizes AI/ML techniques to analyze numerous facets of speech. The two components of public speaking that SpeechSplendid targets are content and delivery, and the feedback provided revolves around these two aspects of speech. The user is first presented with simple syntax results. Filler words and hedging language are detrimental to the perceived “confidence” of a speech, so the frequency of these types of language is summarized for the user. The user’s words per minute is displayed along with the ideal range to help pace the speech. Sentiment analysis is performed on the content to gauge the speech’s underlying message and tone. The output is a compound score that takes into account the positive, negative, and neutral components of the speech. The user can adjust their speech to be more clearly positive or negative depending on their goal. For example, a user speaking on an issue they deeply detest would want to choose wording that is more negative, and would therefore aim for a lower compound score. Two other forms of feedback displayed are topic and behavioral information. Entities throughout the speech are stored as topics and subtopics, and these are output to the user along with behavioral characteristics, which result from more intricate semantic analysis. The topic report lets the user know the perceived structure and focus of their speech. Lastly, a deep convolutional neural network outputs the most frequently observed facial expression, which helps the user align their facial movements with their speech’s content.
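The "most frequently observed facial expression" summary amounts to a majority vote over per-frame classifier labels. A minimal sketch of that aggregation step, assuming the classifier has already emitted one label string per sampled frame (the label values here are illustrative):

```python
from collections import Counter

def dominant_expression(frame_labels):
    """Return the most frequently observed expression label
    across the sampled frames (simple majority vote)."""
    if not frame_labels:
        return None
    return Counter(frame_labels).most_common(1)[0][0]

# Example: per-frame labels as an emotion classifier might emit them
labels = ["neutral", "happy", "happy", "surprise", "happy", "neutral"]
print(dominant_expression(labels))  # -> happy
```

Ties break arbitrarily here; a fuller version could average per-frame class probabilities instead of voting on hard labels.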
Technical Specifications
The high-level workflow:

1. Separate the speech recording into its audio and video components. The user’s file is stored temporarily and deleted after the program executes to preserve privacy. MoviePy is used to edit the video and extract the audio into a .wav file.
2. Transcribe the audio to text using IBM’s Speech-to-Text API. The text is then parsed for filler words, empty adverbs, and hedging language; the lists of these words were compiled by researching various dictionaries and public speaking forums. Each category of fluff language is assigned a percentage value that tells the user how much of their speech that type of language contributes. Words per minute is calculated by splitting the transcript into a list of words and dividing that list’s length by the duration of the video recording. Depending on the ranges the metrics fall into, tailored feedback is displayed on the frontend with tips on how to improve for the next attempt.
3. Perform sentiment analysis using the VADER analysis tool, chosen for its proven effectiveness on tasks like this. Behavioral and topic analysis is done in part through expert.ai’s document analysis NLP API. For the facial analysis, DeepFace, a nine-layer deep convolutional neural network developed and trained by Facebook, is used to classify static frames of the video, with OpenCV handling video processing and frame sampling. Streamlit was used as the frontend framework due to its simple integration with Python and backend ML models.
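The fluff-percentage and words-per-minute metrics from step 2 can be sketched in a few lines of plain Python. The function names and the word lists below are illustrative, not the app's actual code:

```python
def fluff_percentage(transcript, fluff_words):
    """Share of the transcript made up of one fluff category,
    e.g. filler words, empty adverbs, or hedging language."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in fluff_words)
    return 100.0 * hits / len(words)

def words_per_minute(transcript, duration_seconds):
    """Pace metric: word count divided by recording length in minutes."""
    return len(transcript.split()) / (duration_seconds / 60.0)

transcript = "um so basically the plan is um quite simple"
fillers = {"um", "basically", "like"}
print(round(fluff_percentage(transcript, fillers), 1))  # -> 33.3
print(words_per_minute(transcript, 4.5))                # 9 words / 0.075 min -> 120.0
```

The real pipeline would run one such pass per category and then bucket the results into the feedback ranges shown on the frontend.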
Challenges we ran into
The most arduous problem was deciding which model architecture to use for the expression analysis. Using a more complex system of models would mean sacrificing the number of frames analyzed. Originally, we used the Python Facial Expression Analysis Toolbox (py-feat), which allowed us to import pretrained transfer learning models for frame analysis. Each frame would be run through five models, four of which were deep neural networks. The models performed face detection, alignment, landmarking, muscle contraction analysis, and emotion analysis. While the added facial detection did slightly improve analysis, the large number of computations per frame limited how many frames we could analyze from the video, since we had to balance accuracy with runtime and memory management. This complex system of models often crashed the app when hosted on Streamlit due to spikes in memory usage. The solution was to replace this approach with a lighter model: DeepFace, a nine-layer CNN trained on over 4 million images and developed by researchers at Facebook. DeepFace drastically decreased runtime while maintaining a high level of accuracy, and its light weight enabled us to sample more frames from the speech video, improving the accuracy of our expression scores.
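The trade-off above comes down to a per-video frame budget: a heavier model forces a smaller budget. One way to pick which frames to classify under a fixed budget is even spacing, sketched here in plain Python (the function name and numbers are illustrative, not the app's actual code):

```python
def sample_frame_indices(total_frames, max_samples):
    """Choose evenly spaced frame indices so the expression classifier
    runs on at most max_samples frames, regardless of video length."""
    if total_frames <= max_samples:
        return list(range(total_frames))
    stride = total_frames / max_samples
    return [int(i * stride) for i in range(max_samples)]

# 900 frames (e.g. 30 s at 30 fps) with a budget of 6 classified frames
print(sample_frame_indices(900, 6))  # -> [0, 150, 300, 450, 600, 750]
```

Swapping in a lighter classifier raises `max_samples` for the same runtime, which is why the move to DeepFace improved the expression scores.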
Accomplishments that we're proud of
SpeechSplendid reveals how automated public speaking coaching can be just as effective and intuitive as rehearsing with an actual person. Our app provides some of the most extensive feedback for people with limited access to professional public speaking coaches or classes—which is most of the population. It is free, quick, and most importantly: simple. Semantics and linguistics as a whole can be confusing and overwhelming to a layperson, but SpeechSplendid is able to perform complex tasks such as document entity analysis and still convey this information to the user in a clean, succinct manner. Our app achieves its goal of being an accessible, straightforward speech-practice platform that nearly anybody can use to solidify their public speaking skills and boost their confidence.
What's next for SpeechSplendid
Features we would add in a 2.0 iteration:

- Option for live recording
- Graphs of facial expression trends throughout the video (for each frame)
- Feedback for tone/behavioral analysis
- Database and account system to keep track of progress
- Practice activities to isolate and improve upon specific components of speech
Built With
- expert.ai
- machine-learning
- natural-language-processing
- py-feat
- python
- streamlit
- vader