Inspiration
Many of us, at one point or another, have crossed paths with someone who experiences difficulty in social environments. It could be due to anxiety, speech-related impediments, or some form of neurodivergence. For some of us, they could be family or close friends. Helping them is the inspiration for Cue.
What it does
Cue lets you practice conversation in a safe, low-stakes, realistic setting. As you talk with the AI, our model analyzes facial expressions in real time to infer emotion and adapt the dialogue, making responses sound more natural. ElevenLabs' speech-to-text and text-to-speech APIs support verbal conversation, while Gemini generates thoughtful, supportive replies.
How we built it
The frontend is built with React and the backend with Flask. OpenCV accesses the user's camera and captures frames of their face as they speak; those frames are fed into our custom model, which infers the user's emotion in real time. ElevenLabs' speech-to-text API transcribes the user's speech. We then send the transcript, the conversation context, and the emotion our custom model predicted to Gemini, which generates a meaningful, tailored response. Gemini's reply is converted back into speech with ElevenLabs' text-to-speech and played for the user.
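The turn-by-turn flow above can be sketched as a simple pipeline. This is a hypothetical reconstruction, not the team's actual code: every function here is a stub standing in for the real component named in its comment (OpenCV capture, the custom CNN, ElevenLabs, Gemini), and all names are invented for illustration.

```python
# Hypothetical sketch of one conversational turn in Cue's backend.
# Each stage is stubbed; the real app calls the external services noted.

def transcribe(audio_bytes):        # ElevenLabs speech-to-text in the real app
    return "hi, how are you?"

def infer_emotion(frame):           # custom CNN run on an OpenCV camera frame
    return "nervous"

def generate_reply(text, emotion):  # Gemini, prompted with transcript + emotion
    return f"[supportive reply to '{text}', given the user seems {emotion}]"

def synthesize(text):               # ElevenLabs text-to-speech
    return text.encode()

def handle_turn(audio_bytes, frame):
    """Orchestrate one turn: hear the user, read their face, reply aloud."""
    text = transcribe(audio_bytes)
    emotion = infer_emotion(frame)
    reply = generate_reply(text, emotion)
    return synthesize(reply)
```

In the real app this orchestration would sit behind a Flask route, with the frame pulled from the OpenCV capture loop running alongside it.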
Challenges we ran into
It was difficult to find an accessible dataset containing high-quality, annotated images of people's facial expressions. The dataset we used for training (link) consisted of images of people showing exaggerated facial expressions, which don't reflect the subtle micro-expressions people use in conversation. We also attempted to generate an image dataset from a database of video clips (link), but ran into issues using those images during training.
Accomplishments that we're proud of
- Designed a custom convolutional neural network from scratch: architecture definition, input specifics, batch normalization, and a weighted loss.
- Finished the project in 36 hours.
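A network with the ingredients listed above might look like the following sketch. This is not the team's actual architecture: the layer sizes, the 48×48 grayscale input shape, the seven emotion classes, and the specific class weights are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Minimal CNN sketch; shapes and depths are assumed, not Cue's real model."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            # BatchNorm after each conv stabilizes training on small datasets.
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                          # 48x48 -> 24x24
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                          # 24x24 -> 12x12
        )
        self.classifier = nn.Linear(64 * 12 * 12, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Weighted cross-entropy counters class imbalance: rarer emotions in the
# training set get larger weights (these values are illustrative only).
class_weights = torch.tensor([1.0, 4.0, 1.5, 0.8, 1.2, 1.0, 2.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)
```

Passing `weight=` to `CrossEntropyLoss` scales each class's contribution to the loss, so the network isn't dominated by the most common expressions.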
What we learned
- How to record and play audio using Python.
- How to use PyTorch to build custom models and control how they work.
- More about reinforcement learning.
- How to use CUDA to optimize the preprocessing and training pipeline.
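The CUDA point above typically boils down to a device-placement pattern in PyTorch. This is a generic sketch rather than Cue's pipeline code; the `Linear` layer is a stand-in for the real CNN.

```python
import torch

# Use the GPU when one is available; the same code runs unchanged on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 2).to(device)   # stand-in for the real model

batch = torch.randn(4, 10)
# non_blocking=True lets the host-to-device copy overlap with compute
# when the source tensor lives in pinned memory (e.g. pin_memory=True
# on the DataLoader).
batch = batch.to(device, non_blocking=True)
logits = model(batch)
```

The same `.to(device)` call moves preprocessing tensors, so the whole training loop stays on the GPU instead of bouncing data back and forth.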
What's next for Cue
Cue would greatly benefit with feedback from licensed mental health and speech therapy professionals, who can help ensure the experience is safe, supportive for diverse user needs, and based on science. We also plan to implement another custom model for inferring emotion from the user's tone. This will provide further emotional context to Gemini, enabling it to further tailor its response.