About the Project

Inspiration

This project was inspired by the idea of bridging human emotion and machine understanding. We wanted to explore how AI could listen, see, and respond in real time — not just with words, but with empathy. The motivation came from observing how often communication, both virtual and in person, lacks awareness of emotional tone. Whether in healthcare, education, or customer service, the ability to sense and respond to emotion could redefine how technology supports people.

When we realized that Ray-Ban Meta Glasses had no developer APIs for live video or audio, we saw a challenge worth solving. By creating our own real-time developer pipeline from the glasses to our backend, we not only “jailbroke” their data flow (ethically and legally) but also opened the door to using wearable technology for emotion and behavior analysis.


What We Learned

This project taught us how multimodal AI systems come to life when different data streams synchronize. We learned to integrate:

  • Computer vision for facial expression analysis using a custom facial expression model
  • Audio processing and speech recognition through Deepgram WebSockets
  • Contextual summarization and tonality inference via Google Gemini
  • Voice synthesis through ElevenLabs for natural, emotionally adaptive playback

We also learned that building a real-time system isn’t just about speed — it’s about latency orchestration. Each millisecond counts, and handling asynchronous video, audio, and API calls taught us to manage concurrency, rate limits, and fault tolerance at scale.

On a higher level, we gained insight into how humans communicate beyond words. Emotions, pauses, microexpressions — these subtle cues became data points that AI could learn to understand.


How We Built It

We designed a multi-threaded, event-driven backend in Python that combines several key components:

  1. Video Capture & Emotion Recognition
    • Captured frames via a virtual camera stream from Ray-Ban Meta glasses.
    • Used OpenCV for frame extraction and preprocessing.
    • Sent frames to a RunPod-hosted facial expression model for emotion scoring across seven categories, producing scores e_1 through e_7, each in the range [0, 1].

Mathematically, each frame at time t produced an emotion vector: E_t = [anger, disgust, fear, happiness, neutral, sadness, surprise].
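To make these per-frame scores easier to reason about downstream, each reading can be normalized and reduced to a dominant label. The sketch below is illustrative: the label order matches the vector above, but the normalization step and function names are our own assumptions, not the actual RunPod model's contract.

```python
# Illustrative post-processing of the seven-category emotion scores.
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "neutral", "sadness", "surprise"]

def emotion_vector(raw_scores):
    """Normalize seven raw scores so they sum to 1, keyed by label."""
    total = sum(raw_scores) or 1.0  # guard against an all-zero output
    return {label: score / total for label, score in zip(EMOTIONS, raw_scores)}

def dominant_emotion(vector):
    """Return the label with the highest normalized score."""
    return max(vector, key=vector.get)
```

Normalizing first keeps thresholds comparable across frames even when the model's raw scores do not sum to one.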

  2. Audio Capture & Transcription

    • Captured real-time microphone input through PyAudio.
    • Streamed audio chunks to Deepgram’s WebSocket endpoint for sub-second transcription.
  3. AI Summarization

    • Combined transcripts and emotion data into structured logs in Firebase.
    • Fed the aggregated context to Google Gemini 2.5 Flash, which extracted critical segments, emotional tone, and key moments.
  4. Voice Feedback

    • Generated spoken summaries using ElevenLabs voice synthesis.
    • Selected voice profiles dynamically based on emotional tone (e.g., calm, professional, energetic).
  5. Ray-Ban Meta “Jailbreak” Workflow

    • Streamed from glasses → phone → desktop → screengrab → virtual camera.
    • Created a virtual feed accessible via OpenCV, enabling computer vision on wearable-captured data.
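For the audio capture step above, sub-second transcription depends on streaming small chunks rather than whole recordings. The arithmetic below shows the chunk sizing involved; the 16 kHz mono 16-bit PCM format is an assumption for illustration, since Deepgram accepts several encodings.

```python
# Back-of-the-envelope chunk sizing for streaming raw PCM audio.
# The format constants are assumptions, not the project's exact config.
SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono

def chunk_bytes(duration_s):
    """Bytes of raw PCM audio covering `duration_s` seconds."""
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_s)

def chunk_duration(n_bytes):
    """Seconds of audio represented by `n_bytes` of raw PCM."""
    return n_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)
```

At these settings, a quarter-second chunk is 8 kB, small enough to keep the WebSocket round trip well under the transcription latency target.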
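For the summarization step, transcripts and emotion readings have to be merged into a single context before they reach Gemini. A minimal sketch of that merge, assuming a simplified log schema (timestamped dicts, not the actual Firebase documents):

```python
def build_context(transcript_segments, emotion_log):
    """Interleave transcript text with the nearest-in-time emotion label.
    Both inputs are lists of dicts with a 't' timestamp in seconds;
    this schema is an assumption for illustration."""
    lines = []
    for seg in transcript_segments:
        # Pick the emotion reading closest in time to this segment.
        nearest = min(emotion_log, key=lambda e: abs(e["t"] - seg["t"]))
        lines.append(f'[{seg["t"]:.1f}s | {nearest["emotion"]}] {seg["text"]}')
    return "\n".join(lines)
```

Tagging each utterance with its nearest emotion reading gives the model both what was said and how it was said in one pass.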
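The voice-feedback step's dynamic profile selection can be as simple as a lookup from dominant emotion to voice style. The mapping below is a hypothetical example; the profile names are placeholders, not actual ElevenLabs voice IDs.

```python
# Hypothetical tone-to-voice routing table (placeholder names).
VOICE_PROFILES = {
    "happiness": "energetic",
    "surprise": "energetic",
    "neutral": "professional",
    "disgust": "professional",
    "sadness": "calm",
    "fear": "calm",
    "anger": "calm",
}

def pick_voice(dominant_emotion):
    """Fall back to a neutral default for unseen labels."""
    return VOICE_PROFILES.get(dominant_emotion, "professional")
```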

Challenges We Faced

  1. No Developer Access for Ray-Ban Glasses:
    Meta’s ecosystem is closed. We built an entirely custom workflow — mirroring the phone app, capturing screen pixels, and feeding those frames into OpenCV as a virtual webcam — effectively creating a DIY API.

  2. Concurrency & Synchronization:
    Processing multiple asynchronous streams (video, audio, and transcription) meant constant synchronization headaches. We used queues, locks, and async workers to maintain real-time alignment.

  3. Latency Management:
    Emotion detection, speech recognition, and Firestore updates each introduced small delays. Keeping the system under ~2 seconds of total end-to-end lag required optimizing frame intervals, API batching, and threading.

  4. API Rate Limits & Failures:
    When generating summaries or voiceovers, we encountered Gemini rate limits and ElevenLabs throttling. Implementing exponential backoff and caching became crucial.

  5. Human-Centered Design:
    Translating technical results (like numeric emotion vectors) into intuitive insights required designing metrics that mean something to users — like “engagement peaks” or “emotional consistency.”
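The synchronization challenge above comes down to a fan-in pattern: each stream pushes timestamped events onto one thread-safe queue, and a single worker drains them in capture order. A minimal sketch with stand-in stream names and payloads:

```python
import queue
import threading

# Shared thread-safe queue that all producer streams feed into.
events = queue.Queue()

def producer(stream_name, readings):
    """Push (timestamp, stream, payload) events from one stream."""
    for t, payload in readings:
        events.put((t, stream_name, payload))

def drain_ordered(n_events):
    """Collect n_events from the queue, ordered by capture time."""
    collected = [events.get() for _ in range(n_events)]
    return sorted(collected, key=lambda item: item[0])

threads = [
    threading.Thread(target=producer,
                     args=("video", [(0.0, "E_0"), (1.0, "E_1")])),
    threading.Thread(target=producer,
                     args=("audio", [(0.5, "hello"), (1.5, "world")])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

timeline = drain_ordered(4)
```

Ordering by capture timestamp rather than arrival time is what keeps a slow emotion-model round trip from scrambling the merged timeline.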
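For the rate-limit challenge, exponential backoff with jitter is the standard recovery policy. A sketch of the idea, with the sleep function injectable so the retry logic can be exercised without real delays (the function names are ours, not from either API's SDK):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` on failure, waiting exponentially longer each time."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # Full jitter: wait between 0 and base * 2^attempt seconds,
            # so retrying clients do not hammer the API in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Caching already-generated summaries and voiceovers sits in front of this, so the backoff path only runs on genuinely new requests.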


Reflection

Building this system showed us that empathy can be engineered, not as a replacement for human understanding, but as an amplifier of it. The intersection of computer vision, speech analysis, and generative AI revealed a powerful truth:

Machines don’t need to feel emotions — but they can learn to understand ours.

This project blurred the line between observation and understanding, and in doing so, it gave us a glimpse into a future where technology doesn’t just listen — it cares.

AI tools: ChatGPT & Claude
