Jarvis Eh?

A Real-Time AI Companion for People Living with Dementia, Built on Meta Glasses


Inspiration

Dementia doesn't take a person all at once. It takes them in moments: the moment they look at their granddaughter and can't find her name, the moment they forget what they were walking toward, the moment a conversation loops and they feel the confusion on every face around them.

Jarvis-eh? started with a simple question: what if someone's glasses could remember for them? Meta's smart glasses already sit on the face, already have a camera, already have speakers. They see exactly what the person sees. We wanted to make them into a quiet companion, one that never panics, and never makes the person feel like a problem to be managed.


What It Does

Jarvis-eh? streams a first-person video feed from the Meta glasses to a Python backend, which processes every captured frame. Responses are delivered as natural audio through the glasses' speakers, while the caregiver monitors everything through a live React dashboard.

Face Recognition + Memory Recall - When a familiar face appears, Jarvis matches it against the family photo album and whispers through the glasses:

"That's Sarah, your granddaughter. She came by last Tuesday and you watched a movie together."
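The matching step itself reduces to comparing face embeddings against the family album. A minimal stdlib sketch, assuming InsightFace-style embedding vectors; the `album` structure and the 0.45 threshold are illustrative stand-ins, not the project's actual store or tuning:

```python
import math

def cosine_similarity(a, b):
    # Dot product over magnitudes; embedding vectors are typically compared this way.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_face(embedding, album, threshold=0.45):
    """Return (name, memory_line) for the best album match above threshold, else None.

    `album` maps a name to (reference_embedding, memory_line) -- a hypothetical
    stand-in for the real family photo album store.
    """
    best_name, best_score = None, threshold
    for name, (ref_embedding, _memory) in album.items():
        score = cosine_similarity(embedding, ref_embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        return None
    return best_name, album[best_name][1]
```

On a match, the returned memory line is what gets handed to the TTS layer as whisper context.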

Situation Grounding - Detects confusion through optical-flow analysis of head-scanning motion, then orients the patient with a calm, context-aware message:

"You're at home in your living room. It's Thursday afternoon. David is in the kitchen."
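The "head-scanning" cue can be reduced to a simple signal: repeated left/right reversals in the horizontal component of the optical flow. A sketch of that heuristic under assumed thresholds (the real detector and its tuning may differ); `dx_history` would come from the OpenCV optical-flow step:

```python
def looks_like_scanning(dx_history, min_reversals=3, min_magnitude=2.0):
    """Heuristic confusion cue: repeated left/right reversals in horizontal
    optical-flow motion suggest the wearer is scanning the room.

    `dx_history` is a list of mean horizontal flow per frame, in pixels.
    The threshold values here are illustrative, not the project's tuning.
    """
    # Ignore frames with negligible motion before counting direction changes.
    strong = [dx for dx in dx_history if abs(dx) >= min_magnitude]
    reversals = sum(
        1 for a, b in zip(strong, strong[1:]) if (a > 0) != (b > 0)
    )
    return reversals >= min_reversals
```

Only when this cheap check fires does the backend pay for a Gemini call to compose the grounding message.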

Activity Continuity - Maintains a rolling buffer of inferred activities. When the patient goes still, it retrieves what they were doing:

"You were making tea. The kettle is on the counter to your left."
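The rolling buffer of activities can be as simple as a bounded deque of timestamped inferences; when stillness is detected, the freshest entry within some recency window is read back. A minimal sketch with assumed sizes and windows:

```python
import time
from collections import deque

class ActivityBuffer:
    """Rolling buffer of inferred activities; oldest entries evicted automatically."""

    def __init__(self, maxlen=20):
        self.events = deque(maxlen=maxlen)

    def record(self, activity, detail, ts=None):
        self.events.append((ts if ts is not None else time.time(), activity, detail))

    def last_activity(self, max_age_s=600, now=None):
        """What was the wearer doing recently? Returns None if nothing is fresh."""
        now = now if now is not None else time.time()
        for ts, activity, detail in reversed(self.events):
            if now - ts <= max_age_s:
                return activity, detail
        return None
```

The `detail` field ("kettle is on the counter to your left") is what makes the retrieved prompt spatially specific rather than generic.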

Wandering Guardian - If the patient leaves their safe zone, it plays a familiar redirect instead of an alarm:

"Hey Dad, let's head back home."
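The safe-zone check is, at its core, a geofence: is the wearer's position within some radius of home? A sketch assuming a location fix is available (e.g. from the paired phone); the home coordinates and 150 m radius are placeholder values:

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two lat/lon points, in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def outside_safe_zone(lat, lon, home=(43.6532, -79.3832), radius_m=150.0):
    """True when the wearer has drifted beyond the safe-zone radius."""
    return distance_m(lat, lon, *home) > radius_m
```

Crossing the boundary triggers the familiar redirect audio rather than an alarm tone.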

Conversation Copilot - Transcribes live audio with Whisper, detects topic loops, and whispers private context only the patient hears:

"She's talking about the cottage trip last summer."
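Loop detection on the Whisper transcript can be approximated with word-overlap similarity between the newest utterance and earlier ones. This Jaccard heuristic is a stand-in for whatever similarity measure the project actually uses:

```python
def _word_set(sentence):
    return {w.strip(".,!?").lower() for w in sentence.split() if w}

def detect_topic_loop(utterances, similarity=0.6):
    """Flag a conversational loop when the newest utterance closely matches
    an earlier one. `utterances` are Whisper transcript lines; the 0.6
    threshold is illustrative."""
    if len(utterances) < 2:
        return False
    latest = _word_set(utterances[-1])
    if not latest:
        return False
    for earlier in utterances[:-1]:
        prior = _word_set(earlier)
        if not prior:
            continue
        overlap = len(latest & prior) / len(latest | prior)
        if overlap >= similarity:
            return True
    return False
```

When a loop fires, the copilot whispers the missing context privately instead of letting the conversation stall.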


How We Built It

Meta Glasses (POV camera)
        │
        ▼
Phone → Laptop (screen capture, 1–2 FPS)
        │
        ▼
Python FastAPI Backend
  ├── InsightFace    — face recognition (ONNX, ~30–50ms)
  ├── OpenCV         — optical flow, motion detection
  ├── Gemini Vision  — scene classification, activity inference
  ├── ElevenLabs     — neural TTS through glasses speakers
  ├── Cloudinary     — family photos + Ken Burns montage video
  └── Backboard.io   — cross-session semantic memory
        │
   WebSocket → React Caregiver Dashboard (Vite + Tailwind)

Every captured frame is dispatched to all six modules simultaneously. Each module runs its own detection logic and fires independently, with cooldown timers that prevent the patient from being overwhelmed by overlapping audio.
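The cooldown gating described above can be sketched as a small per-module timer table; the module names and cooldown lengths here are assumptions for illustration:

```python
import time

class CooldownGate:
    """Per-module cooldown so independently firing modules never stack audio."""

    def __init__(self, cooldowns):
        # cooldowns: module name -> minimum seconds between firings
        self.cooldowns = cooldowns
        self.last_fired = {}

    def try_fire(self, module, now=None):
        """Return True (and record the firing) if the module may speak now."""
        now = now if now is not None else time.monotonic()
        last = self.last_fired.get(module)
        if last is not None and now - last < self.cooldowns.get(module, 0):
            return False
        self.last_fired[module] = now
        return True
```

A priority layer on top of this (not shown) would decide which module wins when two want to speak in the same instant.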

Memory is dual-layered: SQLite for fast local reads, and Backboard.io for semantic cloud memory. Instead of querying a database, the system asks "What has Dad been doing in the last hour?" and gets back natural language context used to personalise every response.
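The fast local layer can be sketched as a small SQLite event log whose recent rows feed the natural-language summary; the semantic cloud side (Backboard.io) is not sketched here, and this schema is an assumption, not the project's actual one:

```python
import sqlite3
import time

class LocalMemory:
    """Fast local layer: recent events in SQLite. The semantic cloud layer
    (Backboard.io) would be queried separately and is omitted here."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE events (ts REAL, kind TEXT, detail TEXT)")

    def log(self, kind, detail, ts=None):
        self.db.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (ts if ts is not None else time.time(), kind, detail),
        )

    def recent(self, since_s=3600, now=None):
        """Events from the last hour, newest first -- the raw material for a
        'What has Dad been doing?' style summary."""
        now = now if now is not None else time.time()
        rows = self.db.execute(
            "SELECT kind, detail FROM events WHERE ts >= ? ORDER BY ts DESC",
            (now - since_s,),
        )
        return rows.fetchall()
```

The returned rows would be stitched into the natural-language context that personalises each response.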


Challenges We Ran Into

Latency. Every AI call costs time. We used cheap local heuristics (motion detection, optical flow) as gates before expensive Gemini or ElevenLabs calls, keeping the system responsive without burning tokens on every frame.
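The cheap gate pattern can be illustrated with a pure-Python frame-difference check; real frames would be numpy arrays from OpenCV, and the thresholds here are placeholder values:

```python
def motion_fraction(prev, curr, pixel_delta=25):
    """Fraction of pixels that changed between two greyscale frames.

    Frames are flat lists of 0-255 ints here -- a stand-in for the
    numpy arrays the real pipeline would use.
    """
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) >= pixel_delta)
    return changed / len(curr)

def should_call_gemini(prev, curr, threshold=0.05):
    """Cheap local gate: only pay for a vision call when the scene moved."""
    return motion_fraction(prev, curr) >= threshold
```

Frames that fail the gate cost essentially nothing; only genuinely changed scenes reach the paid APIs.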

Glasses streaming. Meta glasses don't expose a clean video stream. We routed through a WhatsApp call captured by mss, and also built a direct H.264 WebSocket decoder via PyAV as a fallback. Both had failure modes we had to work around.

Keeping it human. The hardest challenge wasn't technical — it was tonal. Every message had to sound like it came from someone who loves the patient. We spent significant time tuning Gemini prompts and ElevenLabs voice profiles to land in the right register: calm, warm, specific, never clinical.

Concurrency. Six modules sharing a frame buffer, a database, and a WebSocket queue required careful async coordination: thread pool executors for the AI worker loop, asyncio for the API layer, and explicit guards on shared state.
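The coordination pattern above can be sketched with stdlib primitives: blocking module work runs in a thread pool via `run_in_executor`, the event loop stays free for the API layer, and a lock guards the shared frame buffer. Module names and the shape of the shared state are illustrative:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Shared state guarded explicitly, mirroring the frame-buffer pattern described.
latest_frame = {"data": None}
frame_lock = Lock()
executor = ThreadPoolExecutor(max_workers=6)  # roughly one worker per module

def blocking_module(name):
    """Stand-in for a CPU/network-heavy module (InsightFace, Gemini, ...)."""
    with frame_lock:
        frame = latest_frame["data"]
    return f"{name} processed {frame}"

async def dispatch_frame(frame):
    """Fan one frame out to every module without blocking the event loop."""
    with frame_lock:
        latest_frame["data"] = frame
    loop = asyncio.get_running_loop()
    tasks = [
        loop.run_in_executor(executor, blocking_module, name)
        for name in ("faces", "grounding", "activity", "wandering", "copilot")
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(dispatch_frame("frame_0001"))
```

`asyncio.gather` preserves submission order, so per-module results can be routed back deterministically even though the work ran in parallel.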


Accomplishments That We're Proud Of

  • End-to-end pipeline working: glasses stream → face recognised → whisper played in under 3 seconds
  • Memory montages that actually feel emotional — narrated, animated, appearing on the caregiver's screen within seconds of a face being seen
  • Six parallel modules running on a 2 FPS pipeline with no audio overlap, thanks to feature gates and priority logic
  • A caregiver dashboard that gives real visibility — live feed, event log, task controls, montage player, family setup — not just a notification screen

What We Learned

The hardest part of building assistive AI is not the models, it's the interaction design. A technically correct response that arrives at the wrong moment, in the wrong tone, does more harm than silence. Every decision in Jarvis-eh? was ultimately a question about what a person with dementia actually needs to hear, and when.

We also learned that the people who care for someone with dementia are often invisible. Jarvis-eh? is as much a tool for caregivers as it is for the person wearing the glasses.


What's Next for Jarvis Eh?

  • Family voice cloning: deliver redirects and grounding in the actual voice of a family member the patient recognises
  • Emotion detection: more nuanced intervention timing based on facial expression analysis
  • Longitudinal analytics: surface confusion frequency, activity patterns, and wandering heat maps for caregivers and medical teams
  • Offline resilience: lightweight local models for face recognition and grounding when there's no internet, which is a real-world home care requirement
