Inspiration
A member of our team has a grandparent with dementia. Not late-stage — just the early, painful kind where they look at you, smile, and ask "sorry, who are you again?" every single time you visit. You know they love you. They just can't hold onto the information.
We looked for tools that could help. There's nothing. A few apps let caregivers manually enter information and show flashcard-style reminders — but dementia patients don't remember to open apps. They need something passive, something that works without any effort on their part.
Over 55 million people worldwide live with dementia, and that number is projected to hit 139 million by 2050. Beyond dementia, the same problem exists for people with face blindness (prosopagnosia), early-onset memory conditions, and even just people recovering from strokes or brain injuries. There is no real-time, frictionless tool that helps someone know who is standing in front of them. We decided to build it.
What it does
Faces runs on any device with a camera, microphone, and a browser. It watches and listens — simultaneously processing your live video feed and audio stream to build a complete understanding of who you're talking to and what's important.
When a face appears, Faces identifies them instantly. When they start talking, Faces listens.
Say "My name is Sarah" — the name appears on screen. Say "I'm your daughter" — the relationship badge updates live. Mention "your birthday party is next Saturday" — it gets captured as a key detail and pinned to that person's profile.
The core experience is that you just have a normal conversation, and Faces pulls out everything that matters:
- Names and introductions
- Relationships ("I'm your son", "I'm your nurse")
- Upcoming appointments, events, and dates
- Family members mentioned in passing
- Anything emotionally significant
All of this is extracted purely from natural conversation — no manual data entry, no forms, no setup. And every detail persists across sessions. The next time that person walks in, everything Faces learned about them is right there, instantly.
How we built it
The entire experience runs in the browser with a Django backend — no installs, no native dependencies, fully cross-platform.
Face Detection & Recognition
Face recognition runs entirely client-side at full frame rate using face-api.js with SSD MobileNet detection and a 128-dimensional face descriptor model. The system provides continuous, uninterrupted tracking — it doesn't matter if the person moves, turns their head, walks across the room, or steps out and comes back. Their identity follows them. It can handle multiple faces simultaneously in a single frame, identifying and labeling each person independently with their own name, relationship, and memory context. Each face gets its own overlay with its own persistent data, all updated in real time.
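The matching step can be sketched as nearest-neighbor search over descriptors. This is a Python illustration of the idea, not the project's actual client-side code (which lives in face-api.js); the 0.6 threshold is face-api.js's commonly used default, and `best_match` and `MATCH_THRESHOLD` are names we've chosen for illustration.

```python
import math

# Sketch of descriptor matching: each enrolled person has a stored
# 128-dimensional descriptor; a live face is matched to the closest
# one by Euclidean distance, accepted only under a threshold.
MATCH_THRESHOLD = 0.6  # illustrative; face-api.js's common default

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(descriptor, enrolled):
    """enrolled: dict of name -> stored 128-float descriptor."""
    best_name, best_dist = None, float("inf")
    for name, stored in enrolled.items():
        d = euclidean(descriptor, stored)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist <= MATCH_THRESHOLD:
        return best_name, best_dist
    return None, best_dist  # no match -> treat as a new, unknown person
```

A face that matches nobody under the threshold is enrolled as a new person, which is how identities accumulate without any manual setup.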
Speech-to-Text
Raw PCM audio is streamed from the browser's AudioContext over WebSocket in Float32 chunks. On the backend, we built a custom audio gate with silence detection and a tuned rolling buffer that ensures Whisper only processes actual speech segments. The buffer threshold is calibrated for responsiveness — transcription results come back while the conversation is still flowing, not seconds after someone finishes talking.
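The gate's behavior can be sketched as a small state machine: accumulate speech chunks, and flush a segment to Whisper once a run of silence closes it. The class name, RMS threshold, and hangover count below are illustrative stand-ins, not the project's tuned values.

```python
import numpy as np

# Sketch of the audio gate: buffer Float32 PCM chunks while speech is
# present, and emit a complete segment once enough consecutive silent
# chunks arrive. Real thresholds were tuned by hand for responsiveness.
SILENCE_RMS = 0.01      # RMS energy below this counts as silence
SILENCE_CHUNKS = 8      # consecutive silent chunks that close a segment

class AudioGate:
    def __init__(self):
        self.buffer = []
        self.silent_run = 0

    def feed(self, chunk: np.ndarray):
        """Feed one Float32 chunk; return a speech segment when one closes."""
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        if rms < SILENCE_RMS:
            self.silent_run += 1
        else:
            self.silent_run = 0
            self.buffer.append(chunk)
        if self.buffer and self.silent_run >= SILENCE_CHUNKS:
            segment = np.concatenate(self.buffer)
            self.buffer, self.silent_run = [], 0
            return segment  # hand this off to Whisper
        return None
```

Because segments close on short silences rather than on a fixed timer, Whisper sees phrase-sized chunks and the transcript keeps pace with the conversation.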
AI Context Engine
This is where Faces goes from a face recognition tool to a genuine memory aid. Every transcription chunk is analyzed by our AI pipeline through the Zen API, running two extraction tasks in parallel using asyncio:
Identity extraction — the model detects name introductions and relationship declarations from natural speech patterns. It understands "My name is John", "Call me doc", "I'm your grandson" and dozens of other natural phrasings without any hardcoded pattern matching.
Key information extraction — this is the part we think is genuinely novel. The model receives the full current memory state of the person alongside the new transcription, and returns an intelligently curated, updated memory list. It doesn't just blindly append — it understands context. If someone corrects their birthday, the old one gets replaced. If a detail is trivial small talk, it gets filtered out. The model acts as an automated context manager, deciding what's worth remembering and what isn't, maintaining a clean, relevant memory profile that a dementia patient would actually benefit from. All of this runs in the backend with zero manual intervention.
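The shape of that pipeline looks roughly like the sketch below. The function names are ours, and `call_llm` stands in for the real request to the Zen API's OpenAI-compatible endpoint; the point is that the two extraction tasks run concurrently with `asyncio.gather` so neither blocks the other.

```python
import asyncio

# Illustrative shape of the context engine: per transcription chunk,
# identity extraction and key-info extraction are launched in parallel.
async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0)   # placeholder for the real network round-trip
    return "{}"              # the real call returns the model's JSON

async def extract_identity(transcript: str) -> str:
    return await call_llm(f"Extract any name or relationship from: {transcript}")

async def extract_key_info(transcript: str, memory: list) -> str:
    return await call_llm(
        f"Current memory: {memory}\nNew speech: {transcript}\n"
        "Return the updated, curated memory list."
    )

async def process_chunk(transcript: str, memory: list):
    # Both LLM calls are in flight at once; total latency is max, not sum.
    identity, info = await asyncio.gather(
        extract_identity(transcript),
        extract_key_info(transcript, memory),
    )
    return identity, info
```

Running the two calls in parallel means a chunk's total latency is the slower of the two requests rather than their sum.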
Backend & Persistence
Django + Django Channels running under Daphne for full async WebSocket support. The Zen API handles all LLM inference through an OpenAI-compatible endpoint. Face descriptors, names, relationships, and the curated memory list are all stored in SQLite and survive server restarts. A person's identity and everything Faces learned about them is permanent — close the app, reboot the machine, it all comes back.
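The persisted shape of a person looks roughly like the following. This sketch uses `sqlite3` directly for self-containment, whereas the project itself goes through the Django ORM; the table and column names are illustrative.

```python
import json
import sqlite3

# Rough shape of what survives restarts: the 128-float face descriptor
# and the curated memory list are serialized as JSON text columns.
def init_db(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS person (
            id INTEGER PRIMARY KEY,
            name TEXT,
            relationship TEXT,
            descriptor TEXT,   -- JSON-encoded 128-d face descriptor
            memories TEXT      -- JSON-encoded curated memory list
        )""")

def save_person(conn, name, relationship, descriptor, memories):
    conn.execute(
        "INSERT INTO person (name, relationship, descriptor, memories) "
        "VALUES (?, ?, ?, ?)",
        (name, relationship, json.dumps(descriptor), json.dumps(memories)),
    )
    conn.commit()

def load_people(conn):
    rows = conn.execute(
        "SELECT name, relationship, descriptor, memories FROM person"
    )
    return [
        {"name": n, "relationship": r,
         "descriptor": json.loads(d), "memories": json.loads(m)}
        for n, r, d, m in rows
    ]
```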
Challenges we ran into
The hardest problem was a race condition between the face recognition and speech pipelines. When someone said "I'm your father", the speech handler would correctly update the relationship to "Father" — but an LLM call that started a few seconds earlier from the face pipeline would finish at the same moment and overwrite it back to "Known Person". We solved this by having the face handler re-read the database after its API call returns, and only writing if the speech handler hadn't already set a more accurate value.
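The guard boils down to a check-before-write after the slow call returns. This is a simplified sketch with a dict standing in for the database; `apply_face_result` is our illustrative name, and "Known Person" is the generic fallback label described above.

```python
# Sketch of the overwrite guard: after the face pipeline's LLM call
# finally returns, re-read the stored relationship and refuse to
# replace a specific value with the generic fallback.
GENERIC = "Known Person"

def apply_face_result(db: dict, person_id: int, proposed: str) -> str:
    current = db.get(person_id, GENERIC)   # re-read AFTER the API call
    if current != GENERIC and proposed == GENERIC:
        return current                     # speech handler already won
    db[person_id] = proposed
    return proposed
```

The key design choice is that the stale pipeline defers: a late generic result never clobbers a fresh specific one, while genuinely new specific results still get written.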
Getting the audio pipeline to feel truly real-time was another major challenge. Our initial buffer configuration meant Whisper wouldn't see partial sentences, so you'd finish talking and wait several seconds for any response. We iterated on the silence detection thresholds, buffer sizes, and rolling tail parameters until the transcription felt conversational rather than delayed.
The AI context management took real work to get right. Our first approach just accumulated every detail, which led to contradictions — "Birthday is February 29th" and "Birthday is October 5th" sitting right next to each other. We redesigned the extraction pipeline so the model treats the memory list as a living document it actively manages, replacing outdated entries and filtering noise rather than blindly accumulating.
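The replace-rather-than-append behavior the model is prompted to follow can be illustrated deterministically. In the real system the LLM judges which entries cover the same topic; the first-two-words heuristic below is only a toy stand-in for that judgment.

```python
# Toy illustration of the curation behavior: a new fact about an
# existing topic replaces the stale entry instead of sitting beside
# it. Topic matching by the first two words is a deliberate
# simplification; the LLM makes this call in the actual pipeline.
def curate(memory: list, new_fact: str) -> list:
    topic = " ".join(new_fact.split()[:2]).lower()
    kept = [m for m in memory if not m.lower().startswith(topic)]
    return kept + [new_fact]
```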
Accomplishments that we're proud of
The potential use cases of our work are what we care most about. The statistics are staggering and underserved:
- 55 million people worldwide live with dementia, expected to reach 139 million by 2050
- 10 million people have prosopagnosia (face blindness) — they literally cannot recognize faces, even of family members
- 2 million people in the US alone live with aphasia — difficulty recalling names and words, typically after a stroke
- Nearly 40% of people over 65 experience some form of age-associated memory impairment
- Over 11 million Americans provide unpaid care for someone with dementia
Every single one of these groups could benefit from a tool that passively shows them who someone is and what matters. This isn't a niche product — it's a tool for hundreds of millions of people and their caregivers, and nothing like it exists.
Running a complete computer vision + speech processing + LLM pipeline in the browser and a single backend process, with no cloud vision API dependencies, is something we're genuinely proud of. The zero-setup experience — open a URL, allow camera and mic, and it just works — means the people who need it most don't need to be technical to use it.
The moment that sold us on this project: within a five-minute test conversation, Faces had extracted a name, a relationship, a birthday, an upcoming appointment, and a family member's name — all without anyone doing anything except talking naturally. That felt like something that could actually help someone.
What we learned
Working with Whisper for real-time audio was genuinely interesting. It's an incredible model, but it was designed for batch transcription, not live streaming — adapting it to work on rolling chunks of conversational audio with low latency required a lot of experimentation and creative buffer management. Getting real-time information extraction right is tricky but really rewarding. There's a huge difference between something that works on a clean demo sentence and something that works on messy, real human conversation with filler words, interruptions, and corrections.
We also explored some novel libraries we hadn't used before — face-api.js for browser-native face recognition, Django Channels with Daphne for async WebSocket handling, and the WebAudio API for raw PCM streaming. Stitching all of these together into a coherent real-time pipeline where video, audio, and AI inference all run concurrently was challenging but genuinely one of the more interesting systems we've built.
What's next for Abacus-Faces
AR glasses integration is the obvious next step. Running Faces on smart glasses so someone with dementia can walk around and have context appear in their actual field of view — no screen to check, no app to open, just natural vision augmented with memory.
Beyond that: multi-language support for non-English speakers, a caregiver dashboard so family members can proactively add context before visits, emotion and mood detection to track how interactions affect the patient, and on-device LLM inference to eliminate the API dependency and make the whole system work offline.
Built With
- daphne
- django
- face-api.js
- html5
- openai
- python
- sqlite
- vanillajavascript
- webaudio
- websockets
- whisper