Inspiration

The inspiration for Chorus is personal. We have friends and community members who are cognitively brilliant yet trapped behind a physical barrier. We watched their muscles refuse to cooperate and their tools fail to keep up.

We saw brilliant minds forced to simplify their complex thoughts just to save energy.

This observation is backed by a startling statistic. While natural speech flows at roughly 150 words per minute, users of Augmentative and Alternative Communication (AAC) devices are typically limited to just 15. This 10x disparity is the "Rate Gap."

For millions of people spanning the spectrum from autism and cerebral palsy to ALS, this gap doesn't just slow down conversation. It fundamentally alters their identity. It forces them into a purely reactive role, waiting to be asked a question rather than initiating a thought.

We built Chorus to destroy the myth of the "Intellectual Gap." We realized the barrier isn't the user's mind. It is the latency of their tools. We wanted to build a system that didn't just help people speak, but helped them resonate.

What it does

Chorus is the world’s first Active AAC Operating System. Unlike traditional devices that are passive keyboards waiting for input, Chorus is an active participant in the conversation.

It uses a "Seven-Signal Engine" to triangulate the user's intent in real time. By listening to partner speech, checking the time of day, analyzing location, and recalling long-term memories, Chorus predicts exactly what the user wants to say before they even touch the screen.

Chorus operates in three distinct modes to serve different needs:

  • Text Mode: A predictive interface for literate users that uses "Smart Type" to turn single keystrokes into full, complex sentences (see the sketch after this list).
  • Pictorial Mode: A visual-first interface optimized for motor planning. It features "Infinite Icon" which uses DALL-E 3 to generate custom symbols on the fly, ensuring a user's vocabulary is never limited by a static library.
  • Spark Mode: Our "anti-passenger" feature. It analyzes the room and suggests relevant conversation starters, giving users the agency to initiate social interactions rather than just answering questions.
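
To make Smart Type concrete, here is a minimal TypeScript sketch of the core idea: a single keystroke plus conversational context is expanded into ranked, ready-to-speak sentences. The data shapes, sample library, and scoring weights below are illustrative assumptions, not the production engine.

```typescript
// Minimal sketch of Smart Type: one keystroke + context -> ranked full sentences.
// The types, sample library, and scoring weights are illustrative assumptions.

interface Suggestion {
  sentence: string;     // full sentence the user can speak with one tap
  topic: string;        // coarse topic tag, e.g. "food", "school"
  timesChosen: number;  // how often the user has picked this before
}

interface ChorusContext {
  activeTopic: string;  // topic inferred from the partner's speech
  keystroke: string;    // the single letter the user just typed
}

const library: Suggestion[] = [
  { sentence: "Pizza sounds great for dinner.", topic: "food", timesChosen: 12 },
  { sentence: "Pass me the paper, please.", topic: "school", timesChosen: 3 },
  { sentence: "Pasta, but with extra cheese!", topic: "food", timesChosen: 7 },
];

function smartType(ctx: ChorusContext, candidates: Suggestion[]): Suggestion[] {
  const key = ctx.keystroke.toLowerCase();
  return candidates
    // Filter: keep sentences containing a word that starts with the typed letter.
    .filter((s) => s.sentence.toLowerCase().split(/\s+/).some((w) => w.startsWith(key)))
    // Rank: boost matches on the active topic and on how often the user chooses them.
    .map((s) => ({ s, score: (s.topic === ctx.activeTopic ? 10 : 0) + s.timesChosen }))
    .sort((a, b) => b.score - a.score)
    .map(({ s }) => s);
}

// One keystroke ("p") during a food conversation surfaces "Pizza..." and "Pasta..." first.
console.log(smartType({ activeTopic: "food", keystroke: "p" }, library));
```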

Finally, Chorus solves "Identity Erasure." By selecting an emotional intent like Sarcastic, Excited, or Sorrowful, the AI generates audio that actually feels that way, restoring the nuance of the human spirit.

How we built it

Chorus is architected as a Multi-Agent "Web of AI" capable of orchestrating specialized models to replicate human cognition. The frontend is built with Next.js and TypeScript, serving as the housing for our Seven-Signal Brain.

Here is how the signals are processed:

  • Signal 1: The Ears (Listening). We utilize the Web Audio API and OpenAI Whisper for high-fidelity active listening. When toggled on, it transcribes the partner's speech to gain an immediate understanding of the conversation's context.
  • Signal 2: The Scheduler (Time). The system syncs with the user's daily itinerary. By comparing the current time of day to their planned schedule, Chorus validates suggestions to ensure they are relevant to the moment (e.g., prioritizing "Dinner" vocabulary at 6 PM).
  • Signal 3: The Memory (Brain). We integrated Backboard.io as a vector database for long-term object permanence. If a friend asks about "The Big Day," Chorus queries the history, remembers a graduation is coming up, and primes relevant replies.
  • Signal 4: The Frequency (Habits). Using MongoDB Atlas, we track the user's "Selection Velocity." If they choose "Sushi" more often than "Sandwich," the system learns that preference and ranks it higher to save keystrokes.
  • Signals 5 & 6: Grammar & Filtering (The Smart Type Engine). These signals work in tandem. Filtering ensures that if the context is "Food" and the user types "P," irrelevant words like "Paper" are removed in favor of "Pizza." Simultaneously, Grammar predicts the syntactic structure, determining if the user needs a noun or a verb next.
  • Signal 7: The Tone (Identity). To solve the "Robotic Voice" problem, we use ElevenLabs to inject emotion. By selecting a tone emoji—like "Happy" or "Angry"—the user can finally express how they feel, not just what they want.
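
To ground Signal 7, here is a rough TypeScript sketch of an emotional speech request against ElevenLabs' text-to-speech REST endpoint. The voice ID placeholder, the emotion-to-voice-settings mapping, and the specific setting values are assumptions made for illustration.

```typescript
// Sketch of emotional speech synthesis via the ElevenLabs text-to-speech REST API.
// The voice ID and the tone -> voice_settings mapping are illustrative assumptions.

type Tone = "happy" | "angry" | "sorrowful";

// Illustrative mapping from a tone emoji to voice settings (values are guesses, tune per voice).
const toneSettings: Record<Tone, { stability: number; style: number }> = {
  happy:     { stability: 0.3, style: 0.8 },
  angry:     { stability: 0.2, style: 0.9 },
  sorrowful: { stability: 0.7, style: 0.4 },
};

async function speakWithTone(text: string, tone: Tone): Promise<ArrayBuffer> {
  const voiceId = "YOUR_VOICE_ID"; // placeholder
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_multilingual_v2",
      voice_settings: {
        stability: toneSettings[tone].stability,
        similarity_boost: 0.75,
        style: toneSettings[tone].style, // higher style -> more expressive delivery
      },
    }),
  });
  if (!res.ok) throw new Error(`ElevenLabs request failed: ${res.status}`);
  return res.arrayBuffer(); // audio bytes, ready to play back in the browser
}
```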

All of this reasoning is synthesized by Gemini, which acts as the "Conductor" to ensure that the signals reach a consensus before offering a suggestion to the user.
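
As a rough sketch of that Conductor step, the snippet below folds the aggregated signals into a single Gemini call via the `@google/generative-ai` SDK. The `SignalBundle` shape, the model name, and the prompt wording are assumptions; the real pipeline does more than one pass.

```typescript
// Sketch of the "Conductor": fold the seven signals into one Gemini prompt.
// The SignalBundle shape, model name, and prompt wording are illustrative assumptions.
import { GoogleGenerativeAI } from "@google/generative-ai";

interface SignalBundle {
  partnerTranscript: string;   // Signal 1: latest Whisper transcript
  scheduleContext: string;     // Signal 2: e.g. "6 PM, dinner at home"
  recalledMemories: string[];  // Signal 3: vector-search hits from Backboard.io
  frequentPhrases: string[];   // Signal 4: high-velocity selections from MongoDB Atlas
  emotionalTone: string;       // Signal 7: tone the user selected
}

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

async function conduct(signals: SignalBundle): Promise<string[]> {
  const prompt = `You are the conductor for an AAC user.
Partner just said: "${signals.partnerTranscript}"
Schedule context: ${signals.scheduleContext}
Relevant memories: ${signals.recalledMemories.join("; ")}
Phrases this user favors: ${signals.frequentPhrases.join(", ")}
Desired tone: ${signals.emotionalTone}
Suggest 3 short sentences the user is most likely to want to say, one per line.`;

  const result = await model.generateContent(prompt);
  // Grammar and filtering (Signals 5 & 6) would further rank these before display.
  return result.response.text().split("\n").filter(Boolean);
}
```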

Challenges we ran into

Our biggest hurdle was the "API Explosion."

In our initial logic, the Seven-Signal Brain was firing independent API calls for audio, location, memory, and grammar every few seconds. We were eating through quotas and, ironically, introducing latency into a tool designed to remove it.

We had to fundamentally re-engineer the architecture from a "Live-Everything" model to an Intelligent Caching System:

  • Semantic Caching: We stopped sending every context change to the LLM. If the semantic environment, such as the topic or room, hadn't shifted past a set similarity threshold, we served high-confidence cached predictions.
  • Orchestrated Debouncing: We synchronized the signals so they wait for a specific trigger, like a completed sentence from a partner, to aggregate into a single, efficient multimodal prompt rather than firing asynchronously.
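
Here is a minimal sketch of both ideas, assuming an embedding of the current context is already available; the 0.9 similarity threshold, the quiet-window delay, and `fetchFreshPredictions` are illustrative placeholders rather than our exact values.

```typescript
// Sketch of the caching gate: only call the LLM when the semantic context has actually shifted.
// The 0.9 threshold, the 3-second quiet window, and fetchFreshPredictions are placeholders.

interface CacheEntry {
  contextEmbedding: number[]; // embedding of the context that produced these predictions
  predictions: string[];
}

let cache: CacheEntry | null = null;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

async function getPredictions(
  contextEmbedding: number[],
  fetchFreshPredictions: () => Promise<string[]>,
): Promise<string[]> {
  // Semantic caching: if the context hasn't drifted past the threshold, reuse the cached answer.
  if (cache && cosineSimilarity(cache.contextEmbedding, contextEmbedding) > 0.9) {
    return cache.predictions;
  }
  const predictions = await fetchFreshPredictions(); // single aggregated multimodal prompt
  cache = { contextEmbedding, predictions };
  return predictions;
}

// Orchestrated debouncing: signals accumulate quietly and only trigger the aggregated call
// when a sentence-final event arrives, or after a short quiet window.
let pendingTimer: ReturnType<typeof setTimeout> | null = null;
function onPartnerSpeech(endOfSentence: boolean, refresh: () => void) {
  if (pendingTimer) clearTimeout(pendingTimer);
  if (endOfSentence) refresh();                  // fire immediately on a completed sentence
  else pendingTimer = setTimeout(refresh, 3000); // otherwise wait for a pause in the conversation
}
```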

Accomplishments that we're proud of

  • Restoring Vocal Identity: Seeing the difference between a robotic "I am happy" and an ElevenLabs-generated, laughter-filled "I'm happy!" was a defining moment. It proved we could transmit emotion, not just data.
  • The "Keystroke Savings" Metric: We were able to demonstrate that a complex 15-input sentence could be initiated and completed in just 2 to 4 interactions using our predictive rails, achieving a 90% reduction in physical effort.
  • Infinite Icon: Successfully piping the meaning of a word missing from the built-in symbol library into DALL-E and getting back a perfectly usable, stylized AAC icon in seconds felt like magic. It ensures no user is ever silenced just because a developer forgot to draw an icon.
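
As a small illustration of that flow, the sketch below requests a new symbol from DALL-E 3 through the official `openai` Node SDK; the prompt template and style constraints are assumptions about how the icons could be kept visually consistent.

```typescript
// Sketch of Infinite Icon: turn a missing vocabulary word into a usable AAC symbol.
// The prompt template and style constraints are illustrative assumptions.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateIcon(word: string): Promise<string> {
  const response = await openai.images.generate({
    model: "dall-e-3",
    prompt:
      `A simple, flat, high-contrast AAC communication symbol representing "${word}". ` +
      `Bold outlines, single centered subject, plain background, no text.`,
    n: 1,
    size: "1024x1024",
  });
  // Return the hosted image URL so the new icon can drop straight into the pictorial grid.
  return response.data?.[0]?.url ?? "";
}
```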

What we learned

We learned that agency is technical.

A non-verbal user's passivity isn't usually a personality trait. It is a design flaw in their tools. By reducing the "effort tax" of speaking, we saw how technology can shift a user from a passenger to a driver.

On the backend, we learned the immense complexity of Multimodal Orchestration. Getting an LLM, a vector database, a voice synthesis engine, and a real-time audio transcriber to agree on a single output in under a second required us to treat "Latency" as our number one enemy.

What's next for Chorus

  • Offline Local Models: Currently, Chorus relies on the cloud. We plan to distill our reasoning engine into a smaller, on-device model like Gemini Nano to ensure users can communicate even without an internet connection.
  • Eye-Tracking Integration: We want to optimize the UI specifically for gaze-based interaction, using the AI to predict not just what words to show, but where on the screen to place them to minimize eye strain.
  • Hardware Agnostic Expansion: Our goal is to ensure Chorus runs flawlessly on the legacy tablets and specialized hardware that many patients already own, lowering the barrier to entry for everyone.

Built With

Next.js, TypeScript, Gemini, OpenAI Whisper, DALL-E 3, ElevenLabs, MongoDB Atlas, Backboard.io, Web Audio API