Inspiration
According to a 2020 study by Beukelman and Light, more than 97 million people worldwide need some form of AAC (Augmentative and Alternative Communication) and could benefit from it to enhance their communication and increase their participation in their communities.
Cat is our approach to connecting the unconnected.
AAC commonly comes in the form of an app that presents the user with, for example, icons (pictograms) representing words. This allows users to communicate without having to read, write, or speak. There is a variety of apps available for this, but they all share one issue (actually, more than one!): they are slow. Many require you to navigate through lots of submenus, since there are a lot of words, and even then you can't form more complex sentences. While the average human conversation runs at around 120-160 words per minute, the average AAC user manages only about 3-20 WPM. This is not what we have in mind when thinking of "connecting the unconnected".
Our approach is to use a large language model (LLM) to accelerate communication via AAC apps. LLMs are known for being good at understanding context, even across massive amounts of input, better than any human possibly can. We can use this to make users type less and say more, without ever having to speak.
What it does
Just like any other AAC app, the user inputs what they want to say using a dictionary of pictograms. The main difference is that with our app, you can say complex, natural sentences with usually no more than 3-4 inputs. Typing "I, apple" is enough to get the app to communicate your needs: "I want to eat an apple." This also works with complex phrases like "I, plane, USA, friend" -> "I'm going to fly to the USA to see a friend". That's cool, but it still requires you to find the words. How about using the context of the conversation to provide suggestions and answers to questions? We implemented exactly that. The conversation mode lets the user listen to the interlocutor and get the meaning from the pictogram representation of what was said. The app then provides the user with three quick-response options. These options are not just random, though. They take not only the previous conversation history into account, but also knowledge the LLM has about the user. That's right! Users can create their own profile, setting their name, age, likes, dislikes, etc., and the app takes this into account. Since everything has a pictogram representation, users can use the app without having to speak, hear, or even read.
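The pictogram-to-sentence expansion above can be sketched as a few-shot prompt. This is a minimal illustration, not the actual prompt the app uses; the example pairs are taken from the description, and the wrapper text is an assumption.

```python
# Hypothetical few-shot prompt builder for expanding pictogram keywords
# into a full sentence. The instruction wording is illustrative only.
FEW_SHOT_EXAMPLES = [
    ("I, apple", "I want to eat an apple."),
    ("I, plane, USA, friend", "I'm going to fly to the USA to see a friend."),
]

def build_expansion_prompt(pictogram_words):
    """Turn a short pictogram keyword sequence into a prompt asking the
    LLM to produce one natural-language sentence."""
    lines = ["Expand the pictogram keywords into one natural sentence."]
    for keywords, sentence in FEW_SHOT_EXAMPLES:
        lines.append(f"Keywords: {keywords}\nSentence: {sentence}")
    lines.append(f"Keywords: {', '.join(pictogram_words)}\nSentence:")
    return "\n\n".join(lines)

print(build_expansion_prompt(["I", "thirsty"]))
```

The few-shot pairs give the model the output format, so the reply can be used directly as the spoken sentence.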
How we built it
For the LLM in the background, we chose Google's Gemini (specifically the Gemini 1.5 Flash model) for a few simple reasons:
- it's cheap
- it's fast
- it just works™️
We send a JSON model of our request with a command (elaborate, suggest, etc.) to the LLM and get the corresponding JSON response back. We then just parse it and integrate it like any other API. To get the preferred output, we prompt-engineer the base model with a few-shot system prompt. This even lets us execute arbitrary instructions, like changing the language mid-conversation. The pictograms are provided by ARASAAC's API. For a more technical perspective, look through the source code (please don't, I just hacked this stuff together and it's probably the worst code I've ever written :sob:)
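The request/response envelope described above could look roughly like this. The field names ("command", "inputs", "text", "suggestions") are assumptions for illustration, not the project's actual schema, and the Gemini call is stubbed out.

```python
import json

# Illustrative JSON envelope for talking to the LLM. Field names are
# hypothetical; the real app's schema may differ.
def make_request(command, inputs, profile=None):
    """Serialize a command ("elaborate", "suggest", ...) plus the
    pictogram inputs and optional user profile into one JSON payload."""
    return json.dumps({
        "command": command,
        "inputs": inputs,
        "profile": profile or {},
    })

def parse_response(raw):
    """Parse the model's JSON reply: suggestion lists for "suggest",
    a single sentence otherwise."""
    data = json.loads(raw)
    if data["command"] == "suggest":
        return data["suggestions"]
    return data["text"]

# Stubbed round trip, standing in for the real Gemini call:
req = make_request("elaborate", ["I", "apple"])
fake_reply = json.dumps({"command": "elaborate",
                         "text": "I want to eat an apple."})
print(parse_response(fake_reply))  # -> I want to eat an apple.
```

Keeping both directions in JSON means the app-side integration is just serialization and parsing, like any other API client.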
Challenges we ran into
While the internet problems caused a lot of ups and downs, we eventually managed to overcome all challenges in a viable manner. One huge challenge/fear was latency. Since the prototype uses Gemini to generate responses in the cloud, we depend on the internet connection and Google's servers, potentially waiting a few seconds for a response. This would make our accelerator decelerate instead, which is not something we want. Thankfully, we were able to bulk-manage requests and strip everything down to the bare minimum, reducing the token count and the number of round trips.
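The bulk-managing idea can be sketched as follows: several pending commands go out as one payload, so the user waits for a single network round trip instead of several. This is a hypothetical helper, not the project's actual code, and the payload shape is an assumption.

```python
import json

# Hypothetical batching helper: collapse several pending commands into
# one request to cut round trips (and shared prompt overhead) to the LLM.
def batch_requests(pending):
    """pending: list of (command, inputs) tuples -> one JSON payload."""
    return json.dumps({"batch": [
        {"command": cmd, "inputs": inp} for cmd, inp in pending
    ]})

payload = batch_requests([("elaborate", ["I", "apple"]),
                          ("suggest", [])])
print(payload)
```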
Accomplishments that we're proud of
- Actually finishing the prototype!
- Having a project that solves a real world problem.
What we learned
- It's great to have a plan early on.
- It's worth staying awake for 48 hours straight.
What's next for Cat AIAAC
There are a lot of features we want to get right that sadly couldn't make their way into the prototype!
- Auto-complete for pictograms, ranking based on context (verb after subject, etc.)
- Offline use via a fine-tuned on-device LLM
- Adding visual capabilities to render webpages or forms, which would otherwise just be text, as pictograms
- a lot more!
Thank you for having us on board :)