Inspiration

We have too much stuff in our household... duplicates, old things, hand-me-downs of unknown size, the same thing stored in several places.

I wanted to get a handle on all of it, starting with the simple question: what exactly do we own? There are examples of people using image recognition for this, but that approach is incomplete and doesn't answer the follow-up question: what is each thing for?

To answer this I wanted to build a tool that makes it easy to capture that inventory, plus the tools to work with the resulting database.

What it does

Record: Say the start phrase ("start catalog") and then say where you are ("I'm in the living room"), and begin listing the items ("On the bookshelf there's a box of crayons, two half-used coloring books, and for some reason a light bulb"). Then say the stop phrase ("stop catalog").

Structure: When you stop, the AI takes your transcript and structures it into a database. Using the context, it turns your conversational description of items into rows (name/description/location). The original transcript is kept with each item in case the extraction misses something!
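
For a sense of what each item record carries, here's a minimal sketch of a row; the field names are illustrative, not the exact schema:

```typescript
// Hypothetical shape of one catalog row; field names are illustrative.
interface CatalogItem {
  id: number;
  name: string;             // "box of crayons"
  description: string;      // "two half-used coloring books"
  location: string;         // "living room / bookshelf"
  sourceTranscript: string; // the original spoken text, kept for reference
  createdAt: Date;
}
```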

Organize: Go to the Memory Atlas website, associate it with your Omi account, and you can see and manipulate your database. A chat-based interface lets you ask questions, annotate, and categorize the database.

Query: Use the wake phrase "ask catalog", then speak a question and finish with a stop phrase ("done", "stop", or "complete"); the AI will answer your question using the database you've created.

How we built it

Tech Stack: It's built on Next.js/Vercel functions, a Postgres database, and the GPT-4o API.

Transcript intake: The webhook keeps a running log of your transcript (in the database), watching for the activation phrases (start catalog/stop catalog, ask catalog/done, hello ####). When an actionable chunk is found it either stores it or responds to it, and marks the logged chunks as "processed".
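
As a rough sketch, written as a Next.js app-route handler (the payload fields and helper functions are assumptions, not the exact implementation), the intake loop looks something like this:

```typescript
// A minimal sketch of the transcript-intake webhook as a Next.js route handler.
// Payload fields (session_id, segments) and the helpers below are assumptions.
import { NextResponse } from "next/server";

type Segment = { text: string };

export async function POST(req: Request) {
  const { session_id, segments } = (await req.json()) as {
    session_id: string;
    segments: Segment[];
  };

  await appendToLog(session_id, segments);              // running log in Postgres

  // Scan the unprocessed log for a complete, actionable chunk,
  // e.g. everything between "start catalog" and "stop catalog".
  const chunk = await findActionableChunk(session_id);
  if (!chunk) return NextResponse.json({});             // nothing actionable yet

  const reply = await handleChunk(chunk);               // store inventory rows or answer a query
  await markProcessed(chunk);

  // Anything returned here becomes the notification the user sees.
  return NextResponse.json(reply ? { message: reply } : {});
}

// Stubs standing in for the real database/LLM code.
async function appendToLog(sessionId: string, segments: Segment[]) {}
async function findActionableChunk(sessionId: string): Promise<string | null> { return null; }
async function handleChunk(chunk: string): Promise<string | null> { return null; }
async function markProcessed(chunk: string) {}
```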

Transforming to structured data: Transcripts captured between "start catalog" and "stop catalog" are fed through GPT to structure them, then inserted into the database.
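
A minimal sketch of that step, assuming the official openai SDK; the prompt wording and row fields are illustrative:

```typescript
// Sketch: turn a conversational transcript into inventory rows via GPT-4o.
import OpenAI from "openai";

const openai = new OpenAI();

async function structureTranscript(transcript: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Extract a household inventory from the transcript. " +
          'Return JSON of the form {"items": [{"name": ..., "description": ..., "location": ...}]}. ' +
          "Infer each item's location from context (e.g. \"I'm in the living room\").",
      },
      { role: "user", content: transcript },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? '{"items": []}').items;
}
```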

Authentication: For the website, authentication is handled through an activation step: the user is asked on the website to repeat a phrase like "hello 2039", and when a transcript comes in matching that phrase the browser is associated with the Omi user who spoke it. This creates a very simple (though not yet well-secured) pairing process. (It's a hackathon, so I punted some on the security!)
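
Here's a minimal in-memory sketch of that pairing flow (the real version keeps this state in Postgres; the function names are illustrative):

```typescript
// In-memory sketch of phrase-based pairing between a browser session and an Omi user.
const pendingPairings = new Map<string, string>(); // phrase -> browser session id
const pairedUsers = new Map<string, string>();     // browser session id -> Omi user id

// 1. The website requests a phrase to show the user ("please say: hello 2039").
function startPairing(browserSessionId: string): string {
  const phrase = `hello ${Math.floor(1000 + Math.random() * 9000)}`;
  pendingPairings.set(phrase, browserSessionId);
  return phrase;
}

// 2. The transcript webhook checks each incoming chunk for a pending phrase and,
//    on a match, associates that browser session with the Omi user who spoke it.
function confirmPairing(omiUserId: string, transcript: string): boolean {
  for (const [phrase, sessionId] of pendingPairings) {
    if (transcript.toLowerCase().includes(phrase)) {
      pairedUsers.set(sessionId, omiUserId);
      pendingPairings.delete(phrase);
      return true;
    }
  }
  return false;
}
```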

Chat: For chat we give the LLM a set of tools it can use (in addition to its regular text responses). Some tools directly query or change the database; others produce changes that the user confirms before they're applied. The LLM can also see the part of the database currently on the user's screen, so the user can refer to anything they see and the chat knows what's being referred to.
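
A sketch of what those tool definitions might look like in the OpenAI tools format; the tool names and parameters are illustrative, not the exact Memory Atlas set:

```typescript
// Illustrative chat tool definitions: one direct query tool, one tool whose
// proposed edits are shown to the user for confirmation before being applied.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "query_items",
      description: "Search the inventory by text and/or location.",
      parameters: {
        type: "object",
        properties: {
          search: { type: "string" },
          location: { type: "string" },
        },
      },
    },
  },
  {
    type: "function" as const,
    function: {
      name: "update_items",
      description:
        "Propose edits to items (rename, recategorize, delete). Edits are shown " +
        "to the user for confirmation before being applied.",
      parameters: {
        type: "object",
        properties: {
          ids: { type: "array", items: { type: "number" } },
          changes: { type: "object" },
        },
        required: ["ids", "changes"],
      },
    },
  },
];
```

The rows currently visible in the UI are also passed along with the prompt, which is how a reference like "the light bulb I'm looking at" resolves to a concrete item.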

Challenges we ran into

Voice triggers: Start/stop phrases are quite awkward! But the combination of the notification (which has to be in response to a transcript), the partial transcripts, and the absence of webhook calls during silence makes it difficult to use a wake phrase alone. For the actual data input (listing inventory by voice) this is fine; I want that mode to be sticky. For other use cases it's not as easy. The lack of usable timestamps also makes it hard to tell when the transcript came in and when there are pauses.

LLM: GPT has very specific ideas about how tools should work, and essentially expects tools to be executed synchronously, usually to query information. This can be smoothed over with filler responses and rearranging the chat history, but it all feels awkward.

Voice recognition: The speech recognition is pretty good, but for cataloging especially it would be better if the audio could be run through a higher-latency, higher-quality transcription like Whisper. I'd even use Whisper's prompting, given some domain-specific knowledge about what's being transcribed.

Accomplishments that we're proud of

  1. Converting voice transcripts to inventory/structured data works really well
  2. The pairing process feels clever (too clever?! Maybe!)
  3. The tool use in the database management is potentially pretty nice, though still a very immature implementation

What we learned

  1. Voice endpointing with the Omi API is pretty hard (that is: determining the start and stop of a command, utterance, or segment)
  2. Structuring voice transcripts using an LLM works pretty well

What's next for Memory Atlas

  1. I want to add photo support, so it can take in a combination of timestamped transcripts and timestamped photos and do the extraction over both
  2. The prompt engineering can definitely use work
  3. Everything to do with multiple users is scrappy and incomplete
  4. I want to add "projects", so instead of just collecting one type of data (e.g., household inventory) you can switch between multiple kinds of data input. Probably this would be an addition to the wake phrase, like "start cataloging photos" or "start cataloging household items" to switch projects.
  5. Add some technique to let the LLM scan over all individual items... with easily 1000 items in a database, if you want the LLM to specifically scan and analyze individual items you can't just give it the entire dump. But the LLM is also great at something like categorizing individual items given criteria, and one LLM prompt can take maybe dozens of rows. So I'd like a tool that the LLM can invoke that itself launches an LLM task over the rows (see the sketch after this list).
  6. More "large" LLM tasks over the dataset. For example, if this is used to make a personal historical archive (e.g., looking through papers and photos and describing them) it might be nice to cap the whole thing off with a site-builder.
  7. Some formal sense of triage. For instance, if you want to decide which household items to keep and which to give away, you might give criteria, then review the criteria, review cases the LLM couldn't decide, etc. There's a workflow in there that could be supported as a first-class idea.
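
A sketch of item 5 above, as a batched scan the chat LLM could invoke; the batch size and the per-batch helper are illustrative:

```typescript
// Sketch: fan an LLM categorization task out over the whole inventory in
// small batches, since one prompt can only handle a few dozen rows at a time.
type Item = { id: number; name: string; description: string };

async function scanItems(
  items: Item[],
  criteria: string,
  categorizeBatch: (batch: Item[], criteria: string) => Promise<Record<number, string>>,
  batchSize = 25
): Promise<Record<number, string>> {
  const results: Record<number, string> = {};
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // categorizeBatch would wrap one GPT call that labels each row against the criteria.
    Object.assign(results, await categorizeBatch(batch, criteria));
  }
  return results;
}
```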

Built With

Next.js / Vercel functions, Postgres, GPT-4o