Inspiration
A few weeks ago, we saw a viral reel of an AI assistant coaching a player in real-time during a Valorant match. It was able to understand the game state, call out enemy positions, and react to situations instantly. It was really cool, but we thought: why limit this to a single game? We wanted to build something bigger.
What if an AI could understand spatial relationships across any environment—not just during one continuous session, but across multiple observations over time? Imagine an AI that remembers "the exit sign is above the window" even after you leave the room, then uses that knowledge hours later to navigate the same space or help someone else. An assistant that builds a persistent mental map (graph) of spaces and can react intelligently to any query about those relationships in real-time. We kept this idea in mind and decided to bring it to life!
What it does
Perception is a mobile app that gives LLMs real-time perception (haha) through camera and microphone input, with spatial awareness from a custom graph database. Most AI systems see -> act -> forget. Our system sees -> structures -> remembers -> reasons -> reuses.
Its core features include:
- Real-time camera feed: continuous frame capture and analysis using GPT-4V
- Audio input & transcription: speech-to-text with the Whisper API, plus Voice Activity Detection
- Spatial graph memory: in-memory graph storing detected objects and their relationships (left_of, in_front_of...)
- Graph visualizer: interactive SVG diagram of spatial relationships with relationship-aware node positioning
- Accessible spoken walkthrough: TTS reads the graph structure aloud, with synchronized closed captions throughout
- Voice personas: multiple TTS voices to customize LLM persona (friendly, serious, silly, flirty)
- MCP server: RESTful tool endpoints that let the LLM query the spatial graph
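At the core is that in-memory graph. Here's a minimal TypeScript sketch of the idea — the names, the relation set, and the dedup rule are illustrative simplifications, not our actual code:

```typescript
// Illustrative sketch of the spatial graph: edges carry a confidence score
// and a timestamp, duplicates keep the higher-confidence observation, and
// stale edges are auto-pruned.
type Relation = "left_of" | "right_of" | "above" | "below" | "in_front_of" | "behind";

interface Edge {
  from: string;
  to: string;
  relation: Relation;
  confidence: number; // 0..1, reported by the vision model
  lastSeen: number;   // ms timestamp, for temporal awareness
}

class SpatialGraph {
  private edges = new Map<string, Edge>();

  addRelation(from: string, to: string, relation: Relation, confidence: number): void {
    const key = `${from}|${relation}|${to}`;
    const existing = this.edges.get(key);
    if (!existing || confidence >= existing.confidence) {
      // New edge, or a stronger observation of a known one: overwrite.
      this.edges.set(key, { from, to, relation, confidence, lastSeen: Date.now() });
    } else {
      // Weaker duplicate: just refresh the timestamp so it isn't pruned.
      existing.lastSeen = Date.now();
    }
  }

  // e.g. query("above", "window") asks "what is above the window?"
  query(relation: Relation, to: string): string[] {
    return [...this.edges.values()]
      .filter(e => e.relation === relation && e.to === to)
      .map(e => e.from);
  }

  // Drop edges not re-observed within maxAgeMs.
  prune(maxAgeMs: number): void {
    const now = Date.now();
    for (const [key, e] of this.edges) {
      if (now - e.lastSeen > maxAgeMs) this.edges.delete(key);
    }
  }
}
```

Keying edges by `from|relation|to` is what makes deduplication a single map lookup instead of a scan.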
The implications are endless! The most obvious is robotics: right now, many robots struggle with semantic understanding and don't retain long-term contextual memory well. Our system adds semantic spatial memory, language-queryable world models, and persistent object relationships, so robots could reorient themselves faster, reason about absent objects, and collaborate with humans using natural language. Our system could also benefit interior design + architecture (huge sleeper use case), accessibility + cognitive prosthetics, gaming + mixed reality, and security + compliance fields.
How we built it
Our Process:
- Brainstorming: camera input -> vision API -> spatial graph (nodes + edges??) -> LLM queries. Thinking about a mobile frontend, backend graph server, and API integrations...
- Set up the React Native mobile app and Node.js backend in parallel to maximize our time
- Built the spatial relationship parser incrementally: first hardcoded examples, then integrated GPT-4V + refined the prompt based on real camera frames
- Instead of storing raw transcripts, we invested early in a custom in-memory graph with confidence scoring, deduplication, and temporal awareness
- Added closed captions + voice personas early -> when we ran out of free ElevenLabs credits, we pivoted to a smart fallback (caption cycling) rather than removing TTS entirely
- Tested with live camera feeds of our workspace, and debugged edge cases (overlapping objects, low-light, fast movement)
- Built the SVG visualizer and text-to-speech walkthrough together so they stayed in sync
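The spatial relationship parser step boils down to a prompt plus defensive JSON extraction. This is a simplified stand-in — the real prompt went through many iterations against live frames, and `parseRelations` / `ChatClient` are illustrative names, not our actual code:

```typescript
// Simplified sketch of the frame -> relationships step.
const EXTRACTION_PROMPT =
  'List the objects you can actually see and their spatial relationships as a JSON array: ' +
  '[{"from": "...", "relation": "left_of|right_of|above|below|in_front_of|behind", ' +
  '"to": "...", "confidence": 0.0-1.0}]. Do not guess at objects outside the frame.';

interface RawRelation { from: string; relation: string; to: string; confidence: number; }

// Structural type for the slice of the OpenAI SDK client we use
// (in the real app this comes from `new OpenAI()` in the "openai" package).
type ChatClient = {
  chat: { completions: { create: (args: object) =>
    Promise<{ choices: { message: { content: string | null } }[] }> } };
};

// Models often wrap JSON in prose or markdown fences; pull out just the array.
function parseRelations(reply: string): RawRelation[] {
  const start = reply.indexOf("[");
  const end = reply.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  try {
    const parsed = JSON.parse(reply.slice(start, end + 1));
    return Array.isArray(parsed)
      ? parsed.filter((r) => r.from && r.relation && r.to)
      : [];
  } catch {
    return []; // one bad frame should not crash the capture loop
  }
}

async function analyzeFrame(client: ChatClient, base64Jpeg: string): Promise<RawRelation[]> {
  const res = await client.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: EXTRACTION_PROMPT },
        { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` } },
      ],
    }],
  });
  return parseRelations(res.choices[0].message.content ?? "");
}
```

Returning an empty list on malformed output is what let us keep the capture loop running even when a frame confused the model.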
We used a variety of tools in our tech stack:
- Frontend: React Native + Expo (for iOS/Android cross-platform)
- Backend: Node.js + Express (MCP server running on port 3001)
- Spatial memory: custom in-memory graph using confidence scoring, temporal tracking, and auto-pruning
- APIs: OpenAI (GPT-4V vision, Whisper transcription, LLM chat), ElevenLabs (TTS), Google Cloud Speech-to-Text
- Visualization: React-Native-SVG for dynamic graph rendering
- LLM integration: direct OpenAI function calls through tool endpoints
Steps to using the app:
- Mobile app continuously captures frames and audio
- GPT-4V analyzes frames for spatial relationship extraction
- Relationships + objects are sent to MCP server graph via REST
- LLM queries spatial memory to answer contextual questions
- ElevenLabs speaks the LLM's responses aloud, replying in real time or describing the user's surroundings
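On the server side, each tool the LLM can call is just a REST endpoint over the graph. Keeping the handler a pure function (all names here are illustrative, not our actual code) makes the Express wiring one line:

```typescript
// Sketch of one MCP-style tool endpoint. With Express the wiring is just:
//   app.get("/tools/query_graph", (req, res) => res.json(handleQuery(edges, req.query)));
//   app.listen(3001);

interface GraphEdge { from: string; relation: string; to: string; confidence: number; }

// Shape of the JSON the LLM's tool call gets back.
interface QueryResult { matches: string[]; count: number; }

// Both filters are optional, so the LLM can ask broad or narrow questions.
function handleQuery(
  edges: GraphEdge[],
  params: { relation?: string; to?: string },
): QueryResult {
  const matches = edges
    .filter(e =>
      (!params.relation || e.relation === params.relation) &&
      (!params.to || e.to === params.to))
    .map(e => e.from);
  return { matches, count: matches.length };
}
```

Separating the handler from the framework also made it trivial to unit-test without spinning up the server.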
Challenges we ran into
- Balancing vision API cost against speed was tough, so we implemented frame sampling (one frame per second)
- Objects with similar names sometimes merged unexpectedly; confidence scoring and fuzzy label matching solved this
- Getting the SVG rendering to match real-world placements was hard, including nodes overlapping in the visualizer (which we fixed with relationship-aware nudging + boundary clamping)
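The fuzzy label matching can be as simple as token-subset comparison. A hypothetical version (our actual rules were a bit more involved):

```typescript
// Stops near-duplicate labels ("mug" vs "coffee mug") from becoming two nodes.
function normalize(label: string): string {
  return label.toLowerCase().trim().replace(/\s+/g, " ");
}

// Labels match when one label's word set is contained in the other's.
function labelsMatch(a: string, b: string): boolean {
  const ta = new Set(normalize(a).split(" "));
  const tb = new Set(normalize(b).split(" "));
  const [small, big] = ta.size <= tb.size ? [ta, tb] : [tb, ta];
  return [...small].every(t => big.has(t));
}

// Merge an incoming detection into the known labels, preferring the more
// specific (longer) name.
function mergeLabel(known: string[], incoming: string): string[] {
  const i = known.findIndex(l => labelsMatch(l, incoming));
  if (i === -1) return [...known, incoming];
  const keep = known[i].length >= incoming.length ? known[i] : incoming;
  return known.map((l, j) => (j === i ? keep : l));
}
```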
Accomplishments that we're proud of
- Relationship-aware visualization: having nodes actually positioned relative to their spatial relationships (ex. left_of -> physically left in the visual)
- Accessible design: closed captions + voice personas
- MCP server integration: opens doors to multi-LLM support including Claude and Gemini
- No hallucination: the graph contains only what the vision API actually detected, so the LLM is grounded in real perception
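The relationship-aware positioning boils down to mapping each relation to a screen offset, nudging on collision, and clamping to the canvas. A simplified sketch — the offsets, canvas size, and function names are made up for illustration:

```typescript
interface Point { x: number; y: number; }

const W = 320, H = 480, STEP = 80; // illustrative canvas size and spacing

// left_of means physically left on screen, above means physically up, etc.
const OFFSETS: Record<string, Point> = {
  left_of:  { x: -STEP, y: 0 },
  right_of: { x:  STEP, y: 0 },
  above:    { x: 0, y: -STEP },
  below:    { x: 0, y:  STEP },
};

function clamp(v: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, v));
}

// Place a node relative to its anchor, nudge down until it stops overlapping
// already-placed nodes, then clamp to the SVG bounds.
function place(anchor: Point, relation: string, placed: Point[]): Point {
  const off = OFFSETS[relation] ?? { x: 0, y: 0 };
  let p: Point = { x: anchor.x + off.x, y: anchor.y + off.y };
  const collides = (q: Point) =>
    placed.some(o => Math.abs(o.x - q.x) < 40 && Math.abs(o.y - q.y) < 40);
  while (collides(p)) p = { x: p.x, y: p.y + 40 };
  return { x: clamp(p.x, 20, W - 20), y: clamp(p.y, 20, H - 20) };
}
```

Clamping last means a crowded corner can still overlap slightly, but the diagram always stays inside the viewport.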
What we learned
- LLMs perform much better when questions are grounded in actual perception and not hallucinations
- Structured spatial relationships are much more useful compared to unstructured transcriptions
- It's important to consider accessibility first! Captions, visualizations, and voice personas unlock usability for far more people than audio alone
- Accuracy is more important than adding unreliable features. Having GPT estimate coordinates from frames would be amazing ... if it worked. We sacrificed a core feature because you're better off with nothing than garbage
What's next for Perception
- Persistent storage: using SQLite or a cloud backend for long-term spatial memory across different sessions and/or users
- Spatial queries: being able to ask things like "what's to my left?" and receiving a precise object list, or "describe the path to the exit" and getting a navigation route using landmarks and directions
- Multi-modal fusion: combining data from multiple sources (text, images, audio, sensor data...) to create a comprehensive and accurate 3D graph/model
Built With
- elevenlabs
- expo.io
- express.js
- google-cloud-speech-to-text
- node.js
- openai
- react-native
- react-native-svg