Inspiration
A few weeks ago, we saw a viral reel of an AI assistant coaching a player in real-time during a Valorant match. It was able to understand the game state, call out enemy positions, and react to situations instantly. It was really cool, but we thought: why limit this to a single game? We wanted to build something bigger.
What if an AI could understand spatial relationships across any environment—not just during one continuous session, but across multiple observations over time? Imagine an AI that remembers "the exit sign is above the window" even after you leave the room, then uses that knowledge hours later to navigate the same space or help someone else. An assistant that builds a persistent mental map (graph) of spaces and can react intelligently to any query about those relationships in real-time. We kept this idea in mind and decided to bring it to life!
What it does
Perception is a mobile app that gives LLMs real-time perception (haha) through camera and microphone input, with spatial awareness from a custom graph database. Most AI systems see -> act -> forget. Our system sees -> structures -> remembers -> reasons -> reuses.
Its core features include:
- Real-time camera feed: continuous frame capture and analysis using GPT-4V
- Audio input & transcription: speech-to-text with the Whisper API, plus Voice Activity Detection
- Spatial graph memory: in-memory graph storing detected objects and their relationships (left_of, in_front_of...)
- Graph visualizer: interactive SVG diagram of spatial relationships with relationship-aware node positioning
- Accessible spoken walkthrough: TTS reads the graph structure aloud, with synchronized closed captions throughout
- Voice personas: multiple TTS voices to customize LLM persona (friendly, serious, silly, flirty)
- MCP server: RESTful tool endpoints that let the LLM query the spatial graph
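At the core is that in-memory graph. Here's a minimal TypeScript sketch of the idea — the names, the relation set, and the dedup rule are illustrative simplifications, not our actual code:

```typescript
// Illustrative sketch of the spatial graph: edges carry a confidence score
// and a timestamp, duplicates keep the higher-confidence observation, and
// stale edges are auto-pruned.
type Relation = "left_of" | "right_of" | "above" | "below" | "in_front_of" | "behind";

interface Edge {
  from: string;
  to: string;
  relation: Relation;
  confidence: number; // 0..1, reported by the vision model
  lastSeen: number;   // ms timestamp, for temporal awareness
}

class SpatialGraph {
  private edges = new Map<string, Edge>();

  addRelation(from: string, to: string, relation: Relation, confidence: number): void {
    const key = `${from}|${relation}|${to}`;
    const existing = this.edges.get(key);
    if (!existing || confidence >= existing.confidence) {
      // New edge, or a stronger observation of a known one: overwrite.
      this.edges.set(key, { from, to, relation, confidence, lastSeen: Date.now() });
    } else {
      // Weaker duplicate: just refresh the timestamp so it isn't pruned.
      existing.lastSeen = Date.now();
    }
  }

  // e.g. query("above", "window") asks "what is above the window?"
  query(relation: Relation, to: string): string[] {
    return [...this.edges.values()]
      .filter(e => e.relation === relation && e.to === to)
      .map(e => e.from);
  }

  // Drop edges not re-observed within maxAgeMs.
  prune(maxAgeMs: number): void {
    const now = Date.now();
    for (const [key, e] of this.edges) {
      if (now - e.lastSeen > maxAgeMs) this.edges.delete(key);
    }
  }
}
```

Keying edges by `from|relation|to` is what makes deduplication a single map lookup instead of a scan.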
The implications are endless! The most obvious is robotics: right now, many robots struggle with semantic understanding and don't retain long-term contextual memory well. Our system adds semantic spatial memory, language-queryable world models, and persistent object relationships, so robots could reorient themselves faster, reason about absent objects, and collaborate with humans using natural language. Our system could also benefit interior design + architecture (huge sleeper use case), accessibility + cognitive prosthetics, gaming + mixed reality, and security + compliance fields.
How we built it
Our Process:
- Brainstorming: camera input -> vision API -> spatial graph (nodes + edges??) -> LLM queries. Thinking about a mobile frontend, backend graph server, and API integrations...
- Set up the React Native mobile app and Node.js backend in parallel to maximize our time
- Built the spatial relationship parser incrementally: first hardcoded examples, then integrated GPT-4V + refined the prompt based on real camera frames
- Instead of storing raw transcripts, we invested early in a custom in-memory graph with confidence scoring, deduplication, and temporal awareness
- Added closed captions + voice personas early -> when we ran out of free ElevenLabs credits, we pivoted to a smart fallback (caption cycling) rather than removing TTS entirely
- Tested with live camera feeds of our workspace, and debugged edge cases (overlapping objects, low-light, fast movement)
- Built the SVG visualizer and text-to-speech walkthrough together so they stayed in sync
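The spatial relationship parser step boils down to a prompt plus defensive JSON extraction. This is a simplified stand-in — the real prompt went through many iterations against live frames, and `parseRelations` / `ChatClient` are illustrative names, not our actual code:

```typescript
// Simplified sketch of the frame -> relationships step.
const EXTRACTION_PROMPT =
  'List the objects you can actually see and their spatial relationships as a JSON array: ' +
  '[{"from": "...", "relation": "left_of|right_of|above|below|in_front_of|behind", ' +
  '"to": "...", "confidence": 0.0-1.0}]. Do not guess at objects outside the frame.';

interface RawRelation { from: string; relation: string; to: string; confidence: number; }

// Structural type for the slice of the OpenAI SDK client we use
// (in the real app this comes from `new OpenAI()` in the "openai" package).
type ChatClient = {
  chat: { completions: { create: (args: object) =>
    Promise<{ choices: { message: { content: string | null } }[] }> } };
};

// Models often wrap JSON in prose or markdown fences; pull out just the array.
function parseRelations(reply: string): RawRelation[] {
  const start = reply.indexOf("[");
  const end = reply.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  try {
    const parsed = JSON.parse(reply.slice(start, end + 1));
    return Array.isArray(parsed)
      ? parsed.filter((r) => r.from && r.relation && r.to)
      : [];
  } catch {
    return []; // one bad frame should not crash the capture loop
  }
}

async function analyzeFrame(client: ChatClient, base64Jpeg: string): Promise<RawRelation[]> {
  const res = await client.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: EXTRACTION_PROMPT },
        { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` } },
      ],
    }],
  });
  return parseRelations(res.choices[0].message.content ?? "");
}
```

Returning an empty list on malformed output is what let us keep the capture loop running even when a frame confused the model.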
We used a variety of tools in our tech stack:
- Frontend: React Native + Expo (for iOS/Android cross-platform)
- Backend: Node.js + Express (MCP server running on port 3001)
- Spatial memory: custom in-memory graph using confidence scoring, temporal tracking, and auto-pruning
- APIs: OpenAI (GPT-4V vision, Whisper transcription, LLM chat), ElevenLabs (TTS), Google Cloud Speech-to-Text
- Visualization: React-Native-SVG for dynamic graph rendering
- LLM integration: direct OpenAI function calls through tool endpoints
Steps to using the app:
- Mobile app continuously captures frames and audio
- GPT-4V analyzes frames for spatial relationship extraction
- Relationships + objects are sent to MCP server graph via REST
- LLM queries spatial memory to answer contextual questions
- ElevenLabs speaks the LLM's responses aloud, replying in real time or describing the user's surroundings
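On the server side, each tool the LLM can call is just a REST endpoint over the graph. Keeping the handler a pure function (all names here are illustrative, not our actual code) makes the Express wiring one line:

```typescript
// Sketch of one MCP-style tool endpoint. With Express the wiring is just:
//   app.get("/tools/query_graph", (req, res) => res.json(handleQuery(edges, req.query)));
//   app.listen(3001);

interface GraphEdge { from: string; relation: string; to: string; confidence: number; }

// Shape of the JSON the LLM's tool call gets back.
interface QueryResult { matches: string[]; count: number; }

// Both filters are optional, so the LLM can ask broad or narrow questions.
function handleQuery(
  edges: GraphEdge[],
  params: { relation?: string; to?: string },
): QueryResult {
  const matches = edges
    .filter(e =>
      (!params.relation || e.relation === params.relation) &&
      (!params.to || e.to === params.to))
    .map(e => e.from);
  return { matches, count: matches.length };
}
```

Separating the handler from the framework also made it trivial to unit-test without spinning up the server.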
Challenges we ran into
- Balancing vision API cost against speed was tough, so we implemented frame sampling (one frame per second)
- Objects with similar names sometimes merged unexpectedly; confidence scoring and fuzzy label matching solved this
- Getting the SVG rendering to match real-world placements was hard, including nodes overlapping in the visualizer (which we fixed with relationship-aware nudging + boundary clamping)
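The fuzzy label matching can be as simple as token-subset comparison. A hypothetical version (our actual rules were a bit more involved):

```typescript
// Stops near-duplicate labels ("mug" vs "coffee mug") from becoming two nodes.
function normalize(label: string): string {
  return label.toLowerCase().trim().replace(/\s+/g, " ");
}

// Labels match when one label's word set is contained in the other's.
function labelsMatch(a: string, b: string): boolean {
  const ta = new Set(normalize(a).split(" "));
  const tb = new Set(normalize(b).split(" "));
  const [small, big] = ta.size <= tb.size ? [ta, tb] : [tb, ta];
  return [...small].every(t => big.has(t));
}

// Merge an incoming detection into the known labels, preferring the more
// specific (longer) name.
function mergeLabel(known: string[], incoming: string): string[] {
  const i = known.findIndex(l => labelsMatch(l, incoming));
  if (i === -1) return [...known, incoming];
  const keep = known[i].length >= incoming.length ? known[i] : incoming;
  return known.map((l, j) => (j === i ? keep : l));
}
```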
Accomplishments that we're proud of
- Relationship-aware visualization: having nodes actually positioned relative to their spatial relationships (ex. left_of -> physically left in the visual)
- Accessible design: closed captions + voice personas
- MCP server integration: opens doors to multi-LLM support including Claude and Gemini
- No hallucination: the graph contains only what the vision API actually detected, so the LLM is grounded in real perception
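The relationship-aware positioning boils down to mapping each relation to a screen offset, nudging on collision, and clamping to the canvas. A simplified sketch — the offsets, canvas size, and function names are made up for illustration:

```typescript
interface Point { x: number; y: number; }

const W = 320, H = 480, STEP = 80; // illustrative canvas size and spacing

// left_of means physically left on screen, above means physically up, etc.
const OFFSETS: Record<string, Point> = {
  left_of:  { x: -STEP, y: 0 },
  right_of: { x:  STEP, y: 0 },
  above:    { x: 0, y: -STEP },
  below:    { x: 0, y:  STEP },
};

function clamp(v: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, v));
}

// Place a node relative to its anchor, nudge down until it stops overlapping
// already-placed nodes, then clamp to the SVG bounds.
function place(anchor: Point, relation: string, placed: Point[]): Point {
  const off = OFFSETS[relation] ?? { x: 0, y: 0 };
  let p: Point = { x: anchor.x + off.x, y: anchor.y + off.y };
  const collides = (q: Point) =>
    placed.some(o => Math.abs(o.x - q.x) < 40 && Math.abs(o.y - q.y) < 40);
  while (collides(p)) p = { x: p.x, y: p.y + 40 };
  return { x: clamp(p.x, 20, W - 20), y: clamp(p.y, 20, H - 20) };
}
```

Clamping last means a crowded corner can still overlap slightly, but the diagram always stays inside the viewport.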
What we learned
- LLMs perform much better when questions are grounded in actual perception and not hallucinations
- Structured spatial relationships are much more useful compared to unstructured transcriptions
- It's important to consider accessibility first! Captions, visualizations, and voice personas unlock usability for far more people than audio alone
- Accuracy is more important than adding unreliable features. Having GPT estimate coordinates from frames would be amazing ... if it worked. We sacrificed a core feature because you're better off with nothing than garbage
What's next for Perception
- Persistent storage: using SQLite or a cloud backend for long-term spatial memory across different sessions and/or users
- Spatial queries: being able to ask things like "what's to my left?" and receiving a precise object list, or "describe the path to the exit" and getting a navigation route using landmarks and directions
- Multi-modal fusion: combining data from multiple sources (text, images, audio, sensor data...) to create a comprehensive and accurate 3D graph/model
Built With
- elevenlabs
- expo.io
- express.js
- google-cloud-speech-to-text
- node.js
- openai
- react-native
- react-native-svg