Inspiration
We were really excited to explore the latest capabilities of LLM multimodal understanding combined with ambient computing & context understanding from XR devices like Snap Spectacles. The idea of having an AI assistant that can see what you see and respond naturally felt like the future we wanted to build.
What it does
We built a conversational LLM system that lets you have natural conversations with AI through Spectacles glasses. Our Spectacles client actively listens to your speech, captures what you're looking at, sends both to our backend, and speaks the AI's response back to you through the speakers.
Think of it as having a knowledgeable friend who can see through your eyes and help you understand your environment. In future iterations, we're planning to add conversation history, RAG-based context retrieval, and tool-call integrations.
How we built it
For the Spectacles side (TypeScript + Lens Studio):
We implemented real-time camera capture with custom JPEG compression to keep frame uploads small and fast, integrated speech-to-text using Spectacles' VoiceML, built a WebSocket client for low-latency communication, and added text-to-speech responses with visual feedback.
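For illustration, here's a minimal Python sketch of what one multimodal turn could look like on the backend's side of the wire, assuming a JSON envelope with a base64-encoded JPEG frame; the field names are our own for this example, not the exact schema the client sends.

```python
# Hypothetical envelope for one client turn; field names are illustrative,
# not the exact schema our client sends.
import base64

from pydantic import BaseModel


class ClientTurn(BaseModel):
    session_id: str       # lets the server group turns into one conversation
    transcript: str       # speech-to-text result from VoiceML
    image_jpeg_b64: str   # current camera frame, JPEG-compressed then base64-encoded
    timestamp_ms: int     # client capture time, handy for latency measurements


def encode_frame(jpeg_bytes: bytes) -> str:
    """Base64-encode a compressed JPEG frame so it can travel inside a JSON message."""
    return base64.b64encode(jpeg_bytes).decode("ascii")
```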
For the backend (Python + FastAPI):
We created a multi-provider LLM system that automatically switches between OpenAI, Gemini, and Claude APIs. Our WebSocket server handles the multimodal messages, and we built in-memory session management to keep conversations flowing smoothly.
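As a rough sketch of how those pieces might fit together (not our production code), here's a FastAPI WebSocket loop wired to an ordered provider-fallback helper and an in-memory session store; the placeholder provider and message fields are assumptions for illustration.

```python
# A rough sketch (not our production code): a FastAPI WebSocket loop wired to an
# ordered provider-fallback helper and an in-memory session store.
from collections import defaultdict
from typing import Callable

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# In-memory session store: session_id -> list of prior (role, text) turns.
sessions: dict[str, list[tuple[str, str]]] = defaultdict(list)


def _placeholder_provider(transcript: str, image_b64: str) -> str:
    # Stand-in for a real OpenAI / Gemini / Claude adapter.
    return f"(echo) {transcript}"


# Providers are tried in order; real adapters would plug in here.
PROVIDERS: list[Callable[[str, str], str]] = [_placeholder_provider]


def ask_llm(transcript: str, image_b64: str) -> str:
    """Try each provider in order and return the first successful reply."""
    last_error: Exception | None = None
    for provider in PROVIDERS:
        try:
            return provider(transcript, image_b64)
        except Exception as exc:  # on any failure, fall back to the next provider
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


@app.websocket("/ws")
async def converse(websocket: WebSocket) -> None:
    await websocket.accept()
    try:
        while True:
            msg = await websocket.receive_json()
            session = sessions[msg["session_id"]]
            session.append(("user", msg["transcript"]))
            # NOTE: a real handler would offload this blocking call
            # (see the sketch in the Challenges section below).
            reply = ask_llm(msg["transcript"], msg["image_jpeg_b64"])
            session.append(("assistant", reply))
            await websocket.send_json({"text": reply})
    except WebSocketDisconnect:
        pass  # the socket is gone, but the in-memory session survives for reconnects
```

Trying providers in a fixed order keeps the fallback logic trivial and lets real OpenAI, Gemini, and Claude adapters plug in behind one callable signature.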
Challenges we ran into
- Learning Lens Studio from scratch
- Wrestling with threading and non-blocking processing to keep Spectacles responsive
- Getting reliable two-way socket communication working between client and server
- Long debugging sessions spent optimizing latency and stabilizing the socket connection (see the sketch after this list)
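On the non-blocking point above, here's a simplified illustration (not our actual handler) of offloading a slow synchronous provider call to a worker thread with asyncio.to_thread, so the event loop, and therefore the WebSocket, stays responsive.

```python
# Simplified stand-in, not our actual handler: offload a slow synchronous
# provider call to a worker thread so the asyncio event loop (and therefore
# the WebSocket) stays responsive.
import asyncio
import time


def blocking_llm_call(transcript: str) -> str:
    """Stand-in for a synchronous provider SDK call that may take seconds."""
    time.sleep(2)  # simulate network + inference latency
    return f"(reply to) {transcript}"


async def handle_turn(transcript: str) -> str:
    # Run the slow call in a thread so pings, disconnects, and other turns
    # are not starved while we wait for the model.
    return await asyncio.to_thread(blocking_llm_call, transcript)


if __name__ == "__main__":
    print(asyncio.run(handle_turn("what am I looking at?")))
```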
Accomplishments that we're proud of
- We got LLM integration + multimodal understanding working! 🎉
- First Spectacles-based XR app we've built; we're excited to learn this new tech!
- We designed APIs that actually solve real problems we encountered
What we learned
- XR is a fun and interesting space to explore
- Client-server socket-based communication takes a lot of time to test, debug, and get right
- XR, dictation, TTS, and LLMs are all resource-intensive; it takes a lot of design thought and optimization to get things right
What's next for XR-RAG
- Clean up the API design
- Streaming API support
- Enable the retrieval stack and explore supporting tool calls
Built With
- fastapi
- langchain
- lensstudio
- openai
- python
- typescript
- websockets
