Inspiration

We were excited to explore the latest multimodal understanding capabilities of LLMs, combined with the ambient computing and contextual awareness that XR devices like Snap Spectacles provide. The idea of an AI assistant that can see what you see and respond naturally felt like the future we wanted to build.

What it does

We built a conversational LLM system that lets you talk naturally with AI through Spectacles glasses. Our Spectacles client actively listens to your speech and captures what you're looking at, sends both to our backend, and speaks the AI's response back to you through the speakers.

Think of it as having a knowledgeable friend who can see through your eyes and help you understand your environment. In future iterations, we're planning to add conversation history, RAG-based context retrieval, and tool-call integrations.
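
Concretely, each exchange is a single multimodal "turn": the speech transcript and a snapshot of the wearer's view go up to the backend, and a text reply comes back to be spoken aloud. A rough sketch of the shapes involved (field names here are illustrative, not our exact wire format):

```typescript
// Illustrative message shapes for one turn; field names are hypothetical,
// not the exact wire format.
interface TurnMessage {
  type: "turn";
  sessionId: string;        // groups turns into one conversation
  transcript: string;       // what the wearer just said (speech-to-text)
  imageJpegBase64: string;  // compressed snapshot of what they are looking at
}

interface ReplyMessage {
  type: "reply";
  sessionId: string;
  text: string;             // spoken back to the wearer via text-to-speech
}
```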

How we built it

For the Spectacles side (TypeScript + Lens Studio):

We implemented real-time camera capture with custom JPEG compression to keep payloads small, integrated speech-to-text using Spectacles' VoiceML, built a WebSocket client for low-latency two-way communication, and added text-to-speech responses with visual feedback.
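
As a rough illustration of the client turn loop (the camera capture and text-to-speech calls are abstracted behind hypothetical helpers here; the real Lens Studio bindings look different), the client only sends a new turn once the previous reply has arrived, which helps keep the glasses responsive:

```typescript
// Sketch of the client turn loop. captureJpegBase64 and speak are hypothetical
// stand-ins for the Lens Studio camera-capture and text-to-speech bindings.
let awaitingReply = false;

function connect(url: string, speak: (text: string) => void): WebSocket {
  const socket = new WebSocket(url);
  socket.onmessage = (event: MessageEvent) => {
    const reply = JSON.parse(event.data as string);
    if (reply.type === "reply") {
      speak(reply.text);      // voice the answer through the speakers
      awaitingReply = false;  // ready for the next turn
    }
  };
  return socket;
}

// Called whenever speech-to-text produces a final transcript.
function onTranscript(
  socket: WebSocket,
  sessionId: string,
  transcript: string,
  captureJpegBase64: () => string
): void {
  if (awaitingReply || socket.readyState !== WebSocket.OPEN) return; // don't overlap turns
  awaitingReply = true;
  socket.send(JSON.stringify({
    type: "turn",
    sessionId,
    transcript,
    imageJpegBase64: captureJpegBase64(), // compressed camera frame
  }));
}
```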

For the backend (Python + FastAPI):

We created a multi-provider LLM system that automatically falls back between the OpenAI, Gemini, and Claude APIs. Our WebSocket server handles the multimodal messages, and we built in-memory session management to keep conversations flowing smoothly.
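
The fallback itself is simple ordering logic: try each provider in turn and return the first successful answer, with per-session state kept in memory. Here is a minimal sketch, written in TypeScript to match the client examples even though the real service is Python + FastAPI; `callProvider` is a hypothetical stand-in for the per-vendor SDK wrappers:

```typescript
// Minimal sketch of the provider fallback; callProvider is a hypothetical
// wrapper around each vendor's SDK (the real backend is Python + FastAPI).
type Provider = "openai" | "gemini" | "claude";

const FALLBACK_ORDER: Provider[] = ["openai", "gemini", "claude"];

async function askWithFallback(
  callProvider: (p: Provider, transcript: string, imageJpegBase64: string) => Promise<string>,
  transcript: string,
  imageJpegBase64: string
): Promise<string> {
  let lastError: unknown;
  for (const provider of FALLBACK_ORDER) {
    try {
      // First provider that answers wins; failures fall through to the next one.
      return await callProvider(provider, transcript, imageJpegBase64);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError instanceof Error ? lastError : new Error("all providers failed");
}
```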

Challenges we ran into

  • Learning Lens Studio from scratch
  • Wrestling with threading and non-blocking processing to keep Spectacles responsive
  • Getting reliable two-way socket communication working between client and server
  • Long debugging sessions spent optimizing latency and stabilizing socket connections

Accomplishments that we're proud of

  • We got LLM integration + multimodal understanding working! 🎉
  • This is the first Spectacles-based XR app we've built, and we're excited to keep learning this new tech!
  • We designed APIs that actually solve real problems we encountered

What we learned

  • XR is a fun and interesting space to explore
  • Client-server socket-based communication takes a lot of time to test, debug, and get right
  • XR, dictation, TTS, and LLMs are all resource-intensive - they need careful design and optimization to get right

What's next for XR-RAG

  • Clean up the API design
  • Streaming API support
  • Enable the retrieval stack and explore supporting tool calls
