Inspiration
We were really excited to explore the latest capabilities of LLM multimodal understanding combined with ambient computing & context understanding from XR devices like Snap Spectacles. The idea of having an AI assistant that can see what you see and respond naturally felt like the future we wanted to build.
What it does
We built a conversational LLM system that lets you have natural conversations with AI through Spectacles glasses. Our Spectacles client actively listens to your speech, captures what you're looking at, sends both to our backend, and speaks the AI's response back to you through the speakers.
Think of it as having a knowledgeable friend who can see through your eyes and help you understand your environment. In future iterations, we're planning to add conversation history, RAG-based context retrieval, and tool-call integrations.
How we built it
For the Spectacles side (TypeScript + Lens Studio):
We implemented real-time camera capture with custom JPEG compression to keep frame uploads small and fast, integrated speech-to-text using Spectacles' VoiceML, built a WebSocket client for low-latency communication, and added text-to-speech responses with visual feedback.
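For illustration, here's a minimal Python sketch of what one multimodal turn could look like on the backend's side of the wire, assuming a JSON envelope with a base64-encoded JPEG frame; the field names are our own for this example, not the exact schema the client sends.

```python
# Hypothetical envelope for one client turn; field names are illustrative,
# not the exact schema our client sends.
import base64

from pydantic import BaseModel


class ClientTurn(BaseModel):
    session_id: str       # lets the server group turns into one conversation
    transcript: str       # speech-to-text result from VoiceML
    image_jpeg_b64: str   # current camera frame, JPEG-compressed then base64-encoded
    timestamp_ms: int     # client capture time, handy for latency measurements


def encode_frame(jpeg_bytes: bytes) -> str:
    """Base64-encode a compressed JPEG frame so it can travel inside a JSON message."""
    return base64.b64encode(jpeg_bytes).decode("ascii")
```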
For the backend (Python + FastAPI):
We created a multi-provider LLM system that automatically switches between OpenAI, Gemini, and Claude APIs. Our WebSocket server handles the multimodal messages, and we built in-memory session management to keep conversations flowing smoothly.
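As a rough sketch of how those pieces might fit together (not our production code), here's a FastAPI WebSocket loop wired to an ordered provider-fallback helper and an in-memory session store; the placeholder provider and message fields are assumptions for illustration.

```python
# A rough sketch (not our production code): a FastAPI WebSocket loop wired to an
# ordered provider-fallback helper and an in-memory session store.
from collections import defaultdict
from typing import Callable

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# In-memory session store: session_id -> list of prior (role, text) turns.
sessions: dict[str, list[tuple[str, str]]] = defaultdict(list)


def _placeholder_provider(transcript: str, image_b64: str) -> str:
    # Stand-in for a real OpenAI / Gemini / Claude adapter.
    return f"(echo) {transcript}"


# Providers are tried in order; real adapters would plug in here.
PROVIDERS: list[Callable[[str, str], str]] = [_placeholder_provider]


def ask_llm(transcript: str, image_b64: str) -> str:
    """Try each provider in order and return the first successful reply."""
    last_error: Exception | None = None
    for provider in PROVIDERS:
        try:
            return provider(transcript, image_b64)
        except Exception as exc:  # on any failure, fall back to the next provider
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


@app.websocket("/ws")
async def converse(websocket: WebSocket) -> None:
    await websocket.accept()
    try:
        while True:
            msg = await websocket.receive_json()
            session = sessions[msg["session_id"]]
            session.append(("user", msg["transcript"]))
            # NOTE: a real handler would offload this blocking call
            # (see the sketch in the Challenges section below).
            reply = ask_llm(msg["transcript"], msg["image_jpeg_b64"])
            session.append(("assistant", reply))
            await websocket.send_json({"text": reply})
    except WebSocketDisconnect:
        pass  # the socket is gone, but the in-memory session survives for reconnects
```

Trying providers in a fixed order keeps the fallback logic trivial and lets real OpenAI, Gemini, and Claude adapters plug in behind one callable signature.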
Challenges we ran into
- Learning Lens Studio from scratch
- Wrestling with threading and non-blocking processing to keep Spectacles responsive
- Getting reliable two-way socket communication working between client and server
- Long debugging sessions spent optimizing latency and stabilizing the socket connection (see the sketch after this list)
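On the non-blocking point above, here's a simplified illustration (not our actual handler) of offloading a slow synchronous provider call to a worker thread with asyncio.to_thread, so the event loop, and therefore the WebSocket, stays responsive.

```python
# Simplified stand-in, not our actual handler: offload a slow synchronous
# provider call to a worker thread so the asyncio event loop (and therefore
# the WebSocket) stays responsive.
import asyncio
import time


def blocking_llm_call(transcript: str) -> str:
    """Stand-in for a synchronous provider SDK call that may take seconds."""
    time.sleep(2)  # simulate network + inference latency
    return f"(reply to) {transcript}"


async def handle_turn(transcript: str) -> str:
    # Run the slow call in a thread so pings, disconnects, and other turns
    # are not starved while we wait for the model.
    return await asyncio.to_thread(blocking_llm_call, transcript)


if __name__ == "__main__":
    print(asyncio.run(handle_turn("what am I looking at?")))
```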
Accomplishments that we're proud of
- We got LLM integration + multimodal understanding working! 🎉
- First Spectacles-based XR app we've built; we're excited to learn this new tech!
- We designed APIs that actually solve real problems we encountered
What we learned
- XR is a fun and interesting space to explore
- Client-server socket-based communication takes a lot of time to test, debug, and get right
- XR, dictation, TTS, and LLMs are all resource-intensive; it takes a lot of design thought and optimization to get things right
What's next for XR-RAG
- Clean up the API design
- Streaming API support
- Enable the retrieval stack and explore supporting tool calls
Built With
- fastapi
- langchain
- lensstudio
- openai
- python
- typescript
- websockets
