Sentinel - AI Voice Assistant
Inspiration
The inspiration for Sentinel came from recognizing a gap in how people interact with AI assistants. While text-based chatbots have become ubiquitous, voice-first AI experiences that can understand context from your screen remain limited. We wanted to create an assistant that feels natural - one you can talk to while working on anything, whether you're coding, reading documents, researching, or solving problems.
Much like a sentinel, a soldier or guard whose role is to stand watch, Sentinel is designed to quietly observe your workspace, understand what’s in front of you, and step in only when needed.
The vision was simple: combine the natural flow of voice conversation with the power of visual context to create an AI assistant that truly understands what you're working on and can provide relevant, context-aware help in real-time.
What it does
Sentinel is a real-time AI voice assistant that combines:
- Voice Interaction: Natural, conversational voice interface powered by ElevenLabs STT and TTS
- Screen Context: Visual understanding of your screen to provide context-aware assistance
- Real-time Processing: LiveKit-powered infrastructure for low-latency voice conversations
- Intelligent Responses: Google Gemini integration for accurate, helpful answers on any topic
- Session Insights: Automated session recaps with key insights and recommendations
Unlike traditional assistants limited to text, Sentinel understands not just what you're saying, but what you're looking at - whether that's code, documents, web pages, or any other content on your screen. Ask questions naturally, get instant answers, and receive helpful guidance tailored to your current context.
Key Features:
- Ask questions about anything - coding, general knowledge, problem-solving, and more
- Screen sharing integration for context-aware responses
- Real-time transcription and voice responses
- Session recaps with conversation summaries
- Privacy-first controls with optional screen blurring
How we built it
Architecture
Sentinel is built with a modern, real-time architecture:
Frontend (Next.js + TypeScript)
- Next.js 16 with React 19 for the web interface
- LiveKit React components for real-time audio/video
- Tailwind CSS for responsive, modern UI
- Zustand for state management
- Screen capture API for desktop sharing
Backend (Python)
- LiveKit Agents framework for voice pipeline
- ElevenLabs Scribe for speech-to-text
- Google Gemini 2.5 Flash for LLM responses
- ElevenLabs TTS for natural voice synthesis
- Silero VAD for voice activity detection
- Overshoot integration for screen analysis
Real-time Infrastructure
- LiveKit Cloud for WebRTC-based real-time communication
- WebSocket connections for bidirectional data messaging
- Token-based authentication for secure room access
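LiveKit room access tokens are standard HS256 JWTs carrying a `video` grants claim. As a rough illustration of what the token-based authentication above produces (in practice the `livekit-api` SDK signs these for you), here is a minimal stdlib-only sketch; the helper name and TTL are our own:

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # JWT-style base64url encoding, without padding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_room_token(api_key: str, api_secret: str, identity: str,
                    room: str, ttl: int = 3600) -> str:
    """Sign a minimal HS256 JWT granting access to one LiveKit room.
    Illustrative only -- use the livekit-api SDK in real code."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,      # API key identifies the project
        "sub": identity,     # participant identity
        "nbf": now,
        "exp": now + ttl,
        "video": {"roomJoin": True, "room": room},  # grants claim
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}."
        f"{b64url(json.dumps(payload).encode())}"
    ).encode()
    sig = hmac.new(api_secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{signing_input.decode()}.{b64url(sig)}"
```

The backend mints one of these per participant; the frontend presents it when joining the room, so room access never requires exposing the API secret to the browser.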
Technology Stack
- Frontend: Next.js, React, TypeScript, Tailwind CSS
- Backend: Python 3.10+, LiveKit Agents, Gemini API
- STT/TTS: ElevenLabs Scribe & TTS
- LLM: Google Gemini 2.5 Flash
- Real-time: LiveKit Cloud
- Screen Analysis: Overshoot API
Key Implementation Details
- Voice Pipeline: Implemented a complete voice agent pipeline that handles VAD → STT → LLM → TTS with low latency
- Screen Context Sync: Real-time screen capture analysis and context messaging to the agent
- Data Protocol: Custom JSON message protocol for transcripts, status updates, and screen context
- Session Management: State management for conversations, transcripts, and session metadata
- UI/UX: Modern, glass-morphism design with real-time status indicators and conversation history
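The data protocol mentioned above wraps every payload in a common envelope so transcripts, status updates, and screen context can share one channel. A minimal sketch of that idea (field names here are illustrative, not Sentinel's actual schema):

```python
import json, time, uuid

def make_message(kind: str, body: dict) -> bytes:
    """Wrap a payload in an envelope with an id and timestamp.
    kind is e.g. "transcript", "status", or "screen_context"."""
    envelope = {
        "id": str(uuid.uuid4()),  # unique per message, for acknowledgments
        "ts": time.time(),        # sender clock, used for ordering
        "kind": kind,
        "body": body,
    }
    return json.dumps(envelope).encode("utf-8")

def parse_message(raw: bytes) -> dict:
    """Decode and validate an incoming envelope."""
    msg = json.loads(raw.decode("utf-8"))
    for field in ("id", "ts", "kind", "body"):
        if field not in msg:
            raise ValueError(f"malformed message: missing {field!r}")
    return msg
```

Both sides send these bytes over LiveKit's data channel and dispatch on `kind`, which keeps the frontend and agent decoupled from each other's internals.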
Challenges we ran into
1. Real-time Latency Optimization
One of our biggest challenges was minimizing end-to-end latency for natural voice conversations. We had to optimize:
- STT streaming for faster transcription
- LLM response generation time
- TTS audio synthesis speed
- Network round-trip times
Solution: We implemented streaming STT, chose Gemini 2.5 Flash for faster inference, and optimized the LiveKit pipeline for minimal delays.
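One latency trick worth making concrete: instead of waiting for the full LLM response before synthesizing audio, stream tokens and hand each complete sentence to TTS as soon as it is ready. This is a simplified stand-in for what streaming voice pipelines like LiveKit's do internally, not the framework's actual code:

```python
from typing import Callable, Iterable, Iterator

def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences so TTS can start early."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "!", "?")):
            yield "".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence
        yield "".join(buf)

def speak_streaming(llm_tokens: Iterable[str], tts: Callable[[str], object]):
    """Synthesize sentence by sentence -- the first audio chunk plays
    while the LLM is still generating the rest of the answer."""
    return [tts(s) for s in sentences(llm_tokens)]
```

Perceived latency then depends on time-to-first-sentence rather than time-to-full-response, which is what makes the conversation feel natural.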
2. Screen Context Integration
Integrating screen capture with meaningful context extraction proved complex. We needed to:
- Capture screen frames efficiently
- Analyze content without violating privacy
- Send relevant context to the agent in real-time
- Handle different display surfaces (monitor, window, tab)
Solution: Implemented a flexible screen sharing system with Overshoot API integration for intelligent content analysis, plus privacy controls for sensitive data.
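The privacy control amounts to masking user-chosen regions of each frame before it is analyzed. A toy sketch of that step (Sentinel blurs rather than zeroes pixels, and real frames are image buffers, not nested lists):

```python
def redact(frame, regions):
    """Mask rectangular regions (x, y, w, h) of a grayscale frame,
    given as a list of rows, before the frame leaves the capture layer.
    Illustrative of the blur/redact privacy control only."""
    out = [row[:] for row in frame]  # never mutate the original frame
    for (x, y, w, h) in regions:
        for r in range(y, min(y + h, len(out))):
            for c in range(x, min(x + w, len(out[r]))):
                out[r][c] = 0
    return out
```

Doing this before any frame is sent for analysis means sensitive content never reaches the agent or the Overshoot API at all.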
3. WebRTC and LiveKit Configuration
Setting up reliable WebRTC connections and understanding LiveKit's room/token system had a learning curve. Getting the agent and frontend to properly connect and maintain stable connections required careful configuration.
Solution: Spent time understanding LiveKit's architecture, implemented proper token generation, and used both Worker and Direct Join modes for flexibility.
4. Voice Activity Detection
Getting VAD to work smoothly - detecting when users start and stop speaking without cutting them off mid-sentence - required fine-tuning.
Solution: Used Silero VAD with appropriate thresholds and implemented proper turn-taking logic in the agent pipeline.
5. State Synchronization
Keeping frontend state in sync with agent state (transcripts, status, screen context) across WebSocket connections was challenging.
Solution: Implemented a robust data message protocol with timestamps and acknowledgments, plus client-side state management with Zustand.
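The timestamps make stale updates safe to ignore. The merge rule on the client reduces to last-write-wins per key, roughly like this (field names are illustrative, and the real store lives in Zustand on the frontend):

```python
def apply_update(state: dict, msg: dict) -> dict:
    """Last-write-wins merge keyed by sender timestamp: a duplicate or
    out-of-order message (older `ts` for the same key) is dropped, so
    frontend and agent state converge even over an unordered channel."""
    key = msg["key"]
    current = state.get(key)
    if current is None or msg["ts"] > current["ts"]:
        state[key] = {"ts": msg["ts"], "value": msg["value"]}
    return state
```

Because the rule is idempotent and order-independent, retransmits and races between the WebSocket and the UI cannot corrupt the displayed status or transcript.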
Accomplishments that we're proud of
Complete Voice-to-Voice Pipeline: We built a fully functional, end-to-end voice assistant that handles the complete flow from speech input to voice output with real-time processing.
Context-Aware Intelligence: The integration of screen context with voice interaction creates a uniquely powerful experience - the assistant can see what you're working on and provide relevant help.
Beautiful, Modern UI: Created an intuitive, polished interface with glass-morphism design, real-time status indicators, and smooth animations that make the experience feel professional.
Flexible Architecture: Built a system that works across different domains - not just coding, but any question or task. The assistant adapts to whatever the user needs help with.
Production-Ready Infrastructure: Although Sentinel is a hackathon project, we built it with scalability and reliability in mind, using LiveKit Cloud, proper error handling, and a clean code architecture.
Privacy-First Design: Implemented controls for screen sharing with the ability to blur sensitive regions and pause sharing, putting user privacy first.
Session Insights: Automated recap generation that provides valuable summaries and learning recommendations after each session.
What we learned
Technical Learnings
- Real-time WebRTC: Deep dive into WebRTC protocols, LiveKit's architecture, and how to build low-latency voice applications
- Voice AI Pipeline: Understanding the complete flow from VAD through STT, LLM, and TTS - and how to optimize each stage
- Screen Capture APIs: Learning about browser screen sharing APIs, display surface detection, and privacy considerations
- State Management: Building robust state synchronization between frontend and backend over WebSocket connections
- LLM Integration: Working with Gemini API, prompt engineering, and context management for conversational AI
Product Learnings
- Voice UX: Voice interfaces require different design patterns than text - shorter responses, natural interruptions, clear status indicators
- Context is King: Screen context dramatically improves the quality and relevance of AI assistance
- Privacy Matters: Users need control over what they share - transparency and controls build trust
- Real-time Feels Magic: The real-time nature of the interaction makes it feel more like talking to a person than using a tool
Process Learnings
- API Integration Complexity: Integrating multiple APIs (LiveKit, ElevenLabs, Gemini, Overshoot) required understanding each service's capabilities and limitations
- Testing Voice Apps: Testing voice applications is more challenging than traditional web apps - requires different tools and approaches
- Documentation is Critical: Good documentation (like LiveKit's) makes a huge difference in development speed
What's next for Sentinel
Our main priority is expanding Sentinel to help keep you focused on your work: using the live camera feedback provided by Overshoot, it could detect when you're distracted and call you out.
Short-term Improvements
- Enhanced Screen Analysis: Improve the screen context understanding with better OCR, layout analysis, and multi-modal understanding
- Multi-language Support: Add support for multiple languages in STT and TTS
- Voice Customization: Allow users to choose different voices, speeds, and personalities for the assistant
- Mobile App: Build native iOS and Android apps for mobile voice assistance
Platform Expansion
- Enterprise Features: Role-based access, team management, usage analytics, and compliance features
- API for Developers: Open up Sentinel's capabilities via API for developers to build their own voice assistants
- Marketplace: Create a marketplace for custom personalities, voices, and specialized assistant configurations
- Educational Focus: Specialized modes for learning, tutoring, and skill development with structured lessons
Technical Enhancements
- Offline Mode: Local processing options for privacy-sensitive use cases
- Cost Optimization: Implement caching, smart routing, and optimization to reduce API costs
The vision for Sentinel is to become the default way people interact with AI - not through typing, but through natural conversation enhanced by visual understanding. We see a future where voice-first, context-aware AI assistants help people be more productive, learn faster, and solve problems more effectively.