Sentinel - AI Voice Assistant
Inspiration
The inspiration for Sentinel came from recognizing a gap in how people interact with AI assistants. While text-based chatbots have become ubiquitous, voice-first AI experiences that can understand context from your screen remain limited. We wanted to create an assistant that feels natural - one you can talk to while working on anything, whether you're coding, reading documents, researching, or solving problems.
Much like a sentinel, a soldier or guard whose role is to stand watch, Sentinel is designed to quietly observe your workspace, understand what’s in front of you, and step in only when needed.
The vision was simple: combine the natural flow of voice conversation with the power of visual context to create an AI assistant that truly understands what you're working on and can provide relevant, context-aware help in real-time.
What it does
Sentinel is a real-time AI voice assistant that combines:
- Voice Interaction: Natural, conversational voice interface powered by ElevenLabs STT and TTS
- Screen Context: Visual understanding of your screen to provide context-aware assistance
- Real-time Processing: LiveKit-powered infrastructure for low-latency voice conversations
- Intelligent Responses: Google Gemini integration for accurate, helpful answers on any topic
- Session Insights: Automated session recaps with key insights and recommendations
Unlike traditional assistants limited to text, Sentinel understands not just what you're saying, but what you're looking at - whether that's code, documents, web pages, or any other content on your screen. Ask questions naturally, get instant answers, and receive helpful guidance tailored to your current context.
Key Features:
- Ask questions about anything - coding, general knowledge, problem-solving, and more
- Screen sharing integration for context-aware responses
- Real-time transcription and voice responses
- Session recaps with conversation summaries
- Privacy-first controls with optional screen blurring
How we built it
Architecture
Sentinel is built with a modern, real-time architecture:
Frontend (Next.js + TypeScript)
- Next.js 16 with React 19 for the web interface
- LiveKit React components for real-time audio/video
- Tailwind CSS for responsive, modern UI
- Zustand for state management
- Screen capture API for desktop sharing
Backend (Python)
- LiveKit Agents framework for voice pipeline
- ElevenLabs Scribe for speech-to-text
- Google Gemini 2.5 Flash for LLM responses
- ElevenLabs TTS for natural voice synthesis
- Silero VAD for voice activity detection
- Overshoot integration for screen analysis
Real-time Infrastructure
- LiveKit Cloud for WebRTC-based real-time communication
- WebSocket connections for bidirectional data messaging
- Token-based authentication for secure room access
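LiveKit room access tokens are standard HS256 JWTs carrying a `video` grants claim. As a rough illustration of what the token-based authentication above produces (in practice the `livekit-api` SDK signs these for you), here is a minimal stdlib-only sketch; the helper name and TTL are our own:

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # JWT-style base64url encoding, without padding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_room_token(api_key: str, api_secret: str, identity: str,
                    room: str, ttl: int = 3600) -> str:
    """Sign a minimal HS256 JWT granting access to one LiveKit room.
    Illustrative only -- use the livekit-api SDK in real code."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,      # API key identifies the project
        "sub": identity,     # participant identity
        "nbf": now,
        "exp": now + ttl,
        "video": {"roomJoin": True, "room": room},  # grants claim
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}."
        f"{b64url(json.dumps(payload).encode())}"
    ).encode()
    sig = hmac.new(api_secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{signing_input.decode()}.{b64url(sig)}"
```

The backend mints one of these per participant; the frontend presents it when joining the room, so room access never requires exposing the API secret to the browser.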
Technology Stack
- Frontend: Next.js, React, TypeScript, Tailwind CSS
- Backend: Python 3.10+, LiveKit Agents, Gemini API
- STT/TTS: ElevenLabs Scribe & TTS
- LLM: Google Gemini 2.5 Flash
- Real-time: LiveKit Cloud
- Screen Analysis: Overshoot API
Key Implementation Details
- Voice Pipeline: Implemented a complete voice agent pipeline that handles VAD → STT → LLM → TTS with low latency
- Screen Context Sync: Real-time screen capture analysis and context messaging to the agent
- Data Protocol: Custom JSON message protocol for transcripts, status updates, and screen context
- Session Management: State management for conversations, transcripts, and session metadata
- UI/UX: Modern, glass-morphism design with real-time status indicators and conversation history
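The data protocol mentioned above wraps every payload in a common envelope so transcripts, status updates, and screen context can share one channel. A minimal sketch of that idea (field names here are illustrative, not Sentinel's actual schema):

```python
import json, time, uuid

def make_message(kind: str, body: dict) -> bytes:
    """Wrap a payload in an envelope with an id and timestamp.
    kind is e.g. "transcript", "status", or "screen_context"."""
    envelope = {
        "id": str(uuid.uuid4()),  # unique per message, for acknowledgments
        "ts": time.time(),        # sender clock, used for ordering
        "kind": kind,
        "body": body,
    }
    return json.dumps(envelope).encode("utf-8")

def parse_message(raw: bytes) -> dict:
    """Decode and validate an incoming envelope."""
    msg = json.loads(raw.decode("utf-8"))
    for field in ("id", "ts", "kind", "body"):
        if field not in msg:
            raise ValueError(f"malformed message: missing {field!r}")
    return msg
```

Both sides send these bytes over LiveKit's data channel and dispatch on `kind`, which keeps the frontend and agent decoupled from each other's internals.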
Challenges we ran into
1. Real-time Latency Optimization
One of our biggest challenges was minimizing end-to-end latency for natural voice conversations. We had to optimize:
- STT streaming for faster transcription
- LLM response generation time
- TTS audio synthesis speed
- Network round-trip times
Solution: We implemented streaming STT, chose Gemini 2.5 Flash for faster inference, and optimized the LiveKit pipeline for minimal delays.
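One latency trick worth making concrete: instead of waiting for the full LLM response before synthesizing audio, stream tokens and hand each complete sentence to TTS as soon as it is ready. This is a simplified stand-in for what streaming voice pipelines like LiveKit's do internally, not the framework's actual code:

```python
from typing import Callable, Iterable, Iterator

def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences so TTS can start early."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith((".", "!", "?")):
            yield "".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence
        yield "".join(buf)

def speak_streaming(llm_tokens: Iterable[str], tts: Callable[[str], object]):
    """Synthesize sentence by sentence -- the first audio chunk plays
    while the LLM is still generating the rest of the answer."""
    return [tts(s) for s in sentences(llm_tokens)]
```

Perceived latency then depends on time-to-first-sentence rather than time-to-full-response, which is what makes the conversation feel natural.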
2. Screen Context Integration
Integrating screen capture with meaningful context extraction proved complex. We needed to:
- Capture screen frames efficiently
- Analyze content without violating privacy
- Send relevant context to the agent in real-time
- Handle different display surfaces (monitor, window, tab)
Solution: Implemented a flexible screen sharing system with Overshoot API integration for intelligent content analysis, plus privacy controls for sensitive data.
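The privacy control amounts to masking user-chosen regions of each frame before it is analyzed. A toy sketch of that step (Sentinel blurs rather than zeroes pixels, and real frames are image buffers, not nested lists):

```python
def redact(frame, regions):
    """Mask rectangular regions (x, y, w, h) of a grayscale frame,
    given as a list of rows, before the frame leaves the capture layer.
    Illustrative of the blur/redact privacy control only."""
    out = [row[:] for row in frame]  # never mutate the original frame
    for (x, y, w, h) in regions:
        for r in range(y, min(y + h, len(out))):
            for c in range(x, min(x + w, len(out[r]))):
                out[r][c] = 0
    return out
```

Doing this before any frame is sent for analysis means sensitive content never reaches the agent or the Overshoot API at all.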
3. WebRTC and LiveKit Configuration
Setting up reliable WebRTC connections and understanding LiveKit's room/token system had a learning curve. Getting the agent and frontend to properly connect and maintain stable connections required careful configuration.
Solution: Spent time understanding LiveKit's architecture, implemented proper token generation, and used both Worker and Direct Join modes for flexibility.
4. Voice Activity Detection
Getting VAD to work smoothly - detecting when users start and stop speaking without cutting them off mid-sentence - required fine-tuning.
Solution: Used Silero VAD with appropriate thresholds and implemented proper turn-taking logic in the agent pipeline.
5. State Synchronization
Keeping frontend state in sync with agent state (transcripts, status, screen context) across WebSocket connections was challenging.
Solution: Implemented a robust data message protocol with timestamps and acknowledgments, plus client-side state management with Zustand.
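The timestamps make stale updates safe to ignore. The merge rule on the client reduces to last-write-wins per key, roughly like this (field names are illustrative, and the real store lives in Zustand on the frontend):

```python
def apply_update(state: dict, msg: dict) -> dict:
    """Last-write-wins merge keyed by sender timestamp: a duplicate or
    out-of-order message (older `ts` for the same key) is dropped, so
    frontend and agent state converge even over an unordered channel."""
    key = msg["key"]
    current = state.get(key)
    if current is None or msg["ts"] > current["ts"]:
        state[key] = {"ts": msg["ts"], "value": msg["value"]}
    return state
```

Because the rule is idempotent and order-independent, retransmits and races between the WebSocket and the UI cannot corrupt the displayed status or transcript.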
Accomplishments that we're proud of
Complete Voice-to-Voice Pipeline: We built a fully functional, end-to-end voice assistant that handles the complete flow from speech input to voice output with real-time processing.
Context-Aware Intelligence: The integration of screen context with voice interaction creates a uniquely powerful experience - the assistant can see what you're working on and provide relevant help.
Beautiful, Modern UI: Created an intuitive, polished interface with glass-morphism design, real-time status indicators, and smooth animations that make the experience feel professional.
Flexible Architecture: Built a system that works across different domains - not just coding, but any question or task. The assistant adapts to whatever the user needs help with.
Production-Ready Infrastructure: Although Sentinel is a hackathon project, we built it with scalability and reliability in mind, using LiveKit Cloud, proper error handling, and a clean code architecture.
Privacy-First Design: Implemented controls for screen sharing with the ability to blur sensitive regions and pause sharing, putting user privacy first.
Session Insights: Automated recap generation that provides valuable summaries and learning recommendations after each session.
What we learned
Technical Learnings
- Real-time WebRTC: Deep dive into WebRTC protocols, LiveKit's architecture, and how to build low-latency voice applications
- Voice AI Pipeline: Understanding the complete flow from VAD through STT, LLM, and TTS - and how to optimize each stage
- Screen Capture APIs: Learning about browser screen sharing APIs, display surface detection, and privacy considerations
- State Management: Building robust state synchronization between frontend and backend over WebSocket connections
- LLM Integration: Working with Gemini API, prompt engineering, and context management for conversational AI
Product Learnings
- Voice UX: Voice interfaces require different design patterns than text - shorter responses, natural interruptions, clear status indicators
- Context is King: Screen context dramatically improves the quality and relevance of AI assistance
- Privacy Matters: Users need control over what they share - transparency and controls build trust
- Real-time Feels Magic: The real-time nature of the interaction makes it feel more like talking to a person than using a tool
Process Learnings
- API Integration Complexity: Integrating multiple APIs (LiveKit, ElevenLabs, Gemini, Overshoot) required understanding each service's capabilities and limitations
- Testing Voice Apps: Testing voice applications is more challenging than traditional web apps - requires different tools and approaches
- Documentation is Critical: Good documentation (like LiveKit's) makes a huge difference in development speed
What's next for Sentinel
Our main priority is expanding Sentinel to help keep you focused on your work: using the live camera feedback provided by Overshoot, it could detect when you're distracted and call you out.
Short-term Improvements
- Enhanced Screen Analysis: Improve the screen context understanding with better OCR, layout analysis, and multi-modal understanding
- Multi-language Support: Add support for multiple languages in STT and TTS
- Voice Customization: Allow users to choose different voices, speeds, and personalities for the assistant
- Mobile App: Build native iOS and Android apps for mobile voice assistance
Platform Expansion
- Enterprise Features: Role-based access, team management, usage analytics, and compliance features
- API for Developers: Open up Sentinel's capabilities via API for developers to build their own voice assistants
- Marketplace: Create a marketplace for custom personalities, voices, and specialized assistant configurations
- Educational Focus: Specialized modes for learning, tutoring, and skill development with structured lessons
Technical Enhancements
- Offline Mode: Local processing options for privacy-sensitive use cases
- Cost Optimization: Implement caching, smart routing, and optimization to reduce API costs
The vision for Sentinel is to become the default way people interact with AI - not through typing, but through natural conversation enhanced by visual understanding. We see a future where voice-first, context-aware AI assistants help people be more productive, learn faster, and solve problems more effectively.