Yell To Bob

🌟 Inspiration

In the age of content creators and founder-led brands, LinkedIn has become the new TikTok for professionals. Founders need to publish valuable content consistently to build their reach and personal brand. But managing social media while running a business is overwhelming: you're juggling Slack messages, drafting posts, and trying to stay on top of communications.

We asked ourselves: What if you could just talk to an AI assistant and have it handle everything for you? That's how Bob was born: a voice-first digital delegate that can post to LinkedIn, manage your messages, generate images for posts, and even help you craft content from your team's Slack discussions.

💡 What it does

Yell To Bob is a real-time voice assistant that acts as your personal social media manager and communication delegate:

🗣️ Voice-First Interaction: Talk naturally to Bob using real-time voice recognition powered by Deepgram Nova-3
📝 LinkedIn Publishing: Create and publish LinkedIn posts with AI-generated content and images
🎨 AI Image Generation: Generate professional images for your posts using Gemini's image capabilities
💬 Slack Integration: Read team channels, summarize discussions, and even create LinkedIn posts based on your team's conversations
🔍 Web Search: Get real-time information from the web to inform your content
📅 Calendar Integration: Check your schedule and create events through voice commands
🧠 Conversation Memory: Recall previous conversations using MongoDB vector search
🐦 X/Twitter Support: Post to X with the same voice-first experience

🛠️ How we built it

Architecture Overview

┌─────────────────┐     ┌─────────────────┐     ┌──────────────────┐
│   React + Vite  │────▶│    LiveKit      │────▶│  Python Backend  │
│   Frontend      │◀────│   Real-time     │◀────│  Multi-Agent     │
│   (Voice UI)    │     │   Audio/Video   │     │  System          │
└─────────────────┘     └─────────────────┘     └──────────────────┘
                                                         │
                        ┌────────────────────────────────┼────────────────────────────────┐
                        │                                │                                │
                        ▼                                ▼                                ▼
              ┌─────────────────┐            ┌─────────────────┐             ┌─────────────────┐
              │   Gemini 2.0    │            │   ElevenLabs    │             │   Stagehand     │
              │   Flash LLM     │            │   TTS           │             │   Browser       │
              └─────────────────┘            └─────────────────┘             │   Automation    │
                                                                             └─────────────────┘
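
To make the flow in the diagram concrete, here is a minimal sketch of a single voice turn passing through speech-to-text, the LLM, and text-to-speech, with a timer around each stage. Every name in it (`run_voice_turn`, `TurnResult`, the `fake_*` stubs) is a hypothetical stand-in rather than the project's actual code, and the stages run one after another, whereas a real deployment would stream audio through LiveKit.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage signatures; the real services would sit behind these.
Transcribe = Callable[[bytes], str]   # speech-to-text stage
Respond = Callable[[str], str]        # LLM / agent stage
Synthesize = Callable[[str], bytes]   # text-to-speech stage


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    reply_audio: bytes
    timings_ms: dict[str, float] = field(default_factory=dict)


def run_voice_turn(audio: bytes, stt: Transcribe, llm: Respond,
                   tts: Synthesize) -> TurnResult:
    """Run one utterance through STT -> LLM -> TTS, timing each stage."""
    timings: dict[str, float] = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return out

    transcript = timed("stt", stt, audio)
    reply_text = timed("llm", llm, transcript)
    reply_audio = timed("tts", tts, reply_text)
    return TurnResult(transcript, reply_text, reply_audio, timings)


# Stub stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"Okay, drafting: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_voice_turn(b"post about our launch", fake_stt, fake_llm, fake_tts)
print(result.reply_text, result.timings_ms)
```

Recording a timing per stage shows which hop dominates the round trip, which is the first thing to know when tuning for latency.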

Key Technologies

Frontend: React + Vite with LiveKit client for real-time voice streaming
Real-time Communication: LiveKit for low-latency audio streaming
Speech-to-Text: Deepgram Nova-3 for accurate real-time transcription
LLM: Google Gemini 2.0 Flash for intent routing and response generation
Text-to-Speech: ElevenLabs for natural-sounding voice responses
Multi-Agent System: Custom Python framework with specialized agents (LinkedIn, Slack, X/Twitter)
Workflow Orchestration: LangGraph for multi-step drafting workflows with user confirmation loops
Browser Automation: Stagehand for automated social media posting
State Management: Redis for cross-agent state sharing
Conversation Memory: MongoDB with vector search for recalling past conversations
Observability: Arize Phoenix for LLM tracing and monitoring
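
As an illustration of how intent routing to specialized agents might look, the sketch below classifies an utterance into a label and dispatches it to a matching agent. It is an assumption-laden toy: `keyword_classifier` stands in for the Gemini call, and `Router`, `LinkedInAgent`, `SlackAgent`, and `FallbackAgent` are hypothetical names, not the project's classes.

```python
from typing import Callable, Protocol


class Agent(Protocol):
    def handle(self, utterance: str) -> str: ...


class LinkedInAgent:
    def handle(self, utterance: str) -> str:
        return f"[linkedin] drafting post from: {utterance}"


class SlackAgent:
    def handle(self, utterance: str) -> str:
        return f"[slack] summarizing for: {utterance}"


class FallbackAgent:
    def handle(self, utterance: str) -> str:
        return "Sorry, I am not sure which service you mean."


def keyword_classifier(utterance: str) -> str:
    """Stand-in for the LLM call: map an utterance to an intent label."""
    text = utterance.lower()
    if "linkedin" in text or "post" in text:
        return "linkedin"
    if "slack" in text or "channel" in text:
        return "slack"
    return "unknown"


class Router:
    """Dispatch each utterance to the agent registered for its intent."""

    def __init__(self, classify: Callable[[str], str],
                 agents: dict[str, Agent], fallback: Agent) -> None:
        self.classify = classify
        self.agents = agents
        self.fallback = fallback

    def route(self, utterance: str) -> str:
        label = self.classify(utterance)
        agent = self.agents.get(label, self.fallback)
        return agent.handle(utterance)


router = Router(
    keyword_classifier,
    {"linkedin": LinkedInAgent(), "slack": SlackAgent()},
    FallbackAgent(),
)
print(router.route("Summarize the #launch channel"))
```

Keeping the classifier behind a plain callable means the keyword rules can be swapped for an LLM call without touching the agents.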

🚧 Challenges we ran into

Real-time Voice Latency: Keeping voice interactions low-latency meant tuning every hop of the audio pipeline: LiveKit transport, STT, LLM, and TTS.
Multi-Agent Coordination: Letting multiple specialized agents (LinkedIn, Slack, X) hand off a conversation without losing context was complex.
Browser Automation Reliability: Automating LinkedIn posting with Stagehand required handling edge cases, session management, and anti-bot measures.
LangGraph Workflow Integration: Implementing multi-step LinkedIn drafting workflows with user confirmation loops in LangGraph required careful state management.
Voice User Experience: We had to design conversation flows in which the AI knows when to ask for confirmation and when to proceed autonomously.
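
The confirmation loop mentioned above can be pictured as a small state machine: a draft waits for the user's reply, then is published, cancelled, or revised. The sketch below is plain Python, not LangGraph's API, and the names (`DraftState`, `Step`, `advance`) and the accepted phrases are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    PUBLISHED = "published"
    CANCELLED = "cancelled"


@dataclass
class DraftState:
    draft: str
    step: Step = Step.AWAITING_CONFIRMATION
    revisions: int = 0


# Hypothetical phrase lists; a real system would likely ask the LLM instead.
YES = {"yes", "post it", "looks good", "go ahead"}
NO = {"no", "cancel", "never mind"}


def advance(state: DraftState, user_reply: str) -> DraftState:
    """Apply one user reply to the draft workflow and return the new state."""
    if state.step is not Step.AWAITING_CONFIRMATION:
        return state  # terminal states ignore further input
    reply = user_reply.strip().lower()
    if reply in YES:
        return DraftState(state.draft, Step.PUBLISHED, state.revisions)
    if reply in NO:
        return DraftState(state.draft, Step.CANCELLED, state.revisions)
    # Anything else counts as a revision request; the LLM would redraft here.
    revised = f"{state.draft} (revised: {user_reply.strip()})"
    return DraftState(revised, Step.AWAITING_CONFIRMATION, state.revisions + 1)


state = DraftState("We shipped v2 today.")
state = advance(state, "make it shorter")
state = advance(state, "Post it")
print(state.step, state.revisions)
```

Nothing is published until the user explicitly approves, and any ambiguous reply keeps the workflow in the confirmation step rather than guessing.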

🏆 Accomplishments that we're proud of

End-to-End Voice Pipeline: Built a complete voice-to-action pipeline that takes a voice command and publishes a LinkedIn post with an AI-generated image.
Intelligent Agent Routing: The system routes conversations to the right specialized agent without users needing to say which service they want.
LangGraph Workflows: Implemented multi-step content creation workflows with built-in user approval steps.
Cross-Agent Memory: Agents can recall and use information from previous conversations using MongoDB vector search.
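
To show the idea behind recalling past conversations by vector similarity, here is a brute-force sketch that ranks stored memories by cosine similarity to a query embedding. It deliberately swaps MongoDB vector search for an in-memory list, and the three-dimensional embeddings are made-up toy values; `Memory`, `cosine`, and `recall` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(query: list[float], memories: list[Memory], k: int = 2) -> list[Memory]:
    """Return the k memories most similar to the query embedding."""
    ranked = sorted(memories, key=lambda m: cosine(query, m.embedding),
                    reverse=True)
    return ranked[:k]


# Toy embeddings chosen by hand; a real system would use an embedding model.
memories = [
    Memory("Discussed the product launch post", [0.9, 0.1, 0.0]),
    Memory("Scheduled a dentist appointment", [0.0, 1.0, 0.0]),
    Memory("Brainstormed launch hashtags", [0.5, 0.5, 0.0]),
]

for m in recall([1.0, 0.0, 0.0], memories):
    print(m.text)
```

A database-backed vector index answers the same query with an approximate search instead of scanning every stored embedding.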

📚 What we learned

The importance of designing voice-first UX – it's fundamentally different from chat interfaces
How to build modular multi-agent systems that can scale to handle different platforms
Real-time audio streaming is challenging but incredibly rewarding when it works
LangGraph is powerful for building stateful, multi-step AI workflows
The value of observability (Arize Phoenix) when debugging complex AI pipelines