# Voice-powered AI Agent for Code Generation and Web Search
A real-time voice interface that converts speech to code, generates pseudocode, and performs web searches using OpenAI Whisper STT and Tavily API.
## Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        AGUI[AG-UI<br/>Port 3003<br/>Voice Interface]
        MIC[Microphone<br/>Audio Capture]
    end
    subgraph "Gateway Layer"
        GW[Gateway<br/>Port 3000/3001<br/>WebSocket + HTTP]
    end
    subgraph "Agent Layer (Mastra)"
        AG[Agent<br/>Port 3002<br/>Voice Processing]
        PL[Planner<br/>Execution Planning]
        SK[Skills Registry<br/>Modular Functions]
    end
    subgraph "External APIs"
        OPENAI[OpenAI Whisper<br/>Speech-to-Text]
        TAVILY[Tavily API<br/>Web Search]
        GPT[OpenAI GPT<br/>Code Generation]
        WANDB[Weights & Biases<br/>Logging & Monitoring]
    end
    MIC --> AGUI
    AGUI --> GW
    GW --> AG
    AG --> PL
    PL --> SK
    SK --> OPENAI
    SK --> TAVILY
    SK --> GPT
    AG --> WANDB
    AG --> GW
    GW --> AGUI
```
- Location: `apps/ui/src/agUI.ts`
- Purpose: Main frontend application orchestrating voice interaction
- Components:
- MicController: Real-time audio capture with VAD (Voice Activity Detection)
- TranscriptPanel: Live speech-to-text display
- PlanPanel: Execution plan visualization
- PushToTalkButton: Voice recording interface
- CitationsDrawer: Web search references
- DiffViewer: Code change visualization
- Features: WebSocket communication, automatic reconnection, real-time UI updates
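The automatic-reconnection behaviour can be sketched as a capped exponential backoff. The function name and constants below are illustrative assumptions, not the actual `apps/ui` implementation:

```typescript
// Illustrative sketch of a capped exponential backoff for WebSocket
// reconnection; base delay and cap are assumed values.
function reconnectDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 30_000,
): number {
  // attempt 0 → 500 ms, 1 → 1 s, 2 → 2 s, ... capped at 30 s
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

A reconnecting client would typically reset the attempt counter on a successful `open` event and schedule the next connect with this delay on `close`.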
- Location: `apps/agent/src/agent/`
- Purpose: Core AI agent runtime framework
- Components:
- AgentRuntime: Orchestrates planning, execution, and event emission
- Planner: Generates logical execution plans from voice transcripts
- MemoryManager: Conversation and pattern storage
- SafetyPolicyEnforcer: Blocks dangerous operations
- AgentEventEmitter: Structured event broadcasting with validation
- Features: Turn lifecycle management, safety-first design, dependency injection
- Location: `packages/skills/src/tavily.ts`
- Purpose: Real-time web search with citation tracking
- Features:
- Search Integration: Direct Tavily API calls with retry logic
- Citation Tracking: Emits `CITATIONS` events with `{url, title}` data
- Error Handling: Graceful handling of rate limits and timeouts
- Result Processing: Maps search results to structured events
- Configuration: `TAVILY_API_KEY`, `TAVILY_TIMEOUT_MS=15000`, `TAVILY_MAX_RESULTS=5`
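The configuration defaults above could be resolved from the environment along these lines; the helper itself is a hypothetical sketch, not the skill's actual code:

```typescript
// Hypothetical config loader mirroring the documented environment
// variables and their defaults (15000 ms timeout, 5 results).
interface TavilyConfig {
  apiKey: string;
  timeoutMs: number;
  maxResults: number;
}

function loadTavilyConfig(
  env: Record<string, string | undefined>,
): TavilyConfig {
  const apiKey = env.TAVILY_API_KEY;
  if (!apiKey) throw new Error("TAVILY_API_KEY is required");
  return {
    apiKey,
    timeoutMs: Number(env.TAVILY_TIMEOUT_MS ?? 15_000),
    maxResults: Number(env.TAVILY_MAX_RESULTS ?? 5),
  };
}
```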
- Integration: Built into the agent runtime for experiment tracking
- Purpose: Monitor agent performance, conversation patterns, and skill execution
- Features:
- Turn Logging: Track conversation turns and outcomes
- Skill Metrics: Monitor success rates and execution times
- Performance Analytics: Voice processing latency and accuracy
- Experiment Tracking: A/B test different configurations
## Prerequisites

- Node.js 18+
- OpenAI API Key
- Tavily API Key
## Quick Start

```bash
# Clone and install
git clone <repo-url>
cd Voice_AG
npm install

# Set environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=your_openai_key
# TAVILY_API_KEY=your_tavily_key

# Build all packages
npm run build

# Start services (3 terminals)
npm run dev:agent    # Port 3002
npm run dev:gateway  # Port 3000/3001
npm run dev:ui       # Port 3003
```

- Open http://localhost:3003
- Allow microphone access
- Click and hold to speak commands like:
- "Generate a bubble sort algorithm in Python"
- "Search for React hooks tutorial"
- "Create a binary search function in JavaScript"
- Real-time Voice Input - OpenAI Whisper STT integration
- Code Generation - Pseudocode and algorithm generation
- Web Search - Tavily-powered search with citations
- Execution Planning - Logical step-by-step task breakdown
- Event-driven Architecture - Real-time UI updates
- Modular Skills - Extensible skill system
| Component | Purpose | Tech Stack |
|---|---|---|
| UI | Voice interface & visualization | HTML5, TypeScript, Web Audio API |
| Gateway | Event routing & session management | Express, WebSocket |
| Agent | Voice processing & orchestration | Node.js, TypeScript |
| Skills | Modular AI functions | OpenAI API, Tavily API |
| Planner | Task decomposition & execution | Custom logic |
```
Voice_AG/
├── apps/
│   ├── agent/      # Voice processing server
│   ├── gateway/    # Event routing gateway
│   └── ui/         # Web interface
├── packages/
│   ├── shared/     # Common types & contracts
│   └── skills/     # AI skill implementations
└── .env            # API keys configuration
```
- 📝 Transcript - Voice input transcription
- 📋 Execution Plan - Logical task steps
- 🔍 Citations - Web search results
- 💻 Code - Generated pseudocode/algorithms
- 🔧 Tool Log - Real-time execution status
## Development

```bash
# Watch mode for development
npm run dev

# Build specific package
cd apps/agent && npm run build

# Run tests
npm test
```

### Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-proj-...
TAVILY_API_KEY=tvly-dev-...

# Optional
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_MAX_TOKENS=1500
TAVILY_SEARCH_DEPTH=basic
```

- Voice Input → UI captures audio
- Transcription → Agent processes with Whisper
- Planning → Planner creates execution steps
- Execution → Skills execute tasks (search, generate)
- Results → Events flow back to UI for display
```
User speaks → PushToTalkButton.handlePress()
        ↓
MicController.startRecording()
        ↓
Captures audio (16kHz, PCM, 40ms frames)
        ↓
Voice Activity Detection (VAD)
        ↓
AudioChannel.sendAudioFrame(frame)
        ↓
WebSocket message to Gateway
{ type: 'audio', payload: { audioBuffer: [...] } }
```
```
VoiceGateway.handleMessage(clientId, data)
        ↓
Parse JSON and route by type
        ↓
handleAudio(clientId, payload)
        ↓
Validate authentication
        ↓
agentClient.sendAudio(audioBuffer, sessionId)
        ↓
Forward to Agent via HTTP POST
        ↓
Receive agent events
        ↓
broadcastEvent(event) to all UI clients
```
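The "parse JSON and route by type" step can be sketched as a small dispatcher. The message shapes below are illustrative assumptions based on the flow above, not the gateway's actual contract:

```typescript
// Hypothetical sketch of gateway message routing by the `type` field.
// Unparseable or unrecognised messages fall through to "unknown".
type GatewayMessage =
  | { type: "audio"; payload: { audioBuffer: number[] } }
  | { type: "auth"; payload: { token: string } };

function routeMessage(raw: string): GatewayMessage["type"] | "unknown" {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return "unknown"; // malformed JSON never reaches a handler
  }
  const type = (msg as { type?: string }).type;
  return type === "audio" || type === "auth" ? type : "unknown";
}
```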
```
┌─────────────────────────────────────────────┐
│           A. SPEECH-TO-TEXT                 │
│   OpenAI Whisper API                        │
│   Audio → finalTranscript                   │
│   Emit: TRANSCRIPT_PARTIAL, TRANSCRIPT_FINAL│
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│           B. AGENT RUNTIME                  │
│   VoiceAGAgent.processTurn()                │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 1: PLANNING                          │
│   Planner.createPlan(transcript)            │
│   • Analyze transcript intent               │
│   • Generate PlanSteps:                     │
│     - needsWebSearch? → tavily.search       │
│     - needsCodeGeneration? → pseudocode     │
│     - needsUrlReading? → url.reader         │
│   • Emit: PLAN_UPDATE                       │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 2: EXECUTION                         │
│   executePlanSteps(plan)                    │
│   For each step:                            │
│   1. SafetyPolicyEnforcer.validateInput()   │
│   2. SkillsRegistry.getSkill(tool)          │
│   3. Execute skill with timeout             │
│   4. Emit: TOOL_START, TOOL_RESULT          │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 3: SKILL EXECUTION                   │
│   • TavilySearchSkill                       │
│   • UrlReaderSkill                          │
│   • PseudocodeGenerationSkill               │
│   Emit: CITATIONS, DIFF_DRAFT (if applicable)│
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 4: RESPONSE COMPOSITION              │
│   composeFinalAnswer(transcript, results)   │
│   • Aggregate successful results            │
│   • Add citations if available              │
│   • Emit: SPEAK                             │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 5: PERSISTENCE & COMPLETION          │
│   • MemoryManager.storeTurn(turnData)       │
│   • WandB.log(metrics)                      │
│   • Emit: DONE                              │
└─────────────────────────────────────────────┘
```
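Since each `PlanStep` declares `dependencies`, the execution phase must run steps only after everything they depend on has finished. A minimal sketch of such an ordering pass (the real runtime may interleave or parallelise steps differently):

```typescript
// Illustrative dependency-respecting ordering of plan steps.
// Only the fields needed for ordering are modelled here.
interface PlanStep {
  id: string;
  dependencies: string[];
}

function orderSteps(steps: PlanStep[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = steps;
  while (remaining.length > 0) {
    // A step is ready once all of its dependencies have completed.
    const ready = remaining.filter((s) =>
      s.dependencies.every((d) => done.has(d)),
    );
    if (ready.length === 0) throw new Error("circular dependency in plan");
    for (const s of ready) {
      done.add(s.id);
      order.push(s.id);
    }
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return order;
}
```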
- OpenAI Whisper STT: Audio → Text transcription
- Tavily Search: Web search with citations
- OpenAI GPT-3.5: Code generation and pseudocode
- Weights & Biases: Logging and metrics tracking
```
Agent Events → Gateway → UI Components

Event Types:
• TRANSCRIPT_PARTIAL → TranscriptPanel (real-time text)
• TRANSCRIPT_FINAL   → TranscriptPanel (confirmed text)
• PLAN_UPDATE        → PlanPanel (execution steps)
• TOOL_START         → ToolsLog (skill started)
• TOOL_RESULT        → ToolsLog (skill completed)
• CITATIONS          → CitationsDrawer (search results)
• DIFF_DRAFT         → DiffViewer (code changes)
• SPEAK              → Audio output (TTS)
• DONE               → Complete session
• ERROR              → Error display
```
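The routing table above maps naturally onto a lookup; the component names match the UI panels listed earlier, but the mapping function itself is an illustrative sketch:

```typescript
// Sketch of event-to-component routing; unrecognised events are
// assumed to surface in the error display.
const EVENT_TARGETS: Record<string, string> = {
  TRANSCRIPT_PARTIAL: "TranscriptPanel",
  TRANSCRIPT_FINAL: "TranscriptPanel",
  PLAN_UPDATE: "PlanPanel",
  TOOL_START: "ToolsLog",
  TOOL_RESULT: "ToolsLog",
  CITATIONS: "CitationsDrawer",
  DIFF_DRAFT: "DiffViewer",
};

function targetFor(eventType: string): string {
  return EVENT_TARGETS[eventType] ?? "ErrorDisplay";
}
```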
Location: `packages/skills/src/tavily.ts`
```
// Execution Flow
1. Emit TOOL_START('tavily.search', { query, k })
2. callTavilyAPIWithRetry()
   ├─> fetch('https://api.tavily.com/search', {
   │     method: 'POST',
   │     headers: { Authorization: 'Bearer ${apiKey}' },
   │     body: JSON.stringify({
   │       query: input.query,
   │       search_depth: 'basic' | 'advanced',
   │       max_results: k
   │     })
   │   })
   ├─> Retry logic: 3 attempts with exponential backoff
   ├─> Timeout: 15 seconds
   └─> Error handling: Rate limits, timeouts, server errors
3. Parse response → { query, results[], answer }
4. Emit TOOL_RESULT('tavily.search', result, 'ok')
5. Emit CITATIONS([{ url, title, anchor }])
6. Return SkillResult { success: true, data: {...} }
```

Features:
- ✅ Automatic retry with exponential backoff
- ✅ Request timeout protection
- ✅ Rate limit detection and handling
- ✅ Citation extraction from results
- ✅ Sensitive data redaction in events
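The retry-with-exponential-backoff pattern used in step 2 can be sketched generically; attempt counts and delays below are illustrative, not the skill's exact values:

```typescript
// Generic retry helper: re-run `fn` up to `attempts` times, doubling
// the delay between failures. The final error is rethrown.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // 200 ms, 400 ms, 800 ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

A real implementation would likely also special-case HTTP 429 (rate limit) responses and abort on non-retryable errors.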
Location: `packages/skills/src/pseudocode.ts`
```
// Execution Flow
1. Emit TOOL_START('pseudocode.generate', { request, language })
2. Call OpenAI GPT-3.5 API
   ├─> Model: gpt-3.5-turbo
   ├─> Max Tokens: 1500
   ├─> Temperature: 0.7
   └─> Prompt: Generate {language} pseudocode for {request}
3. Parse AI response → { code, explanation, steps }
4. Emit TOOL_RESULT('pseudocode.generate', result, 'ok')
5. Return SkillResult {
     success: true,
     data: {
       code: 'generated pseudocode',
       explanation: 'step-by-step explanation',
       language: 'Python'
     }
   }
```

Features:
- ✅ Multi-language support (Python, JavaScript, Java, C++, TypeScript)
- ✅ Step-by-step explanation generation
- ✅ Algorithm complexity analysis
- ✅ Best practices suggestions
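Filling the `{language}`/`{request}` prompt template from step 2 might look like the following. This builder is hypothetical; the skill's actual prompt text is not shown in this document:

```typescript
// Hypothetical prompt builder mirroring the template
// "Generate {language} pseudocode for {request}".
function buildPseudocodePrompt(request: string, language = "Python"): string {
  return [
    `Generate ${language} pseudocode for: ${request}.`,
    "Include a step-by-step explanation and note the time complexity.",
  ].join("\n");
}
```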
Location: `packages/skills/src/urlReader.ts`
```
// Execution Flow
1. Emit TOOL_START('url.reader', { url, maxBytes })
2. Fetch URL content
   ├─> fetch(url, { timeout: 10s })
   ├─> Check content-type (text/html, application/json, etc.)
   └─> Read response with size limits
3. Parse and extract relevant content
   ├─> HTML: Extract text, remove scripts/styles
   ├─> JSON: Pretty print structure
   └─> Text: Return as-is
4. Emit TOOL_RESULT('url.reader', { content, metadata })
5. Return SkillResult { success: true, data: {...} }
```

Features:
- ✅ Multiple content type support
- ✅ HTML parsing and cleanup
- ✅ Size limits for safety
- ✅ Timeout protection
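The HTML branch of step 3 ("extract text, remove scripts/styles") can be sketched naively with regular expressions; a production reader would use a proper HTML parser instead:

```typescript
// Naive HTML cleanup sketch: drop <script>/<style> blocks, strip the
// remaining tags, and collapse whitespace. Illustrative only.
function extractText(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ") // remove code blocks
    .replace(/<[^>]+>/g, " ")                         // strip remaining tags
    .replace(/\s+/g, " ")                             // collapse whitespace
    .trim();
}
```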
```
// Audio Frame
AudioFrame {
  samples: Int16Array,   // PCM audio data
  energy: number,        // Audio energy level (0-1)
  timestamp: number      // Frame timestamp
}

// Execution Plan
ExecutionPlan {
  id: string,
  steps: PlanStep[],
  totalEstimatedDuration: number
}

PlanStep {
  id: string,
  description: string,
  tool: string,          // e.g., 'tavily.search'
  parameters: object,
  dependencies: string[],
  estimatedDuration: number
}

// Skill Result
SkillResult {
  success: boolean,
  data?: any,
  error?: string,
  citations?: Array<{ url: string, title?: string }>
}

// Turn Result
TurnResult {
  success: boolean,
  finalAnswer: string,
  duration: number,
  error?: AgentError
}
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
MIT License - see LICENSE file for details.
Built with ❤️ for the AI development community