# Voice-powered AI Agent for Code Generation and Web Search
A real-time voice interface that converts speech to code, generates pseudocode, and performs web searches using OpenAI Whisper STT and Tavily API.
## Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        AGUI[AG-UI<br/>Port 3003<br/>Voice Interface]
        MIC[Microphone<br/>Audio Capture]
    end
    subgraph "Gateway Layer"
        GW[Gateway<br/>Port 3000/3001<br/>WebSocket + HTTP]
    end
    subgraph "Agent Layer (Mastra)"
        AG[Agent<br/>Port 3002<br/>Voice Processing]
        PL[Planner<br/>Execution Planning]
        SK[Skills Registry<br/>Modular Functions]
    end
    subgraph "External APIs"
        OPENAI[OpenAI Whisper<br/>Speech-to-Text]
        TAVILY[Tavily API<br/>Web Search]
        GPT[OpenAI GPT<br/>Code Generation]
        WANDB[Weights & Biases<br/>Logging & Monitoring]
    end
    MIC --> AGUI
    AGUI --> GW
    GW --> AG
    AG --> PL
    PL --> SK
    SK --> OPENAI
    SK --> TAVILY
    SK --> GPT
    AG --> WANDB
    AG --> GW
    GW --> AGUI
```
- Location: `apps/ui/src/agUI.ts`
- Purpose: Main frontend application orchestrating voice interaction
- Components:
- MicController: Real-time audio capture with VAD (Voice Activity Detection)
- TranscriptPanel: Live speech-to-text display
- PlanPanel: Execution plan visualization
- PushToTalkButton: Voice recording interface
- CitationsDrawer: Web search references
- DiffViewer: Code change visualization
- Features: WebSocket communication, automatic reconnection, real-time UI updates
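The automatic-reconnection behaviour can be sketched as a capped exponential backoff. The function name and constants below are illustrative assumptions, not the actual `apps/ui` implementation:

```typescript
// Illustrative sketch of a capped exponential backoff for WebSocket
// reconnection; base delay and cap are assumed values.
function reconnectDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 30_000,
): number {
  // attempt 0 → 500 ms, 1 → 1 s, 2 → 2 s, ... capped at 30 s
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

A reconnecting client would typically reset the attempt counter on a successful `open` event and schedule the next connect with this delay on `close`.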
- Location: `apps/agent/src/agent/`
- Purpose: Core AI agent runtime framework
- Components:
- AgentRuntime: Orchestrates planning, execution, and event emission
- Planner: Generates logical execution plans from voice transcripts
- MemoryManager: Conversation and pattern storage
- SafetyPolicyEnforcer: Blocks dangerous operations
- AgentEventEmitter: Structured event broadcasting with validation
- Features: Turn lifecycle management, safety-first design, dependency injection
- Location: `packages/skills/src/tavily.ts`
- Purpose: Real-time web search with citation tracking
- Features:
- Search Integration: Direct Tavily API calls with retry logic
- Citation Tracking: Emits `CITATIONS` events with `{url, title}` data
- Error Handling: Graceful handling of rate limits and timeouts
- Result Processing: Maps search results to structured events
- Configuration: `TAVILY_API_KEY`, `TAVILY_TIMEOUT_MS=15000`, `TAVILY_MAX_RESULTS=5`
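The configuration defaults above could be resolved from the environment along these lines; the helper itself is a hypothetical sketch, not the skill's actual code:

```typescript
// Hypothetical config loader mirroring the documented environment
// variables and their defaults (15000 ms timeout, 5 results).
interface TavilyConfig {
  apiKey: string;
  timeoutMs: number;
  maxResults: number;
}

function loadTavilyConfig(
  env: Record<string, string | undefined>,
): TavilyConfig {
  const apiKey = env.TAVILY_API_KEY;
  if (!apiKey) throw new Error("TAVILY_API_KEY is required");
  return {
    apiKey,
    timeoutMs: Number(env.TAVILY_TIMEOUT_MS ?? 15_000),
    maxResults: Number(env.TAVILY_MAX_RESULTS ?? 5),
  };
}
```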
- Integration: Built into the agent runtime for experiment tracking
- Purpose: Monitor agent performance, conversation patterns, and skill execution
- Features:
- Turn Logging: Track conversation turns and outcomes
- Skill Metrics: Monitor success rates and execution times
- Performance Analytics: Voice processing latency and accuracy
- Experiment Tracking: A/B test different configurations
## Prerequisites

- Node.js 18+
- OpenAI API Key
- Tavily API Key
## Quick Start

```bash
# Clone and install
git clone <repo-url>
cd Voice_AG
npm install

# Set environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=your_openai_key
# TAVILY_API_KEY=your_tavily_key

# Build all packages
npm run build

# Start services (3 terminals)
npm run dev:agent    # Port 3002
npm run dev:gateway  # Port 3000/3001
npm run dev:ui       # Port 3003
```

- Open http://localhost:3003
- Allow microphone access
- Click and hold to speak commands like:
- "Generate a bubble sort algorithm in Python"
- "Search for React hooks tutorial"
- "Create a binary search function in JavaScript"
- Real-time Voice Input - OpenAI Whisper STT integration
- Code Generation - Pseudocode and algorithm generation
- Web Search - Tavily-powered search with citations
- Execution Planning - Logical step-by-step task breakdown
- Event-driven Architecture - Real-time UI updates
- Modular Skills - Extensible skill system
| Component | Purpose | Tech Stack |
|---|---|---|
| UI | Voice interface & visualization | HTML5, TypeScript, Web Audio API |
| Gateway | Event routing & session management | Express, WebSocket |
| Agent | Voice processing & orchestration | Node.js, TypeScript |
| Skills | Modular AI functions | OpenAI API, Tavily API |
| Planner | Task decomposition & execution | Custom logic |
```
Voice_AG/
├── apps/
│   ├── agent/      # Voice processing server
│   ├── gateway/    # Event routing gateway
│   └── ui/         # Web interface
├── packages/
│   ├── shared/     # Common types & contracts
│   └── skills/     # AI skill implementations
└── .env            # API keys configuration
```
- 📝 Transcript - Voice input transcription
- 📋 Execution Plan - Logical task steps
- 🔍 Citations - Web search results
- 💻 Code - Generated pseudocode/algorithms
- 🔧 Tool Log - Real-time execution status
## Development

```bash
# Watch mode for development
npm run dev

# Build specific package
cd apps/agent && npm run build

# Run tests
npm test
```

### Environment Variables

```bash
# Required
OPENAI_API_KEY=sk-proj-...
TAVILY_API_KEY=tvly-dev-...

# Optional
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_MAX_TOKENS=1500
TAVILY_SEARCH_DEPTH=basic
```

- Voice Input → UI captures audio
- Transcription → Agent processes with Whisper
- Planning → Planner creates execution steps
- Execution → Skills execute tasks (search, generate)
- Results → Events flow back to UI for display
```
User speaks → PushToTalkButton.handlePress()
        ↓
MicController.startRecording()
        ↓
Captures audio (16kHz, PCM, 40ms frames)
        ↓
Voice Activity Detection (VAD)
        ↓
AudioChannel.sendAudioFrame(frame)
        ↓
WebSocket message to Gateway
{ type: 'audio', payload: { audioBuffer: [...] } }
```
```
VoiceGateway.handleMessage(clientId, data)
        ↓
Parse JSON and route by type
        ↓
handleAudio(clientId, payload)
        ↓
Validate authentication
        ↓
agentClient.sendAudio(audioBuffer, sessionId)
        ↓
Forward to Agent via HTTP POST
        ↓
Receive agent events
        ↓
broadcastEvent(event) to all UI clients
```
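The "parse JSON and route by type" step can be sketched as a small dispatcher. The message shapes below are illustrative assumptions based on the flow above, not the gateway's actual contract:

```typescript
// Hypothetical sketch of gateway message routing by the `type` field.
// Unparseable or unrecognised messages fall through to "unknown".
type GatewayMessage =
  | { type: "audio"; payload: { audioBuffer: number[] } }
  | { type: "auth"; payload: { token: string } };

function routeMessage(raw: string): GatewayMessage["type"] | "unknown" {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return "unknown"; // malformed JSON never reaches a handler
  }
  const type = (msg as { type?: string }).type;
  return type === "audio" || type === "auth" ? type : "unknown";
}
```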
```
┌─────────────────────────────────────────────┐
│           A. SPEECH-TO-TEXT                 │
│   OpenAI Whisper API                        │
│   Audio → finalTranscript                   │
│   Emit: TRANSCRIPT_PARTIAL, TRANSCRIPT_FINAL│
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│           B. AGENT RUNTIME                  │
│   VoiceAGAgent.processTurn()                │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 1: PLANNING                          │
│   Planner.createPlan(transcript)            │
│   • Analyze transcript intent               │
│   • Generate PlanSteps:                     │
│     - needsWebSearch? → tavily.search       │
│     - needsCodeGeneration? → pseudocode     │
│     - needsUrlReading? → url.reader         │
│   • Emit: PLAN_UPDATE                       │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 2: EXECUTION                         │
│   executePlanSteps(plan)                    │
│   For each step:                            │
│   1. SafetyPolicyEnforcer.validateInput()   │
│   2. SkillsRegistry.getSkill(tool)          │
│   3. Execute skill with timeout             │
│   4. Emit: TOOL_START, TOOL_RESULT          │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 3: SKILL EXECUTION                   │
│   • TavilySearchSkill                       │
│   • UrlReaderSkill                          │
│   • PseudocodeGenerationSkill               │
│   Emit: CITATIONS, DIFF_DRAFT (if applicable)│
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 4: RESPONSE COMPOSITION              │
│   composeFinalAnswer(transcript, results)   │
│   • Aggregate successful results            │
│   • Add citations if available              │
│   • Emit: SPEAK                             │
└─────────────────────────────────────────────┘
        ↓
┌─────────────────────────────────────────────┐
│   Step 5: PERSISTENCE & COMPLETION          │
│   • MemoryManager.storeTurn(turnData)       │
│   • WandB.log(metrics)                      │
│   • Emit: DONE                              │
└─────────────────────────────────────────────┘
```
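Since each `PlanStep` declares `dependencies`, the execution phase must run steps only after everything they depend on has finished. A minimal sketch of such an ordering pass (the real runtime may interleave or parallelise steps differently):

```typescript
// Illustrative dependency-respecting ordering of plan steps.
// Only the fields needed for ordering are modelled here.
interface PlanStep {
  id: string;
  dependencies: string[];
}

function orderSteps(steps: PlanStep[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = steps;
  while (remaining.length > 0) {
    // A step is ready once all of its dependencies have completed.
    const ready = remaining.filter((s) =>
      s.dependencies.every((d) => done.has(d)),
    );
    if (ready.length === 0) throw new Error("circular dependency in plan");
    for (const s of ready) {
      done.add(s.id);
      order.push(s.id);
    }
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return order;
}
```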
- OpenAI Whisper STT: Audio → Text transcription
- Tavily Search: Web search with citations
- OpenAI GPT-3.5: Code generation and pseudocode
- Weights & Biases: Logging and metrics tracking
```
Agent Events → Gateway → UI Components

Event Types:
• TRANSCRIPT_PARTIAL → TranscriptPanel (real-time text)
• TRANSCRIPT_FINAL   → TranscriptPanel (confirmed text)
• PLAN_UPDATE        → PlanPanel (execution steps)
• TOOL_START         → ToolsLog (skill started)
• TOOL_RESULT        → ToolsLog (skill completed)
• CITATIONS          → CitationsDrawer (search results)
• DIFF_DRAFT         → DiffViewer (code changes)
• SPEAK              → Audio output (TTS)
• DONE               → Complete session
• ERROR              → Error display
```
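The routing table above maps naturally onto a lookup; the component names match the UI panels listed earlier, but the mapping function itself is an illustrative sketch:

```typescript
// Sketch of event-to-component routing; unrecognised events are
// assumed to surface in the error display.
const EVENT_TARGETS: Record<string, string> = {
  TRANSCRIPT_PARTIAL: "TranscriptPanel",
  TRANSCRIPT_FINAL: "TranscriptPanel",
  PLAN_UPDATE: "PlanPanel",
  TOOL_START: "ToolsLog",
  TOOL_RESULT: "ToolsLog",
  CITATIONS: "CitationsDrawer",
  DIFF_DRAFT: "DiffViewer",
};

function targetFor(eventType: string): string {
  return EVENT_TARGETS[eventType] ?? "ErrorDisplay";
}
```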
Location: `packages/skills/src/tavily.ts`
```
// Execution Flow
1. Emit TOOL_START('tavily.search', { query, k })
2. callTavilyAPIWithRetry()
   ├─> fetch('https://api.tavily.com/search', {
   │     method: 'POST',
   │     headers: { Authorization: 'Bearer ${apiKey}' },
   │     body: JSON.stringify({
   │       query: input.query,
   │       search_depth: 'basic' | 'advanced',
   │       max_results: k
   │     })
   │   })
   ├─> Retry logic: 3 attempts with exponential backoff
   ├─> Timeout: 15 seconds
   └─> Error handling: Rate limits, timeouts, server errors
3. Parse response → { query, results[], answer }
4. Emit TOOL_RESULT('tavily.search', result, 'ok')
5. Emit CITATIONS([{ url, title, anchor }])
6. Return SkillResult { success: true, data: {...} }
```

Features:
- ✅ Automatic retry with exponential backoff
- ✅ Request timeout protection
- ✅ Rate limit detection and handling
- ✅ Citation extraction from results
- ✅ Sensitive data redaction in events
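The retry-with-exponential-backoff pattern used in step 2 can be sketched generically; attempt counts and delays below are illustrative, not the skill's exact values:

```typescript
// Generic retry helper: re-run `fn` up to `attempts` times, doubling
// the delay between failures. The final error is rethrown.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // 200 ms, 400 ms, 800 ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

A real implementation would likely also special-case HTTP 429 (rate limit) responses and abort on non-retryable errors.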
Location: `packages/skills/src/pseudocode.ts`
```
// Execution Flow
1. Emit TOOL_START('pseudocode.generate', { request, language })
2. Call OpenAI GPT-3.5 API
   ├─> Model: gpt-3.5-turbo
   ├─> Max Tokens: 1500
   ├─> Temperature: 0.7
   └─> Prompt: Generate {language} pseudocode for {request}
3. Parse AI response → { code, explanation, steps }
4. Emit TOOL_RESULT('pseudocode.generate', result, 'ok')
5. Return SkillResult {
     success: true,
     data: {
       code: 'generated pseudocode',
       explanation: 'step-by-step explanation',
       language: 'Python'
     }
   }
```

Features:
- ✅ Multi-language support (Python, JavaScript, Java, C++, TypeScript)
- ✅ Step-by-step explanation generation
- ✅ Algorithm complexity analysis
- ✅ Best practices suggestions
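Filling the `{language}`/`{request}` prompt template from step 2 might look like the following. This builder is hypothetical; the skill's actual prompt text is not shown in this document:

```typescript
// Hypothetical prompt builder mirroring the template
// "Generate {language} pseudocode for {request}".
function buildPseudocodePrompt(request: string, language = "Python"): string {
  return [
    `Generate ${language} pseudocode for: ${request}.`,
    "Include a step-by-step explanation and note the time complexity.",
  ].join("\n");
}
```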
Location: `packages/skills/src/urlReader.ts`
```
// Execution Flow
1. Emit TOOL_START('url.reader', { url, maxBytes })
2. Fetch URL content
   ├─> fetch(url, { timeout: 10s })
   ├─> Check content-type (text/html, application/json, etc.)
   └─> Read response with size limits
3. Parse and extract relevant content
   ├─> HTML: Extract text, remove scripts/styles
   ├─> JSON: Pretty print structure
   └─> Text: Return as-is
4. Emit TOOL_RESULT('url.reader', { content, metadata })
5. Return SkillResult { success: true, data: {...} }
```

Features:
- ✅ Multiple content type support
- ✅ HTML parsing and cleanup
- ✅ Size limits for safety
- ✅ Timeout protection
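The HTML branch of step 3 ("extract text, remove scripts/styles") can be sketched naively with regular expressions; a production reader would use a proper HTML parser instead:

```typescript
// Naive HTML cleanup sketch: drop <script>/<style> blocks, strip the
// remaining tags, and collapse whitespace. Illustrative only.
function extractText(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ") // remove code blocks
    .replace(/<[^>]+>/g, " ")                         // strip remaining tags
    .replace(/\s+/g, " ")                             // collapse whitespace
    .trim();
}
```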
```
// Audio Frame
AudioFrame {
  samples: Int16Array,   // PCM audio data
  energy: number,        // Audio energy level (0-1)
  timestamp: number      // Frame timestamp
}

// Execution Plan
ExecutionPlan {
  id: string,
  steps: PlanStep[],
  totalEstimatedDuration: number
}

PlanStep {
  id: string,
  description: string,
  tool: string,          // e.g., 'tavily.search'
  parameters: object,
  dependencies: string[],
  estimatedDuration: number
}

// Skill Result
SkillResult {
  success: boolean,
  data?: any,
  error?: string,
  citations?: Array<{ url: string, title?: string }>
}

// Turn Result
TurnResult {
  success: boolean,
  finalAnswer: string,
  duration: number,
  error?: AgentError
}
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
MIT License - see LICENSE file for details.
Built with ❤️ for the AI development community