Voice AG 🎤

Voice-powered AI Agent for Code Generation and Web Search

A real-time voice interface that turns spoken requests into generated code and pseudocode and runs web searches, built on OpenAI Whisper speech-to-text and the Tavily search API.

🏗️ Architecture

graph TB
    subgraph "Frontend Layer"
        AGUI[AG-UI<br/>Port 3003<br/>Voice Interface]
        MIC[Microphone<br/>Audio Capture]
    end
    
    subgraph "Gateway Layer"
        GW[Gateway<br/>Port 3000/3001<br/>WebSocket + HTTP]
    end
    
    subgraph "Agent Layer (Mastra)"
        AG[Agent<br/>Port 3002<br/>Voice Processing]
        PL[Planner<br/>Execution Planning]
        SK[Skills Registry<br/>Modular Functions]
    end
    
    subgraph "External APIs"
        OPENAI[OpenAI Whisper<br/>Speech-to-Text]
        TAVILY[Tavily API<br/>Web Search]
        GPT[OpenAI GPT<br/>Code Generation]
        WANDB[Weights & Biases<br/>Logging & Monitoring]
    end
    
    MIC --> AGUI
    AGUI --> GW
    GW --> AG
    AG --> PL
    PL --> SK
    SK --> OPENAI
    SK --> TAVILY
    SK --> GPT
    AG --> WANDB
    AG --> GW
    GW --> AGUI

🔧 Technology Integration

AG-UI (Voice Interface)

  • Location: apps/ui/src/agUI.ts
  • Purpose: Main frontend application orchestrating voice interaction
  • Components:
    • MicController: Real-time audio capture with VAD (Voice Activity Detection)
    • TranscriptPanel: Live speech-to-text display
    • PlanPanel: Execution plan visualization
    • PushToTalkButton: Voice recording interface
    • CitationsDrawer: Web search references
    • DiffViewer: Code change visualization
  • Features: WebSocket communication, automatic reconnection, real-time UI updates

Mastra (Agent Framework)

  • Location: apps/agent/src/agent/
  • Purpose: Core AI agent runtime framework
  • Components:
    • AgentRuntime: Orchestrates planning, execution, and event emission
    • Planner: Generates logical execution plans from voice transcripts
    • MemoryManager: Conversation and pattern storage
    • SafetyPolicyEnforcer: Blocks dangerous operations
    • AgentEventEmitter: Structured event broadcasting with validation
  • Features: Turn lifecycle management, safety-first design, dependency injection

Tavily API (Web Search)

  • Location: packages/skills/src/tavily.ts
  • Purpose: Real-time web search with citation tracking
  • Features:
    • Search Integration: Direct Tavily API calls with retry logic
    • Citation Tracking: Emits CITATIONS events with {url, title} data
    • Error Handling: Graceful handling of rate limits and timeouts
    • Result Processing: Maps search results to structured events
  • Configuration: TAVILY_API_KEY, TAVILY_TIMEOUT_MS=15000, TAVILY_MAX_RESULTS=5
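
The configuration values above can be resolved from the environment with their documented defaults. A minimal sketch, assuming a hypothetical `loadTavilyConfig` helper (not the repo's actual code):

```typescript
// Illustrative: resolve Tavily settings from env vars, falling back to the
// defaults documented above (TAVILY_TIMEOUT_MS=15000, TAVILY_MAX_RESULTS=5).
interface TavilyConfig {
  apiKey: string;
  timeoutMs: number;
  maxResults: number;
}

function loadTavilyConfig(env: Record<string, string | undefined>): TavilyConfig {
  const apiKey = env.TAVILY_API_KEY;
  if (!apiKey) throw new Error("TAVILY_API_KEY is required");
  return {
    apiKey,
    timeoutMs: Number(env.TAVILY_TIMEOUT_MS ?? 15000),
    maxResults: Number(env.TAVILY_MAX_RESULTS ?? 5),
  };
}
```

Failing fast on a missing key keeps the error at startup rather than at the first search.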

Weights & Biases (Logging & Monitoring)

  • Integration: Built into the agent runtime for experiment tracking
  • Purpose: Monitor agent performance, conversation patterns, and skill execution
  • Features:
    • Turn Logging: Track conversation turns and outcomes
    • Skill Metrics: Monitor success rates and execution times
    • Performance Analytics: Voice processing latency and accuracy
    • Experiment Tracking: A/B test different configurations

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • OpenAI API Key
  • Tavily API Key

Setup

# Clone and install
git clone <repo-url>
cd Voice_AG
npm install

# Set environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=your_openai_key
# TAVILY_API_KEY=your_tavily_key

# Build all packages
npm run build

# Start services (3 terminals)
npm run dev:agent    # Port 3002
npm run dev:gateway  # Port 3000/3001  
npm run dev:ui       # Port 3003

Usage

  1. Open http://localhost:3003
  2. Allow microphone access
  3. Click and hold to speak commands like:
    • "Generate a bubble sort algorithm in Python"
    • "Search for React hooks tutorial"
    • "Create a binary search function in JavaScript"

🎯 Features

✅ Implemented

  • Real-time Voice Input - OpenAI Whisper STT integration
  • Code Generation - Pseudocode and algorithm generation
  • Web Search - Tavily-powered search with citations
  • Execution Planning - Logical step-by-step task breakdown
  • Event-driven Architecture - Real-time UI updates
  • Modular Skills - Extensible skill system

🔧 Core Components

Component | Purpose                            | Tech Stack
----------|------------------------------------|----------------------------------
UI        | Voice interface & visualization    | HTML5, TypeScript, Web Audio API
Gateway   | Event routing & session management | Express, WebSocket
Agent     | Voice processing & orchestration   | Node.js, TypeScript
Skills    | Modular AI functions               | OpenAI API, Tavily API
Planner   | Task decomposition & execution     | Custom logic

📁 Project Structure

Voice_AG/
├── apps/
│   ├── agent/          # Voice processing server
│   ├── gateway/        # Event routing gateway  
│   └── ui/            # Web interface
├── packages/
│   ├── shared/        # Common types & contracts
│   └── skills/        # AI skill implementations
└── .env               # API keys configuration

🎨 UI Panels

  • 📝 Transcript - Voice input transcription
  • 📋 Execution Plan - Logical task steps
  • 🔍 Citations - Web search results
  • 💻 Code - Generated pseudocode/algorithms
  • 🔧 Tool Log - Real-time execution status

🛠️ Development

# Watch mode for development
npm run dev

# Build specific package
cd apps/agent && npm run build

# Run tests
npm test

🔑 Environment Variables

# Required
OPENAI_API_KEY=sk-proj-...
TAVILY_API_KEY=tvly-dev-...

# Optional
OPENAI_MODEL=gpt-3.5-turbo
OPENAI_MAX_TOKENS=1500
TAVILY_SEARCH_DEPTH=basic

📊 Event Flow

  1. Voice Input → UI captures audio
  2. Transcription → Agent processes with Whisper
  3. Planning → Planner creates execution steps
  4. Execution → Skills execute tasks (search, generate)
  5. Results → Events flow back to UI for display

🔄 Complete End-to-End Flow

Layer 1: UI (Port 3003)

User speaks → PushToTalkButton.handlePress()
             ↓
         MicController.startRecording()
             ↓
    Captures audio (16kHz, PCM, 40ms frames)
             ↓
    Voice Activity Detection (VAD)
             ↓
    AudioChannel.sendAudioFrame(frame)
             ↓
    WebSocket message to Gateway
    { type: 'audio', payload: { audioBuffer: [...] } }
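
The 16 kHz / 40 ms framing above works out to 640 samples per frame. A minimal sketch of the capture-side conversion, with illustrative names (the repo's MicController may differ):

```typescript
// Illustrative: convert Float32 mic samples to Int16 PCM and compute a
// normalized RMS energy value for VAD, per the framing described above.
const SAMPLE_RATE = 16000;
const FRAME_MS = 40;
const FRAME_SAMPLES = (SAMPLE_RATE * FRAME_MS) / 1000; // 640 samples

interface AudioFrame {
  samples: Int16Array; // PCM audio data
  energy: number;      // RMS energy in 0..1
  timestamp: number;   // frame timestamp (ms)
}

function toFrame(float32: Float32Array, timestamp: number): AudioFrame {
  const samples = new Int16Array(float32.length);
  let sumSquares = 0;
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to [-1, 1]
    samples[i] = Math.round(s * 32767);
    sumSquares += s * s;
  }
  const energy = Math.sqrt(sumSquares / float32.length);
  return { samples, energy, timestamp };
}
```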

Layer 2: Gateway (Port 3000/3001)

VoiceGateway.handleMessage(clientId, data)
             ↓
    Parse JSON and route by type
             ↓
    handleAudio(clientId, payload)
             ↓
    Validate authentication
             ↓
    agentClient.sendAudio(audioBuffer, sessionId)
             ↓
    Forward to Agent via HTTP POST
             ↓
    Receive agent events
             ↓
    broadcastEvent(event) to all UI clients
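
The parse-and-route step above can be sketched as a type-discriminated switch. Message shapes and handler names here are assumptions based on this README, not the gateway's actual code:

```typescript
// Illustrative: parse a raw WebSocket message and route by its `type` field,
// as in VoiceGateway.handleMessage above. The "ping" variant is hypothetical.
type GatewayMessage =
  | { type: "audio"; payload: { audioBuffer: number[] } }
  | { type: "ping"; payload: Record<string, never> };

function routeMessage(raw: string): string {
  const msg: GatewayMessage = JSON.parse(raw);
  switch (msg.type) {
    case "audio":
      // would forward to agentClient.sendAudio(...)
      return `handleAudio(${msg.payload.audioBuffer.length} samples)`;
    case "ping":
      return "handlePing()";
    default:
      throw new Error("Unknown message type");
  }
}
```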

Layer 3: Agent (Port 3002)

┌─────────────────────────────────────────────┐
│ A. SPEECH-TO-TEXT                            │
│    OpenAI Whisper API                        │
│    Audio → finalTranscript                   │
│    Emit: TRANSCRIPT_PARTIAL, TRANSCRIPT_FINAL│
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ B. AGENT RUNTIME                             │
│    VoiceAGAgent.processTurn()                │
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ Step 1: PLANNING                             │
│   Planner.createPlan(transcript)             │
│   • Analyze transcript intent                │
│   • Generate PlanSteps:                      │
│     - needsWebSearch? → tavily.search        │
│     - needsCodeGeneration? → pseudocode      │
│     - needsUrlReading? → url.reader          │
│   • Emit: PLAN_UPDATE                        │
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ Step 2: EXECUTION                            │
│   executePlanSteps(plan)                     │
│   For each step:                             │
│   1. SafetyPolicyEnforcer.validateInput()    │
│   2. SkillsRegistry.getSkill(tool)           │
│   3. Execute skill with timeout              │
│   4. Emit: TOOL_START, TOOL_RESULT           │
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ Step 3: SKILL EXECUTION                      │
│   • TavilySearchSkill                        │
│   • UrlReaderSkill                           │
│   • PseudocodeGenerationSkill                │
│   Emit: CITATIONS, DIFF_DRAFT (if applicable)│
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ Step 4: RESPONSE COMPOSITION                 │
│   composeFinalAnswer(transcript, results)    │
│   • Aggregate successful results             │
│   • Add citations if available               │
│   • Emit: SPEAK                              │
└─────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────┐
│ Step 5: PERSISTENCE & COMPLETION             │
│   • MemoryManager.storeTurn(turnData)        │
│   • WandB.log(metrics)                       │
│   • Emit: DONE                               │
└─────────────────────────────────────────────┘

Layer 4: External APIs

  • OpenAI Whisper STT: Audio → Text transcription
  • Tavily Search: Web search with citations
  • OpenAI GPT-3.5: Code generation and pseudocode
  • Weights & Biases: Logging and metrics tracking

Event Flow (Back to UI)

Agent Events → Gateway → UI Components

Event Types:
• TRANSCRIPT_PARTIAL    → TranscriptPanel (real-time text)
• TRANSCRIPT_FINAL      → TranscriptPanel (confirmed text)
• PLAN_UPDATE           → PlanPanel (execution steps)
• TOOL_START            → ToolsLog (skill started)
• TOOL_RESULT           → ToolsLog (skill completed)
• CITATIONS             → CitationsDrawer (search results)
• DIFF_DRAFT            → DiffViewer (code changes)
• SPEAK                 → Audio output (TTS)
• DONE                  → Complete session
• ERROR                 → Error display
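
The event-to-panel fan-out above lends itself to a lookup table. A sketch, with target names taken from this README (the UI's actual dispatch may differ):

```typescript
// Illustrative: map each agent event type to the UI component that renders it.
type AgentEventType =
  | "TRANSCRIPT_PARTIAL" | "TRANSCRIPT_FINAL" | "PLAN_UPDATE"
  | "TOOL_START" | "TOOL_RESULT" | "CITATIONS" | "DIFF_DRAFT"
  | "SPEAK" | "DONE" | "ERROR";

const eventTargets: Record<AgentEventType, string> = {
  TRANSCRIPT_PARTIAL: "TranscriptPanel",
  TRANSCRIPT_FINAL: "TranscriptPanel",
  PLAN_UPDATE: "PlanPanel",
  TOOL_START: "ToolsLog",
  TOOL_RESULT: "ToolsLog",
  CITATIONS: "CitationsDrawer",
  DIFF_DRAFT: "DiffViewer",
  SPEAK: "AudioOutput",
  DONE: "Session",
  ERROR: "ErrorDisplay",
};

function targetFor(type: AgentEventType): string {
  return eventTargets[type];
}
```

A `Record<AgentEventType, string>` makes the compiler reject any event type left unmapped.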

🎯 Skill Execution Detail

Tavily Search Skill

Location: packages/skills/src/tavily.ts

// Execution Flow
1. Emit TOOL_START('tavily.search', { query, k })
2. callTavilyAPIWithRetry()
   ├─> fetch('https://api.tavily.com/search', {
        method: 'POST',
         headers: { Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          query: input.query,
          search_depth: 'basic' | 'advanced',
          max_results: k
        })
      })
   ├─> Retry logic: 3 attempts with exponential backoff
   ├─> Timeout: 15 seconds
   └─> Error handling: Rate limits, timeouts, server errors
3. Parse response → { query, results[], answer }
4. Emit TOOL_RESULT('tavily.search', result, 'ok')
5. Emit CITATIONS([{ url, title, anchor }])
6. Return SkillResult { success: true, data: {...} }

Features:

  • ✅ Automatic retry with exponential backoff
  • ✅ Request timeout protection
  • ✅ Rate limit detection and handling
  • ✅ Citation extraction from results
  • ✅ Sensitive data redaction in events
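
The "3 attempts with exponential backoff" behavior above can be sketched as a generic wrapper; the delay schedule (250 ms base, doubling per attempt) is an assumption, not the repo's actual values:

```typescript
// Illustrative: retry an async operation up to `attempts` times, doubling
// the delay between attempts, and rethrow the last error if all fail.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // 250 ms, 500 ms, 1000 ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Wrapping the Tavily `fetch` call in `withRetry` keeps transient rate-limit and timeout errors out of the happy path.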

Pseudocode Generation Skill

Location: packages/skills/src/pseudocode.ts

// Execution Flow
1. Emit TOOL_START('pseudocode.generate', { request, language })
2. Call OpenAI GPT-3.5 API
   ├─> Model: gpt-3.5-turbo
   ├─> Max Tokens: 1500
   ├─> Temperature: 0.7
   └─> Prompt: Generate {language} pseudocode for {request}
3. Parse AI response → { code, explanation, steps }
4. Emit TOOL_RESULT('pseudocode.generate', result, 'ok')
5. Return SkillResult {
     success: true,
     data: {
       code: 'generated pseudocode',
       explanation: 'step-by-step explanation',
       language: 'Python'
     }
   }

Features:

  • ✅ Multi-language support (Python, JavaScript, Java, C++, TypeScript)
  • ✅ Step-by-step explanation generation
  • ✅ Algorithm complexity analysis
  • ✅ Best practices suggestions
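
The request assembly above can be sketched as follows; the prompt wording is an assumption, not the repo's actual prompt, while the model parameters match those documented in this README:

```typescript
// Illustrative: build the generation prompt and chat parameters for the
// pseudocode skill described above.
interface PseudocodeRequest {
  request: string;  // e.g., "bubble sort algorithm"
  language: string; // e.g., "Python"
}

function buildPrompt({ request, language }: PseudocodeRequest): string {
  return [
    `Generate ${language} pseudocode for: ${request}`,
    "Include a step-by-step explanation and note the time complexity.",
  ].join("\n");
}

// Parameters documented in this README (OPENAI_MODEL, OPENAI_MAX_TOKENS).
const chatParams = {
  model: "gpt-3.5-turbo",
  max_tokens: 1500,
  temperature: 0.7,
};
```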

URL Reader Skill

Location: packages/skills/src/urlReader.ts

// Execution Flow
1. Emit TOOL_START('url.reader', { url, maxBytes })
2. Fetch URL content
   ├─> fetch(url) with a 10 s abort timeout (fetch has no timeout option; e.g. via AbortController)
   ├─> Check content-type (text/html, application/json, etc.)
   └─> Read response with size limits
3. Parse and extract relevant content
   ├─> HTML: Extract text, remove scripts/styles
   ├─> JSON: Pretty print structure
   └─> Text: Return as-is
4. Emit TOOL_RESULT('url.reader', { content, metadata })
5. Return SkillResult { success: true, data: {...} }

Features:

  • ✅ Multiple content type support
  • ✅ HTML parsing and cleanup
  • ✅ Size limits for safety
  • ✅ Timeout protection
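
The content handling above (strip scripts/styles, drop tags, pretty-print JSON) can be sketched with a single dispatch function. Regex-based HTML cleanup is a simplification of whatever parsing the real skill does:

```typescript
// Illustrative: extract readable content by content-type, per the parsing
// steps described above.
function extractContent(body: string, contentType: string): string {
  if (contentType.includes("application/json")) {
    return JSON.stringify(JSON.parse(body), null, 2); // pretty print
  }
  if (contentType.includes("text/html")) {
    return body
      .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "") // remove scripts/styles
      .replace(/<[^>]+>/g, " ")                        // drop remaining tags
      .replace(/\s+/g, " ")                            // collapse whitespace
      .trim();
  }
  return body; // plain text: return as-is
}
```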

Key Data Structures

// Audio Frame
interface AudioFrame {
  samples: Int16Array;  // PCM audio data
  energy: number;       // Audio energy level (0-1)
  timestamp: number;    // Frame timestamp
}

// Execution Plan
interface ExecutionPlan {
  id: string;
  steps: PlanStep[];
  totalEstimatedDuration: number;
}

interface PlanStep {
  id: string;
  description: string;
  tool: string;          // e.g., 'tavily.search'
  parameters: object;
  dependencies: string[];
  estimatedDuration: number;
}

// Skill Result
interface SkillResult {
  success: boolean;
  data?: any;
  error?: string;
  citations?: Array<{ url: string; title?: string }>;
}

// Turn Result
interface TurnResult {
  success: boolean;
  finalAnswer: string;
  duration: number;
  error?: AgentError;
}
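
Since each PlanStep carries a `dependencies: string[]` field, execution order can be derived with a simple topological sort. A sketch assuming the plan is acyclic (the real runtime may handle this differently):

```typescript
// Illustrative: order steps so each runs only after its dependencies.
interface Step {
  id: string;
  dependencies: string[];
}

function executionOrder(steps: Step[]): string[] {
  const order: string[] = [];
  const done = new Set<string>();
  const pending = [...steps];
  while (pending.length > 0) {
    // Pick any step whose dependencies have all completed.
    const idx = pending.findIndex((s) => s.dependencies.every((d) => done.has(d)));
    if (idx === -1) throw new Error("Cycle or missing dependency in plan");
    const step = pending.splice(idx, 1)[0];
    done.add(step.id);
    order.push(step.id);
  }
  return order;
}
```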

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

MIT License - see LICENSE file for details.


Built with ❤️ for the AI development community
