Skip to content

fizznix/local-voice-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🤖 Local Voice Agent

A real-time voice conversational AI assistant that listens, understands, thinks, and speaks. Built for Apple Silicon Macs with local-first processing.

🎯 Features

  • 🎤 Real-time Speech Recognition - MLX Whisper (local, GPU-accelerated on Apple Silicon)
  • 🧠 Intelligent Responses - Choose between:
    • Groq LLM (ultra-fast, cloud-based, free tier)
    • Local LLaMA via llama.cpp (fully offline, runs locally)
  • 🔊 Natural Speech Synthesis - Kokoro TTS or macOS fallback
  • ⏱️ Performance Metrics - Real-time timing for each step (recording, transcribe, LLM, TTS)
  • 💬 Conversation Memory - Multi-turn conversation with context
  • 🛡️ Error Resilience - Automatic fallbacks and retry logic

📋 Tech Stack

Component Technology Notes
STT MLX Whisper Local GPU-accelerated, no auth needed
LLM Groq or llama.cpp Switch between cloud/local
TTS Kokoro or macOS say Fallback to built-in TTS
Audio sounddevice, scipy Cross-platform audio capture/playback
Framework Python 3.12+ Async-ready, minimal dependencies

🚀 Quick Start

Prerequisites

  • macOS (tested on Apple Silicon / M-series)
  • Python 3.10+
  • pip package manager

Installation

  1. Clone or navigate to the project:

    cd /path/to/voice-agent
  2. Create virtual environment:

    python -m venv .venv
    source .venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt

    Or manually:

    pip install numpy sounddevice soundfile scipy mlx-whisper groq requests python-dotenv
  4. Install TTS (optional but recommended):

    # Kokoro TTS (better quality)
    pip install kokoro-onnx
    
    # Or use macOS built-in 'say' command (fallback, no install needed)
  5. Install local LLM (optional):

    # For offline LLM, install llama.cpp
    brew install llama.cpp

Environment Setup

Create a .env file in the project root:

# Groq API Key (get from https://console.groq.com)
GROQ_API_KEY=your_groq_api_key_here

# Hugging Face Token (optional, for model access)
HF_TOKEN=your_hf_token_here

⚙️ Configuration

Edit agent.py to customize:

# ─── CONFIG ───
SAMPLE_RATE = 16000              # Audio sample rate
SILENCE_THRESHOLD = 0.01         # Silence detection threshold
SILENCE_DURATION = 1.5           # Seconds to wait before ending recording
GROQ_MODEL = "meta-llama/..."   # Groq model ID

# LLM Selection
USE_LOCAL_LLM = False            # True = local llama.cpp, False = Groq
LOCAL_LLM_URL = "http://127.0.0.1:8080"  # llama.cpp endpoint

LLM Options

Option 1: Groq (Recommended for Quick Start)

USE_LOCAL_LLM = False

Setup:

  1. Get free API key from console.groq.com
  2. Add to .env: GROQ_API_KEY=your_key
  3. No additional setup needed!

Option 2: Local LLaMA via llama.cpp

USE_LOCAL_LLM = True
LOCAL_LLM_URL = "http://127.0.0.1:8080"

Setup:

  1. Start llama.cpp server:
    # Using ollama
    ollama serve
    
    # Or direct llama.cpp
    ./llama-server -m path/to/model.gguf -p "port 8080"
  2. Verify it's running: curl http://127.0.0.1:8080/v1/models

▶️ Running the Agent

Basic Run

source .venv/bin/activate
python agent.py

Output Example

==================================================
🤖 Voice Bot Ready! (Ctrl+C to quit)
   STT: MLX Whisper (mlx-community/whisper-tiny)
   LLM: Groq (meta-llama/llama-4-scout-17b-16e-instruct)
   TTS: Kokoro / macOS say
==================================================
🎤 Listening... (speak now)
📝 Captured 1.6s of audio
🗣️  You: Hello, how are you?
   ⏱️  Recording: 1.80s | Transcribe: 0.35s
🤖 Bot: I'm doing great, thanks for asking! How can I help you today?
   ⏱️  LLM: 0.82s
   🔊 Using Kokoro TTS
   ⏱️  TTS: 2.34s
   ⏱️  Total: 5.31s
--------------------------------------------------

Stop the Agent

Press Ctrl+C to exit gracefully.

📊 Performance Metrics

The agent prints real-time timing for each step:

  • Recording - Time to capture audio until silence detected
  • Transcribe - STT conversion (audio → text)
  • LLM - Time to generate response
  • TTS - Text-to-speech synthesis
  • Total - Complete conversation cycle

🔧 Troubleshooting

❌ "401 Unauthorized" - Hugging Face Auth

Problem: MLX Whisper can't download model

Solutions:

  1. Set HF_TOKEN in .env with your Hugging Face token
  2. Or run huggingface-cli login once
  3. Or use a smaller model: WHISPER_MODEL = "mlx-community/whisper-tiny"

❌ Audio Hardware Error

Problem: PortAudioError: Error starting stream

Solutions:

  1. Check microphone is connected: System Settings → Sound → Input
  2. Restart the agent (automatic retry included)
  3. Try different input device (check sounddevice.query_devices())

❌ Kokoro TTS Not Found

Problem: Kokoro failed: [Errno 2] No such file or directory

Solution:

  • Install: pip install kokoro-onnx
  • Or let it fall back to macOS say command automatically

❌ Can't Connect to llama.cpp

Problem: Cannot connect to llama.cpp at http://127.0.0.1:8080

Solutions:

  1. Start the llama.cpp server (see Configuration section)
  2. Check URL is correct and port 8080 is open
  3. Run curl http://127.0.0.1:8080/v1/models to verify

❌ No Speech Detected

Problem: Keeps saying "(no speech detected, listening again...)"

Solutions:

  1. Adjust SILENCE_THRESHOLD (make it more sensitive):
    SILENCE_THRESHOLD = 0.005  # Lower = more sensitive
  2. Check microphone volume
  3. Speak louder or closer to mic

📁 Project Structure

voice-agent/
├── agent.py           # Main voice agent logic
├── main.py            # Alternative entry point (optional)
├── README.md          # This file
├── pyproject.toml     # Project metadata
├── .env              # Configuration (create this)
├── .venv/            # Virtual environment
└── requirements.txt   # Python dependencies (create with pip freeze)

🛠️ Development

Generate requirements.txt

source .venv/bin/activate
pip freeze > requirements.txt

Enable Debug Logging

Add to agent.py:

import logging
logging.basicConfig(level=logging.DEBUG)

📝 API References

🎓 How It Works

  1. Listen 🎤 → Records audio until silence detected
  2. Transcribe 📝 → MLX Whisper converts audio to text
  3. Think 🧠 → LLM generates intelligent response
  4. Speak 🔊 → TTS converts response back to speech
  5. Repeat ↩️ → Maintains conversation history

Each step is timed and logged for performance analysis.

⚡ Performance Tips

  • Faster responses: Use Groq instead of local LLM
  • Offline mode: Use local llama.cpp (slower but no cloud)
  • Lower latency: Use whisper-tiny (already set)
  • Better quality: Switch to whisper-small (slower)

📄 License

MIT

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Multi-language support
  • Custom wake words
  • Streaming responses
  • Long-term memory (persistent context)

Made for Apple Silicon Macs 🍎 | Local-first AI 🔐 | Real-time voice

About

A real-time voice conversational AI assistant that listens, understands, thinks, and speaks. Built for Apple Silicon Macs with local-first processing.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages