VoiceLayer

Your AI agent can't hear you. VoiceLayer gives it ears and a voice.

Voice I/O for AI coding assistants. Press F6, speak to Claude Code, get on-device transcription in under 1.5 seconds. Your AI speaks back. Works with any MCP client.

  You ──🎤──> whisper.cpp ──> Claude Code ──> edge-tts ──🔊──> You
         STT (local)           MCP tools         TTS (free)

Local-first. Free. Open-source. No cloud APIs, no API keys, no data leaves your machine. Part of the Golems ecosystem.

Website | Docs | npm

VoiceLayer runs as a persistent singleton daemon on a Unix socket — every Claude session connects through a lightweight socat shim instead of spawning its own process. 11 MCP tools with full ToolAnnotations.

Architecture

                  ┌─────────────────────────────────────┐
                  │         VoiceLayer Daemon            │
                  │     /tmp/voicelayer-mcp.sock         │
                  │                                      │
                  │  MCP JSONRPC ──> Tool Handlers       │
                  │  (Content-Length     ├── voice_speak  │
                  │   framing)          └── voice_ask    │
                  │                                      │
                  │  TTS: edge-tts (retry + 30s timeout) │
                  │  STT: whisper.cpp / Wispr Flow       │
                  │  VAD: Silero ONNX (speech detection)  │
                  │  IPC: Voice Bar ← NDJSON events      │
                  └──────────┬──────────────────────────┘
                             │ Unix socket
              ┌──────────────┼──────────────┐
              │              │              │
         Claude Code    Claude Code    Cursor/Codex
         (socat shim)  (socat shim)   (socat shim)

Why a daemon? The original design spawned a new Bun process per Claude session. With 17+ repos open, that meant 17 competing processes (700+ MB RAM), fighting over one Voice Bar socket, crashing edge-tts with PATH issues, and leaving orphans that never died. The daemon architecture — shipped in PRs #67-72 — replaced all of that with a single process and socat shims.

Metric	Before (spawn-per-session)	After (daemon)
Processes	N per session (17+ typical)	1 daemon + socat shims
RAM	~700 MB (17 x 41 MB)	~50 MB
Orphan cleanup	Manual `pkill`	PID lockfile auto-kills stale
edge-tts failures	Random (PATH, contention)	Retry with 30s hard timeout
voice_ask hang	Up to 300s (5 min!)	30s default + outer guard

Quick Start

# Install from npm
bun add -g voicelayer-mcp

# Prerequisites
brew install sox socat
pip3 install edge-tts
brew install whisper-cpp  # optional — local STT

# Download a whisper model (recommended)
mkdir -p ~/.cache/whisper
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

Or install from source:

git clone https://github.com/EtanHey/voicelayer.git
cd voicelayer && bun install

Start the Daemon

# Option A: LaunchAgent (auto-start on login, auto-restart on crash)
./launchd/install.sh

# Option B: Manual
bun run src/mcp-server-daemon.ts

Configure MCP Clients

Add to your .mcp.json (in any repo where you use Claude Code):

{
  "mcpServers": {
    "voicelayer": {
      "command": "socat",
      "args": ["STDIO", "UNIX-CONNECT:/tmp/voicelayer-mcp.sock"]
    }
  }
}

Or migrate all repos at once:

bash scripts/migrate-to-daemon.sh         # migrates every .mcp.json under ~/Gits
bash scripts/migrate-to-daemon.sh --dry-run  # preview without changes

Grant microphone access to your terminal (macOS: System Settings > Privacy > Microphone).

Voice Tools

Primary tools

Tool	Behavior	Blocking	readOnly	destructive	idempotent
`voice_speak`	TTS with auto-mode (announce/brief/consult/think), replay, toggle	No	false	false	true
`voice_ask`	Speak question + record mic + transcribe response	Yes	false	false	false

Backward-compatible aliases

Alias	Maps to	idempotent
`qa_voice_announce`	`voice_speak(mode='announce')`	true
`qa_voice_brief`	`voice_speak(mode='brief')`	true
`qa_voice_consult`	`voice_speak(mode='consult')`	true
`qa_voice_say`	`voice_speak(mode='announce')`	true
`qa_voice_think`	`voice_speak(mode='think')`	false
`qa_voice_replay`	`voice_speak(replay_index=N)`	true
`qa_voice_toggle`	`voice_speak(enabled=bool)`	true
`qa_voice_converse`	`voice_ask`	false
`qa_voice_ask`	`voice_ask`	false

All 11 tools include MCP ToolAnnotations. No VoiceLayer tools are destructive. All have openWorldHint: false.

How voice_ask Works

Waits for any playing voice_speak audio to finish
Speaks the question via edge-tts (with retry on failure)
Records mic at device native rate, resamples to 16kHz
Silero VAD detects speech onset and silence end
whisper.cpp transcribes locally (~200-400ms on Apple Silicon)
Returns transcription to the AI agent

Reliability Features

PID lockfile (/tmp/voicelayer-mcp.pid): On startup, detects and kills any orphan MCP server from a previous session
edge-tts retry: Health check (cached 60s) + automatic retry with 30s hard timeout per attempt
Outer timeout guard: Promise.race wrapper around the entire voice_ask flow — if anything hangs, returns an error instead of blocking forever
Session booking: Lockfile mutex prevents mic conflicts between concurrent sessions

Recording Controls

Method	How
Stop signal	`touch ~/.local/state/voicelayer/stop-{token}`
VAD silence	Configurable: quick (0.5s), standard (1.5s), thoughtful (2.5s)
Timeout	30s default, configurable 5-3600s per call
Push-to-talk	`press_to_talk: true` — no VAD, stop on signal only

STT Backends

Backend	Type	Latency	Setup
whisper.cpp	Local (default)	~200-400ms	`brew install whisper-cpp` + model download
Wispr Flow	Cloud (fallback)	~500ms + network	Set `QA_VOICE_WISPR_KEY` env var

Auto-detected. Override with QA_VOICE_STT_BACKEND=whisper|wispr|auto.

Voice Bar (macOS)

Floating SwiftUI widget providing visual feedback during voice interactions. Connects to the daemon via NDJSON over /tmp/voicelayer.sock.

Teleprompter with word-level highlighting and auto-scroll
Waveform visualization during recording
Expandable pill UI — collapses to dot after 5s idle
Draggable, position persisted across launches
Global hotkey: Cmd+F6 (hold for push-to-talk, double-tap to toggle hands-free)

cd flow-bar && ./build-app.sh   # Build, codesign, install to /Applications

Hotkey Notes:

Requires Input Monitoring permission (System Settings > Privacy & Security)
Cmd+F6 is chosen to avoid conflicts with VoiceOver (Cmd+F5) — if conflicts occur, the hotkey can be reconfigured via HotkeyManager.configure()
Supports both F6 keyboard modes (function-key and media-key)

Advanced: Voice Cloning

Three-tier TTS engine cascade for cloned voices:

XTTS-v2 fine-tuned (cadence + timbre)
F5-TTS MLX zero-shot (local, no daemon)
Qwen3-TTS daemon (HTTP-based)
edge-tts fallback (always available)

voicelayer extract <youtube-url>   # Extract voice samples
voicelayer clone <name>            # Build voice profile
voicelayer daemon --port 8880      # Run Qwen3-TTS server

Environment Variables

Variable	Default	Description
`QA_VOICE_STT_BACKEND`	`auto`	STT backend: `whisper`, `wispr`, or `auto`
`QA_VOICE_WHISPER_MODEL`	auto-detected	Path to whisper.cpp GGML model
`QA_VOICE_WISPR_KEY`	--	Wispr Flow API key (cloud fallback)
`QA_VOICE_TTS_VOICE`	`en-US-JennyNeural`	edge-tts voice ID
`QA_VOICE_TTS_RATE`	`+0%`	Base speech rate

Testing

bun test   # 536 tests, 1638 assertions, 48 test files

Test coverage includes: MCP protocol framing, tool handlers, TTS synthesis + retry, VAD speech detection, session booking, process lock lifecycle, socket client reconnection, edge-tts health checks, schema validation, Hebrew STT eval baselines, daemon resilience, ToolAnnotations, SSML sanitization, and secure path hardening.

Project Structure

voicelayer/
├── src/                          # TypeScript/Bun (18K lines, 69 files)
│   ├── mcp-server-daemon.ts      # Singleton daemon entry point
│   ├── mcp-server.ts             # Stdio MCP server (legacy)
│   ├── mcp-daemon.ts             # Unix socket server (dual-protocol)
│   ├── mcp-framing.ts            # Content-Length + NDJSON framing
│   ├── mcp-handler.ts            # JSONRPC request router
│   ├── process-lock.ts           # PID lockfile (orphan prevention)
│   ├── handlers.ts               # Tool handler implementations
│   ├── tts.ts                    # Multi-engine TTS with playback queue
│   ├── tts-health.ts             # edge-tts health check + retry
│   ├── input.ts                  # Mic recording + STT pipeline
│   ├── vad.ts                    # Silero VAD (ONNX inference)
│   ├── stt.ts                    # STT backend abstraction
│   ├── socket-client.ts          # Voice Bar IPC (auto-reconnect)
│   ├── session-booking.ts        # Lockfile mutex
│   ├── paths.ts                  # Centralized path constants
│   └── __tests__/                # 536 tests across 48 files
├── flow-bar/                     # SwiftUI macOS app (1.9K lines, 9 files)
│   ├── Sources/VoiceBar/         # App source
│   └── Tests/                    # Swift tests
├── scripts/
│   ├── migrate-to-daemon.sh      # Batch .mcp.json migration
│   └── edge-tts-words.py         # Word-level TTS with timestamps
├── launchd/                      # macOS LaunchAgent auto-start
├── models/                       # Silero VAD ONNX model
└── package.json                  # v2.0.0

Platform Support

Platform	TTS	STT	Recording	Voice Bar
macOS	edge-tts + afplay	whisper.cpp (CoreML)	sox	SwiftUI app
Linux	edge-tts + mpv/ffplay	whisper.cpp	sox	--

Part of Golems

VoiceLayer is one of three open-source MCP servers in the Golems ecosystem:

Server	What it does	Tools
BrainLayer	Persistent memory for AI agents — knowledge graph + hybrid search	12
VoiceLayer	Voice I/O — local STT, neural TTS, F6 push-to-talk	11
cmuxLayer	Terminal orchestration — spawn panes, read screens, coordinate agents	22

Pair with BrainLayer to remember voice conversations across sessions.

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 182 Commits
.github		.github
docs		docs
eval		eval
flow-bar		flow-bar
landing		landing
launchd		launchd
models		models
scripts		scripts
site		site
src		src
tests		tests
voice_coach		voice_coach
.claude.local.md.example		.claude.local.md.example
.gitignore		.gitignore
.mcp.json.example		.mcp.json.example
AGENTS.md		AGENTS.md
BUGBOT_CODE_REVIEW.md		BUGBOT_CODE_REVIEW.md
BUGBOT_REVIEW.md		BUGBOT_REVIEW.md
BUGBOT_REVIEW_AUDIO_QUEUE.md		BUGBOT_REVIEW_AUDIO_QUEUE.md
BUGBOT_REVIEW_CMD_F5_HOTKEY.md		BUGBOT_REVIEW_CMD_F5_HOTKEY.md
BUGBOT_REVIEW_CMD_F6.md		BUGBOT_REVIEW_CMD_F6.md
BUGBOT_REVIEW_DAEMON_HARDENING.md		BUGBOT_REVIEW_DAEMON_HARDENING.md
BUGBOT_REVIEW_HOTKEY_FIX.md		BUGBOT_REVIEW_HOTKEY_FIX.md
BUGBOT_REVIEW_POLISH_PASS.md		BUGBOT_REVIEW_POLISH_PASS.md
BUGBOT_REVIEW_SUMMARY.md		BUGBOT_REVIEW_SUMMARY.md
BUGBOT_REVIEW_SUMMARY_CMD_F5.md		BUGBOT_REVIEW_SUMMARY_CMD_F5.md
BUGBOT_SUMMARY.md		BUGBOT_SUMMARY.md
BUG_REVIEW.md		BUG_REVIEW.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.details.md		CLAUDE.details.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
REVIEW_COMPLETE.md		REVIEW_COMPLETE.md
ROADMAP.md		ROADMAP.md
bun.lock		bun.lock
com.golems.tts-daemon.plist		com.golems.tts-daemon.plist
greptile.json		greptile.json
macroscope.md		macroscope.md
mkdocs.yml		mkdocs.yml
package.json		package.json
server.json		server.json
tsconfig.json		tsconfig.json
voice-coach.py		voice-coach.py
voices.json.example		voices.json.example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VoiceLayer

Architecture

Quick Start

Start the Daemon

Configure MCP Clients

Voice Tools

Primary tools

Backward-compatible aliases

How voice_ask Works

Reliability Features

Recording Controls

STT Backends

Voice Bar (macOS)

Advanced: Voice Cloning

Environment Variables

Testing

Project Structure

Platform Support

Part of Golems

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VoiceLayer

Architecture

Quick Start

Start the Daemon

Configure MCP Clients

Voice Tools

Primary tools

Backward-compatible aliases

How voice_ask Works

Reliability Features

Recording Controls

STT Backends

Voice Bar (macOS)

Advanced: Voice Cloning

Environment Variables

Testing

Project Structure

Platform Support

Part of Golems

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages