## Inspiration
We spend most of our day in front of a computer, but interacting with AI assistants still means switching context — opening a browser tab, typing a prompt, waiting,
copying the response back. We wanted something that felt as natural as talking to a coworker: press a key, speak, and get an answer right where you're working. No
browser, no tab switching, no copy-paste. We were also motivated by privacy — most voice assistants ship your audio to the cloud. We wanted transcription to stay
entirely on-device.
## What it does
WisprClaw is a macOS menu bar voice assistant. You double-tap the Command key (or click the menu bar icon), speak, and get an AI agent response in a floating popup overlay — all without leaving whatever app you're in.
Under the hood, it captures microphone audio, transcribes it locally using OpenAI's Whisper model, optionally compresses the transcript with LLMLingua to reduce token cost, and sends it to an OpenClaw AI agent over a persistent WebSocket connection. The response appears in a frosted-glass HUD popup that auto-dismisses after 30 seconds. The entire voice-to-response pipeline runs with a single hotkey press.
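The stages above compose into a single linear pipeline. As a sketch (all function names here are illustrative stand-ins, not the real implementation), the flow per hotkey press is:

```python
def transcribe_locally(audio: bytes) -> str:
    # Stand-in for the on-device Whisper call (hypothetical).
    return audio.decode()

def maybe_compress(text: str, enabled: bool = True) -> str:
    # Stand-in for optional LLMLingua compression (hypothetical).
    return " ".join(text.split()) if enabled else text

def send_to_agent(text: str) -> str:
    # Stand-in for the OpenClaw WebSocket round trip (hypothetical).
    return f"agent({text})"

def handle_hotkey(audio: bytes) -> str:
    """Voice-to-response pipeline, in order: transcribe on-device,
    optionally compress the transcript, then query the agent."""
    return send_to_agent(maybe_compress(transcribe_locally(audio)))
```

The point of the composition is that each stage can be swapped or disabled (e.g. compression off) without touching the others.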
## How we built it
The system has three decoupled layers:
- Swift macOS app — A native menu bar application built with AppKit and SwiftUI, using Swift Package Manager with zero external dependencies. It handles audio recording (AVAudioEngine), a global double-tap Command hotkey (NSEvent monitors), a frosted-glass response popup (NSPanel + NSVisualEffectView), and a Settings UI with @AppStorage persistence.
- Python transcription gateway — A local FastAPI server running OpenAI's Whisper model for speech-to-text. It optionally post-processes transcripts with Microsoft's LLMLingua-2 for token compression (~40% reduction), auto-detecting the best compute device (MPS on Apple Silicon, CUDA, or CPU fallback).
- OpenClaw WebSocket client — Implements the OpenClaw Gateway Protocol v3 with Curve25519 device identity signing, nonce-based challenge/response authentication, and a persistent connection that stays open across requests and automatically reconnects on failure.
We used a Swift actor (MessageBridge) to safely bridge URLSessionWebSocketTask's callback-based API with async/await, routing multiplexed WebSocket responses to
the correct awaiting continuation by request ID.
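The request-ID routing can be sketched as a Python analogue of that actor (the real code is a Swift actor around URLSessionWebSocketTask; the names and shapes below are illustrative only): each request registers a future keyed by its ID, and the single receive loop resolves whichever future matches the incoming response.

```python
import asyncio
import itertools

class MessageBridge:
    """Conceptual Python analogue of the Swift MessageBridge actor.
    Requests park on a future keyed by request ID; the receive loop
    dispatches each inbound message to the matching future."""

    def __init__(self) -> None:
        self._pending: dict[int, asyncio.Future] = {}
        self._ids = itertools.count(1)

    def register(self) -> tuple[int, asyncio.Future]:
        # Called by the request side before sending.
        req_id = next(self._ids)
        fut = asyncio.get_running_loop().create_future()
        self._pending[req_id] = fut
        return req_id, fut

    def dispatch(self, req_id: int, payload: str) -> None:
        # Called by the receive loop for every inbound message.
        fut = self._pending.pop(req_id, None)
        if fut is not None and not fut.done():
            fut.set_result(payload)

async def demo() -> str:
    bridge = MessageBridge()
    rid, fut = bridge.register()
    bridge.dispatch(rid, "agent reply")  # simulate a multiplexed response
    return await fut
```

Because all mutation goes through one object (an actor in Swift, a single event loop here), responses arriving in any order still wake the right caller.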
## Challenges we ran into
WebSocket race condition — The OpenClaw server sends a challenge event immediately on connection. If our code wasn't listening yet, the nonce was silently dropped and the app deadlocked. We solved it by buffering the challenge in the message bridge so it's never lost regardless of timing.
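The fix boils down to decoupling arrival from consumption with a queue, so an event that fires before anyone is listening is held rather than dropped. A minimal sketch of that idea (names hypothetical, not the actual bridge code):

```python
import asyncio

class EventBuffer:
    """Buffers server events that arrive before a listener attaches,
    so an early 'challenge' nonce is never lost."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue[str] = asyncio.Queue()

    def on_message(self, event: str) -> None:
        # Arrival side: always enqueue, even with no listener yet.
        self._queue.put_nowait(event)

    async def next_event(self) -> str:
        # Listener side: returns immediately if something is buffered.
        return await self._queue.get()

async def demo() -> str:
    buf = EventBuffer()
    buf.on_message("challenge:nonce-123")  # server fires on connect
    # Listener attaches "late" but still sees the buffered challenge.
    return await buf.next_event()
```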
macOS menu bar app activation quirks — In an accessory (menu-bar-only) app, calling setActivationPolicy(.regular) and immediately presenting a window in the same run loop tick silently fails. We had to dispatch window presentation asynchronously to give macOS a tick to process the policy change.

LLMLingua output format inconsistency — Different LLMLingua versions and API methods return results in different formats (a raw string, or dicts with varying key names). We built a normalizer that tries multiple known keys and falls back gracefully to the original text.
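The normalizer pattern is simple to sketch (the specific key names below are assumptions about LLMLingua's output shapes, not a verified list):

```python
def normalize_compression_result(result, original: str) -> str:
    """Accept the output shapes seen across LLMLingua versions:
    a raw string, or a dict under one of several known keys.
    Falls back to the original transcript if nothing matches."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        # Hypothetical candidate keys; real versions may differ.
        for key in ("compressed_prompt", "compressed_text", "prompt"):
            value = result.get(key)
            if isinstance(value, str) and value:
                return value
    return original  # graceful fallback: never lose the transcript
```

The fallback matters: a failed compression should degrade to the uncompressed transcript, never to an empty prompt.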
Per-request WebSocket overhead — Each voice command was opening a fresh TCP + WebSocket connection, doing a full cryptographic handshake, then tearing it down — adding 300–500ms of latency. We refactored to a persistent connection with automatic reconnection, cutting subsequent requests down to just the agent call.
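The refactor's shape is: connect (and pay the handshake) once, reuse the connection across requests, and reconnect only when a send fails. A hedged sketch, with `connect` standing in for the real dial-plus-handshake coroutine:

```python
import asyncio

class PersistentConnection:
    """Sketch of the persistent-connection refactor. `connect` is a
    hypothetical coroutine returning an object with an async send()."""

    def __init__(self, connect) -> None:
        self._connect = connect
        self._conn = None

    async def request(self, payload: str):
        if self._conn is None:
            self._conn = await self._connect()  # handshake paid once
        try:
            return await self._conn.send(payload)
        except ConnectionError:
            self._conn = None                    # drop dead connection
            self._conn = await self._connect()   # transparent reconnect
            return await self._conn.send(payload)

async def demo() -> int:
    handshakes = 0

    class FakeConn:
        async def send(self, p: str) -> str:
            return f"ok:{p}"

    async def connect() -> FakeConn:
        nonlocal handshakes
        handshakes += 1  # count how often we pay the handshake cost
        return FakeConn()

    pc = PersistentConnection(connect)
    await pc.request("a")
    await pc.request("b")
    return handshakes  # two requests, one handshake
```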
## Accomplishments that we're proud of
- True local-first privacy — Audio never leaves the machine. Whisper runs entirely on-device, so there's zero cloud dependency for transcription.
- Sub-second repeat interactions — The persistent WebSocket connection means the second voice command onwards skips the entire handshake, making the experience feel instant.
- Zero external Swift dependencies — The entire macOS app is built on system frameworks only (AppKit, SwiftUI, CryptoKit, AVFoundation). No CocoaPods, no SPM dependencies, no bloat.
- Seamless UX — Double-tap Command to talk, floating popup with the answer, auto-dismiss. It stays out of your way until you need it.
- LLMLingua integration — Compressing verbose voice transcripts before they hit the agent reduces token costs and improves response latency, and it can be toggled live from Settings without restarting anything.
## What we learned
- Async protocol design requires defensive buffering. Never assume message ordering in a WebSocket protocol — always buffer events that might arrive before the listener is ready.
- macOS accessory apps have subtle UI constraints. Window presentation, activation policies, and focus management all behave differently when the app has no dock icon. You have to work with the run loop, not against it.
- Persistent connections are worth the complexity. The jump from per-request to persistent WebSocket felt like a big refactor, but the latency improvement was immediately noticeable and the reconnection logic is straightforward once the lifecycle is clear.
- Local ML inference is practical on Apple Silicon. Running Whisper base + LLMLingua-2 on an M-series Mac via MPS is fast enough for real-time voice workflows — no GPU server needed.
- Decoupling pays off early. Keeping the Swift app, Python gateway, and agent protocol as separate layers meant we could iterate on each independently — swap Whisper models, tune compression rates, or change the agent backend without touching the other pieces.