Driver
When the assistant pauses for user input (design choices, per-PR merge approvals, an ambiguous "which path do you want", and so on), the request is only a text message in the terminal. If the user has stepped away from the keyboard, the conversation stalls silently: the assistant is waiting but gives no signal that it needs attention.
A configurable text-to-speech layer that speaks the question out loud when the assistant ends a turn with a question (Jarvis-from-Iron-Man style) would surface those moments without requiring the user to babysit the terminal. The user still replies via keyboard — no voice input in the initial phase.
This is a quality-of-life improvement for adopters running long sessions (especially the multi-PR launch flows where the assistant fires 10+ "approved?" / "merge X?" / "design call: a/b/c?" prompts over an afternoon).
Scope
Initial phase: single platform (macOS), single TTS engine, conservative trigger heuristic. Cross-platform and higher-quality TTS providers are explicitly out of scope here — they're follow-ups (see Future phases below) once the trigger model is proven.
Mechanism
A new Stop hook at .claude/hooks/voice-prompt-on-pause.sh:
Fires on assistant turn end (Stop event)
Reads the last assistant message from the transcript JSON Claude Code passes to the hook (per the Claude Code hooks spec — the hook receives { transcript_path, ... } on stdin)
If the message ends with a question or matches a "waiting for input" heuristic (see below), runs say -v <voice> -r <rate_wpm> "<message-or-summary>"
Always exits 0 — TTS failure must never block the conversation
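The steps above could be laid out roughly as follows. This is a minimal sketch, not the shipped hook: it assumes jq is installed and that the transcript is a JSON Lines file whose assistant entries look like {"type":"assistant","message":{"content":[{"type":"text","text":"..."}]}}. The TTS_CMD override and all function names are hypothetical, introduced only so the speaking step can be stubbed out.

```shell
#!/bin/sh
# Structural sketch of .claude/hooks/voice-prompt-on-pause.sh.
# Assumptions: jq is available; transcript is JSON Lines (see lead-in).

# Hypothetical override so tests can replace `say` with a stub.
TTS_CMD="${TTS_CMD:-say}"

# Read the hook payload from stdin and print its transcript_path, if any.
read_transcript_path() {
  jq -r '.transcript_path // empty' 2>/dev/null
}

# Print the text parts of the last assistant entry in transcript file $1.
last_assistant_text() {
  jq -rs '[.[] | select(.type == "assistant")] | last
          | .message.content[]? | select(.type == "text") | .text' "$1" 2>/dev/null
}

# Speak $1 with voice $2 at rate $3. Every failure is swallowed so the
# hook can never block the conversation.
speak_safely() {
  [ -n "$1" ] || return 0
  command -v "$TTS_CMD" >/dev/null 2>&1 || return 0
  "$TTS_CMD" -v "${2:-Daniel}" -r "${3:-180}" "$1" >/dev/null 2>&1 || true
  return 0
}

# Entry point: every early-out path exits 0.
main() {
  path=$(read_transcript_path) || exit 0
  [ -r "$path" ] || exit 0
  text=$(last_assistant_text "$path") || exit 0
  # The trigger check and truncation steps are omitted in this sketch.
  speak_safely "$text"
  exit 0
}
```

In a real script, main would be called on the last line. Keeping it as a function here lets the helpers be sourced and exercised without touching stdin.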
Trigger heuristic (initial phase, deliberately conservative)
The hook only speaks if the last assistant message:
Ends with ? (after stripping trailing whitespace + closing markdown formatting), OR
Contains a recognised "Reply with X" / "Approved?" / "Confirm Y" / "a/b/c" / "Reply <token>" pattern in its last paragraph
This skips informational summary messages (which often end with "."), tool-result reports, and progress updates. Initial heuristic prefers false-negatives over false-positives — a missed prompt is annoying; a TTS reading a 200-line tool output is unbearable.
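One way the two trigger rules might be expressed in POSIX shell is sketched below. The function names and the exact PROMPT_RE pattern are assumptions made for illustration. The regular expression is only a guess at the "Reply with X" / "Approved?" / "Confirm Y" / "a/b/c" list and is kept case-sensitive on purpose, so that it errs toward false negatives.

```shell
#!/bin/sh
# Hypothetical pattern list for "waiting for input" phrasing.
PROMPT_RE='(^|[^[:alnum:]])(Reply [^[:space:]]+|Confirm [^[:space:]]+|[Aa]pproved\?|a/b/c)'

# True when $1 ends with "?" once trailing whitespace and closing
# markdown markers (* _ ` ) ]) are stripped.
ends_with_question() {
  trimmed=$(printf '%s' "$1" | tr '\n' ' ' | sed -E 's/[]*_`)[:space:]]+$//')
  case "$trimmed" in
    *\?) return 0 ;;
    *)   return 1 ;;
  esac
}

# Print the last blank-line-separated paragraph read from stdin.
last_paragraph() {
  awk 'BEGIN { RS = "" } { p = $0 } END { print p }'
}

# True when either rule matches: trailing "?" or a prompt pattern
# inside the final paragraph.
should_speak() {
  ends_with_question "$1" && return 0
  printf '%s\n' "$1" | last_paragraph | grep -Eq "$PROMPT_RE"
}
```

A message such as "Is this fine?" followed by a closing summary paragraph does not trigger, because only the end of the message and its final paragraph are inspected.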
What gets spoken
Not the full message — that would read aloud entire markdown tables, code blocks, and glossaries. The hook extracts:
The first sentence of the trailing paragraph that triggered the heuristic, OR
A configurable maximum (default: 200 chars), truncated at sentence boundary
Markdown is stripped before TTS (backticks, asterisks, links, bullets) — say doesn't interpret them well.
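The stripping and truncation described above could look something like the following sketch. Both function names are hypothetical, and the first-sentence extraction is left out. The truncation treats any ".", "?" or "!" as a sentence end, so a version string like "v1.2" would also count as a boundary.

```shell
#!/bin/sh
# Remove leading bullets, turn [text](url) into text, and drop
# backticks and asterisks. Reads stdin, writes stdout.
strip_markdown() {
  sed -E \
    -e 's/^[[:space:]]*[-*+][[:space:]]+//' \
    -e 's/\[([^]]*)\]\([^)]*\)/\1/g' \
    -e 's/[`*]//g'
}

# Print $1 cut to at most $2 characters (default 200). Prefers the last
# sentence terminator inside the budget; otherwise drops the trailing
# (possibly partial) word so the cut never lands mid-word when a space exists.
truncate_at_sentence() {
  printf '%s' "$1" | tr '\n' ' ' | awk -v max="${2:-200}" '{
    if (length($0) <= max) { print; exit }
    head = substr($0, 1, max)
    if (match(head, /.*[.?!]/)) { print substr(head, 1, RLENGTH); exit }
    sub(/ +[^ ]*$/, "", head)
    print head
  }'
}
```

The two helpers compose in a pipeline, for example truncate_at_sentence "$(printf '%s\n' "$paragraph" | strip_markdown)" "$max_chars", where paragraph and max_chars are placeholders for values the hook would already hold.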
Configuration
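Add a voice_prompts section to .claude/project-config.defaults.json; the schema, field reference, and settings wiring are given under Configuration and Wiring at the end of this issue.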
The hook script reads config via the existing _lib-read-config.sh library.
Acceptance Criteria
.claude/hooks/voice-prompt-on-pause.sh exists, exits 0 in all branches, never blocks the conversation
When voice_prompts.enabled = true and the last assistant message ends with ?, the hook runs say -v <voice> -r <rate_wpm> "<truncated message>"
When enabled = false (the shipped default), the hook is a no-op (silent, no say invoked)
When enabled = true and the message does NOT match the trigger heuristic, no TTS — verified by feeding a few non-question turn-ends as test fixtures
Markdown is stripped before TTS, so say never reads backticks, asterisks, or link syntax aloud
Truncation at sentence boundary, not mid-word, respecting max_chars
.claude/project-config.defaults.json updated with the new schema
docs/project-config.md documents the new section
.claude/settings.json Stop-hook entry shipped (the actual hook stays a no-op until enabled, so this is safe)
Test fixtures at .claude/hooks/_tests/voice-prompt-on-pause/ covering: enabled+question, enabled+statement, disabled, malformed transcript JSON
AgDR documenting: TTS provider choice (macOS say vs OpenAI TTS vs ElevenLabs), trigger heuristic vs ML-based question detection, why we're shipping macOS-only first
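The "no say invoked" criteria can be checked without producing any audio by shadowing say on PATH. The sketch below assumes a fixture layout and helper names that are not part of the ticket (make_say_stub, say_call_count, SAY_LOG are all hypothetical).

```shell
#!/bin/sh
# Install a fake `say` in directory $1 that appends its arguments to
# $SAY_LOG: one line per call, assuming the spoken text is single-line.
make_say_stub() {
  mkdir -p "$1" || return 1
  cat > "$1/say" <<'EOF'
#!/bin/sh
printf '%s\n' "$*" >> "${SAY_LOG:?}"
EOF
  chmod +x "$1/say"
}

# Print how many times the stub ran (0 when the log does not exist).
say_call_count() {
  if [ -f "$SAY_LOG" ]; then
    wc -l < "$SAY_LOG" | tr -d ' '
  else
    echo 0
  fi
}

# Example wiring (fixture file name is hypothetical):
#   stub_dir=$(mktemp -d); SAY_LOG="$stub_dir/say.log"; export SAY_LOG
#   make_say_stub "$stub_dir"
#   PATH="$stub_dir:$PATH" .claude/hooks/voice-prompt-on-pause.sh < disabled.json
#   [ "$(say_call_count)" -eq 0 ] || echo "FAIL: say ran while disabled"
```

The same stub covers the positive cases: after an enabled+question fixture, the log should hold exactly one line containing the truncated message and the -v / -r arguments.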
Future phases (NOT in this ticket)
Phase 2 — cross-platform TTS (Linux espeak / spd-say, Windows Add-Type ... SpeechSynthesizer). Single config, OS-detection in the hook.
Phase 3 — high-quality TTS providers (OpenAI TTS, ElevenLabs). New config field provider. AgDR-worthy on cost vs quality.
Phase 4 — voice input. Whisper-based STT, voice-trigger phrases ("approve PR", "merge"). Significant scope; needs its own design.
Phase 5 — per-message overrides (a way for the assistant to mark a message as "speak this" or "skip TTS" inline).
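For Phase 2, the OS detection could be as small as a case statement over the output of uname -s. This is purely illustrative and outside this ticket's scope. The function name is hypothetical, preferring spd-say over espeak on Linux is an assumption, and the Windows branch only names the launcher without guessing at the elided PowerShell invocation.

```shell
#!/bin/sh
# Print the TTS binary for a `uname -s` value, or nothing when the
# platform is unsupported.
tts_command_for() {
  case "$1" in
    Darwin) echo say ;;
    Linux)
      # Assumed preference order: spd-say if present, else espeak.
      if command -v spd-say >/dev/null 2>&1; then
        echo spd-say
      else
        echo espeak
      fi ;;
    MINGW*|MSYS*|CYGWIN*) echo powershell ;;
    *) : ;;
  esac
}
```

A hook would call it as tts_command_for "$(uname -s)" and exit 0 when the result is empty.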
Risks / Dependencies
macOS-only initial phase — adopters on Linux/Windows see no benefit until Phase 2. Acknowledged trade-off; the trigger model + config schema can be designed once and the platform layer added.
Trigger heuristic false positives (a tool result or summary that happens to end with ? would get read aloud). Mitigation: deliberately conservative heuristic, configurable, easy to disable.
Hook overhead on every turn end — Stop hook fires every turn, even when disabled. The disabled path is [ "$(config_get '.voice_prompts.enabled')" = "true" ] || exit 0 — sub-millisecond. Negligible.
TTS sound-output collision with system audio (if the user is on a call when the hook fires, say will speak through the active output). Acceptable side-effect for the initial phase; a per-session opt-out via an env var the hook respects is left as a Phase 2 enhancement.
Privacy — the hook reads the assistant's text from the local transcript file and pipes it to say, which is a local OS binary — nothing leaves the machine. When/if Phase 3 adds cloud TTS providers, that becomes a meaningful privacy decision worth its own AgDR.
Filed at the user's request — initial phase only (TTS-on-pause, macOS, keyboard-reply). Voice input is explicitly deferred to Phase 4.
Configuration
Add a voice_prompts section to .claude/project-config.defaults.json:

```json
{
  "voice_prompts": {
    "enabled": false,
    "voice": "Daniel",
    "max_chars": 200,
    "rate_wpm": 180,
    "trigger": "questions-only"
  }
}
```

Adopters override per-fork in .claude/project-config.json. The shipped default is off, so existing forks see no behaviour change until they opt in.
enabled: false makes the hook a no-op.
voice: say voice name. Default Daniel (British, premium-quality on modern macOS; the closest free Jarvis-alike).
max_chars: maximum length of the spoken text. Default 200, truncated at a sentence boundary.
rate_wpm: say -r <wpm> rate. Default 180 (slightly faster than the say default; closer to spoken English pace).
trigger: questions-only (default; the heuristic above) or always (speaks every turn-end).
Wiring
Settings hook entry in .claude/settings.json:

```json
{
  "hooks": {
    "Stop": [
      {
        "matcher": ".*",
        "hooks": [
          { "type": "command", "command": "${ops_root}/.claude/hooks/voice-prompt-on-pause.sh" }
        ]
      }
    ]
  }
}
```

The hook script reads config via the existing _lib-read-config.sh library.