[Chore] Configurable Jarvis-style voice prompts when waiting for user input #134

@atlas-apex

Driver

When the assistant pauses for user input (design choices, per-PR merge approvals, an ambiguous "which path do you want", etc.), the input request is a text message in the terminal. If the user has stepped away from the keyboard, the conversation stalls silently: the assistant is waiting but gives no signal that it needs attention.

A configurable text-to-speech layer that speaks the question out loud when the assistant ends a turn with a question (Jarvis-from-Iron-Man style) would surface those moments without requiring the user to babysit the terminal. The user still replies via keyboard — no voice input in the initial phase.

This is a quality-of-life improvement for adopters running long sessions (especially the multi-PR launch flows where the assistant fires 10+ "approved?" / "merge X?" / "design call: a/b/c?" prompts over an afternoon).

Scope

Initial phase: single platform (macOS), single TTS engine, conservative trigger heuristic. Cross-platform and higher-quality TTS providers are explicitly out of scope here — they're follow-ups (see Future phases below) once the trigger model is proven.

Mechanism

A new Stop hook at .claude/hooks/voice-prompt-on-pause.sh:

  1. Fires on assistant turn end (Stop event)
  2. Reads the last assistant message from the transcript file (JSONL) whose path Claude Code passes to the hook (per the Claude Code hooks spec, the hook receives a JSON object { transcript_path, ... } on stdin)
  3. If the message ends with a question or matches a "waiting for input" heuristic (see below), runs say -v <voice> -r <rate_wpm> "<message-or-summary>"
  4. Always exits 0 — TTS failure must never block the conversation
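
The four steps above might be sketched as follows. This is a minimal sketch, not the shipped hook: it assumes jq is installed, assumes the transcript is a JSONL file whose assistant entries carry text blocks under .message.content[], and uses hypothetical function names (last_assistant_text, speak_if_waiting). The config lookup is left out and only the trailing-? trigger is checked.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes jq is installed and the transcript is JSONL whose
# assistant entries carry text blocks under .message.content[] (an assumption).

# Print the text of the last assistant entry; print nothing useful on any error.
last_assistant_text() {
  local transcript="$1"
  [ -r "$transcript" ] || return 0
  jq -rs '
    [ .[] | select(.type == "assistant") ] | last
    | [ .message.content[]? | select(.type == "text") | .text ]
    | join("\n")
  ' "$transcript" 2>/dev/null || true
}

# Steps 1-4: read the hook input, find the last message, speak if it is a question.
speak_if_waiting() {
  local voice="${1:-Daniel}" rate="${2:-180}" input transcript text
  input="$(cat)"                                   # step 1: hook input on stdin
  transcript="$(printf '%s' "$input" | jq -r '.transcript_path // empty' 2>/dev/null)"
  text="$(last_assistant_text "$transcript")"      # step 2
  text="${text%"${text##*[![:space:]]}"}"          # trim trailing whitespace
  case "$text" in
    *\?) say -v "$voice" -r "$rate" "$text" || true ;;   # step 3 (trailing "?" only)
  esac
  return 0                                         # step 4: never block the turn
}
```

In the hook script itself, the last lines would be something like speak_if_waiting "$voice" "$rate" followed by exit 0, with voice and rate read from config.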

Trigger heuristic (initial phase, deliberately conservative)

The hook only speaks if the last assistant message:

  • Ends with ? (after stripping trailing whitespace + closing markdown formatting), OR
  • Contains a recognised "Reply with X" / "Approved?" / "Confirm Y" / "a/b/c" / "Reply <token>" pattern in its last paragraph

This skips informational summary messages (which often end with "."), tool-result reports, and progress updates. The initial heuristic prefers false negatives over false positives: a missed prompt is annoying, but TTS reading out a 200-line tool output is unbearable.
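
One way the two trigger rules could look in shell is sketched below. The function name is_waiting_for_input is hypothetical, and the regular expression is only an illustrative guess at the "recognised" phrases; the real pattern list would need tuning against actual transcripts.

```shell
# Return 0 (speak) if the message looks like a request for input, 1 otherwise.
is_waiting_for_input() {
  local msg="$1" last_para nl='
'
  # Drop trailing whitespace and closing markdown markers (* _ ` ~).
  while :; do
    case "$msg" in
      *[[:space:]] | *'*' | *_ | *'`' | *'~') msg="${msg%?}" ;;
      *) break ;;
    esac
  done

  # Rule 1: the message ends with "?".
  case "$msg" in
    *\?) return 0 ;;
  esac

  # Rule 2: the last paragraph (text after the final blank line) contains
  # one of the prompt phrases. Patterns here are illustrative assumptions.
  last_para="${msg##*"$nl$nl"}"
  printf '%s\n' "$last_para" |
    grep -Eq 'Reply (with )?[^[:space:]]|Approved\?|Confirm [^[:space:]]|a/b/c'
}
```

For example, is_waiting_for_input 'All 14 tests passed.' returns 1, so a plain status report stays silent.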

What gets spoken

Not the full message — that would read aloud entire markdown tables, code blocks, and glossaries. The hook extracts:

  • The first sentence of the trailing paragraph that triggered the heuristic, OR
  • The trailing paragraph up to a configurable maximum (default: 200 chars), truncated at a sentence boundary

Markdown is stripped before TTS (backticks, asterisks, links, bullets) — say doesn't interpret them well.
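
The stripping and truncation described above could be sketched like this. Both function names are hypothetical, the markdown handling covers only links, bullets, headings, backticks and asterisks, and any ".", "!" or "?" is treated as a sentence end (so "e.g." would cut early); treat it as a starting point rather than the final rules.

```shell
# Strip common inline markdown from stdin so say is not handed raw syntax.
strip_markdown() {
  sed -E \
    -e 's/\[([^]]*)\]\([^)]*\)/\1/g' \
    -e 's/^[[:space:]]*[-*+][[:space:]]+//' \
    -e 's/^#+[[:space:]]*//' \
    -e 's/[`*]//g'
}

# Truncate $1 to at most $2 characters (default 200). Cuts back to the last
# sentence end inside the limit, or to the last space if there is none.
truncate_at_sentence() {
  local text="$1" max="${2:-200}" clipped
  [ "${#text}" -le "$max" ] && { printf '%s\n' "$text"; return 0; }
  clipped="$(printf "%.${max}s" "$text")"
  case "$clipped" in
    *[.!?]*) clipped="${clipped%"${clipped##*[.!?]}"}" ;;  # keep through last terminator
    *' '*)   clipped="${clipped% *}" ;;                    # no sentence end: avoid mid-word cut
  esac
  printf '%s\n' "$clipped"
}
```

A caller would typically pipe the trailing paragraph through strip_markdown first and then pass the result to truncate_at_sentence with max_chars from config.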

Configuration

Add a voice_prompts section to .claude/project-config.defaults.json:

{
  "voice_prompts": {
    "enabled": false,
    "voice": "Daniel",
    "max_chars": 200,
    "rate_wpm": 180,
    "trigger": "questions-only"
  }
}

Adopters override per-fork in .claude/project-config.json. The shipped default is off — no behaviour change for existing forks until they opt in.

| Field | Purpose |
| --- | --- |
| enabled | Master switch. false makes the hook a no-op. |
| voice | macOS say voice name. Default Daniel (British, premium-quality on modern macOS; the closest free Jarvis-alike). |
| max_chars | Caps the spoken length so a long question doesn't read for 30 seconds. Truncates at a sentence boundary. |
| rate_wpm | say -r <wpm> rate. Default 180 (slightly faster than the say default; closer to spoken English pace). |
| trigger | questions-only (default; the heuristic above) or always (speaks at every turn end). |

Wiring

Settings hook entry in .claude/settings.json:

{
  "hooks": {
    "Stop": [
      { "matcher": ".*", "hooks": [{ "type": "command", "command": "${ops_root}/.claude/hooks/voice-prompt-on-pause.sh" }] }
    ]
  }
}

The hook script reads config via the existing _lib-read-config.sh library.
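
The contents of _lib-read-config.sh are not shown in this ticket, so the following is only a hypothetical stand-in illustrating how a config_get-style lookup could layer the fork's project-config.json over the shipped defaults. It assumes jq is installed; the CONFIG_DIR variable and the deep-merge behaviour are assumptions, not the library's documented API.

```shell
# Hypothetical stand-in for config_get: deep-merge defaults with the fork's
# overrides (override wins) and print the value at the given jq path.
CONFIG_DIR="${CONFIG_DIR:-.claude}"

config_get() {
  local path="$1"
  local defaults="$CONFIG_DIR/project-config.defaults.json"
  local override="$CONFIG_DIR/project-config.json"
  [ -r "$defaults" ] || return 0
  if [ -r "$override" ]; then
    # jq's * operator merges objects recursively; the right-hand side wins.
    jq -rs "(.[0] * .[1]) | $path" "$defaults" "$override" 2>/dev/null
  else
    jq -r "$path" "$defaults" 2>/dev/null
  fi
}
```

With this shape, a fork that only sets "enabled": true in its override still inherits voice, max_chars, rate_wpm and trigger from the defaults file.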

Acceptance Criteria

  • .claude/hooks/voice-prompt-on-pause.sh exists, exits 0 in all branches, never blocks the conversation
  • When voice_prompts.enabled = true and the last assistant message ends with ?, the hook runs say -v <voice> -r <rate_wpm> "<truncated message>"
  • When enabled = false (the shipped default), the hook is a no-op (silent, no say invoked)
  • When enabled = true and the message does NOT match the trigger heuristic, no TTS — verified by feeding a few non-question turn-ends as test fixtures
  • Markdown is stripped before TTS, so say is never handed backticks, asterisks, or link syntax to read aloud
  • Truncation at sentence boundary, not mid-word, respecting max_chars
  • .claude/project-config.defaults.json updated with the new schema
  • docs/project-config.md documents the new section
  • .claude/settings.json Stop-hook entry shipped (the actual hook stays a no-op until enabled, so this is safe)
  • Test fixtures at .claude/hooks/_tests/voice-prompt-on-pause/ covering: enabled+question, enabled+statement, disabled, malformed transcript JSON
  • AgDR documenting: TTS provider choice (macOS say vs OpenAI TTS vs ElevenLabs), trigger heuristic vs ML-based question detection, why we're shipping macOS-only first
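
For the fixture-based criteria above, one possible approach is to put a fake say ahead of the real one on PATH and assert on what it was asked to speak. The helper below is a sketch under that assumption; run_with_stub_say and SAY_LOG are hypothetical names, and the fixture is whatever JSON the test wants to feed the hook on stdin.

```shell
# Run a hook with a fake `say` first on PATH and print what the fake was
# asked to speak. Prints nothing if the hook never invoked say.
run_with_stub_say() {
  local hook="$1" fixture="$2" stub_dir status
  stub_dir="$(mktemp -d)"
  # The stub records its arguments instead of producing audio.
  cat > "$stub_dir/say" <<'EOF'
#!/bin/sh
printf '%s\n' "$*" >> "$SAY_LOG"
EOF
  chmod +x "$stub_dir/say"
  : > "$stub_dir/log"
  SAY_LOG="$stub_dir/log" PATH="$stub_dir:$PATH" "$hook" < "$fixture"
  status=$?
  cat "$stub_dir/log"
  rm -rf "$stub_dir"
  return "$status"
}
```

A test for the disabled case would then check both that the output is empty and that the helper's exit status is 0, which also makes the tests runnable on non-macOS CI where no real say exists.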

Future phases (NOT in this ticket)

  • Phase 2 — cross-platform TTS (Linux espeak / spd-say, Windows Add-Type ... SpeechSynthesizer). Single config, OS-detection in the hook.
  • Phase 3 — high-quality TTS providers (OpenAI TTS, ElevenLabs). New config field provider. AgDR-worthy on cost vs quality.
  • Phase 4 — voice input. Whisper-based STT, voice-trigger phrases ("approve PR", "merge"). Significant scope; needs its own design.
  • Phase 5 — per-message overrides (a way for the assistant to mark a message as "speak this" or "skip TTS" inline).

Risks / Dependencies

  • macOS-only initial phase — adopters on Linux/Windows see no benefit until Phase 2. Acknowledged trade-off; the trigger model + config schema can be designed once and the platform layer added.
  • Trigger heuristic false positives: a tool result or summary that happens to end with ? would get read aloud. Mitigation: a deliberately conservative heuristic that is configurable and easy to disable.
  • Hook overhead on every turn end: the Stop hook fires every turn, even when disabled. The disabled path is [ "$(config_get '.voice_prompts.enabled')" = "true" ] || exit 0, which costs one shell start plus one config read (milliseconds, not seconds). Negligible.
  • TTS sound-output collision with system audio — if the user is on a call when the hook fires, say will speak through the active output. Acceptable side-effect; adopter can disable per-session by exporting an env var the hook respects (Phase 2 enhancement).
  • Privacy — the hook reads the assistant's text from the local transcript file and pipes it to say, which is a local OS binary — nothing leaves the machine. When/if Phase 3 adds cloud TTS providers, that becomes a meaningful privacy decision worth its own AgDR.

Filed at the user's request — initial phase only (TTS-on-pause, macOS, keyboard-reply). Voice input is explicitly deferred to Phase 4.
