feat(voice): implement real-time voice mode with cloud and local backends #24174

Merged
Abhijit-2592 merged 27 commits into main from abhijit-2592/voice-mode-2
Apr 24, 2026

Contributor

@Abhijit-2592 Abhijit-2592 commented Mar 30, 2026

Summary

This PR implements a real-time Voice Mode for Gemini CLI, allowing users to dictate prompts directly into the terminal. It supports both cloud-based transcription via the Gemini Live API and local-first transcription via Whisper (using whisper.cpp).

Fixes #24175

Details

  • Transcription Backends:
    • Gemini Live API (Cloud): High-accuracy, real-time transcription using Google's Live API. Requires an API key.
    • Whisper (Local): Privacy-focused, local-first transcription. Supports multiple model sizes (Tiny, Base, Large) and automatically manages model downloads to ~/.gemini/whisper_models/.
  • UI Integration:
    • New /voice slash command to toggle voice mode and switch backends.
    • Use /voice model to manage local Whisper models.
    • Push-To-Talk (PTT): Hold space to record, release to stop and submit.
    • Continuous Mode: Dictate naturally with real-time text updates in the input buffer.
    • Dedicated Voice settings in the configuration dialog.
  • Audio Infrastructure:
    • Uses sox (rec) for cross-platform audio capture.
    • Robust handling of audio streams, including automatic VAD (Voice Activity Detection) support where available.

Installation Requirements

To use Voice Mode, you must install the following dependencies:

1. SoX (Sound eXchange)

Required for capturing audio from your microphone.

  • macOS: brew install sox
  • Linux: sudo apt install sox libsox-fmt-all
  • Windows: Download and install SoX from its SourceForge page. Ensure sox.exe is in your PATH.

2. whisper-stream (for Local Transcription)

Required only if using the Whisper (Local) backend.

  • macOS: brew install whisper-cpp (the package provides the whisper-stream binary).
  • Other Platforms:
    1. Clone the whisper.cpp repository.
    2. Build the stream example: make stream.
    3. Rename the resulting stream binary to whisper-stream and move it to a directory in your PATH.
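
The dependency rules above (SoX for capture on every backend, whisper-stream only for local transcription, an API key only for the cloud backend) can be sketched as a small planning function. This is an illustrative sketch, not the PR's code: `VoiceBackend`, `VoiceSettings`, `BackendPlan`, and `planBackend` are hypothetical names, and the binary names are taken from the installation notes rather than from the implementation.

```typescript
// Hypothetical types; the PR's actual settings schema may differ.
type VoiceBackend = 'gemini-live' | 'whisper';

interface VoiceSettings {
  backend: VoiceBackend;
  apiKey?: string; // only needed by the cloud backend
}

interface BackendPlan {
  backend: VoiceBackend;
  requiredBinaries: string[];
  needsApiKey: boolean;
}

// Decide what a backend needs before any recording starts.
// Assumption: SoX captures audio for both backends, as the
// installation notes state; whisper-stream is local-only.
function planBackend(settings: VoiceSettings): BackendPlan {
  switch (settings.backend) {
    case 'gemini-live':
      if (!settings.apiKey) {
        throw new Error('The Gemini Live backend requires an API key.');
      }
      return {
        backend: 'gemini-live',
        requiredBinaries: ['sox'],
        needsApiKey: true,
      };
    case 'whisper':
      // The local backend is planned without consulting the API key.
      return {
        backend: 'whisper',
        requiredBinaries: ['sox', 'whisper-stream'],
        needsApiKey: false,
      };
  }
}
```

Keeping the key check inside the cloud branch means a user without an API key can still select the Whisper backend.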

Testing

  1. Enable Voice Mode: Run /voice on in the CLI or toggle it in /settings.
  2. Push-To-Talk: Hold space, speak a prompt, and release. The text should appear in the input and be submitted.
  3. Switch Backends:
    • For Cloud: /settings -> Voice -> Transcription Backend -> Gemini Live.
    • For Local: /settings -> Voice -> Transcription Backend -> Whisper.
  4. Validation:
    • Run unit tests: npm test packages/core/src/voice/liveTranscriptionService.test.ts
    • Run integration tests: npm test integration-tests/voice-mode.test.ts

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have added/updated tests to cover my changes.
  • I have updated the documentation (if applicable).
  • I have run npm run preflight and all checks passed.

@Abhijit-2592 Abhijit-2592 requested a review from a team as a code owner March 30, 2026 00:33
Contributor

gemini-cli Bot commented Mar 30, 2026

Hi @Abhijit-2592, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive voice interaction system for the Gemini CLI. By supporting both cloud-based and local transcription backends, it provides users with flexible options for dictating prompts. The changes include significant updates to the UI to support voice recording states, new configuration options for managing voice settings, and the necessary backend infrastructure to handle audio streaming and transcription processing.

Highlights

  • Voice Mode Implementation: Introduced a real-time voice mode allowing users to dictate prompts using either the Gemini Live API (cloud) or Whisper (local).
  • UI/UX Enhancements: Added a new /voice slash command, push-to-talk functionality via the spacebar, and a dedicated voice settings dialog.
  • Audio Infrastructure: Integrated SoX for cross-platform audio capture and implemented a robust transcription service factory to manage different backends.
  • Local Model Management: Added a Whisper model manager to handle automatic downloads and configuration of local transcription models.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


github-actions Bot commented Mar 30, 2026

Size Change: +40 kB (+0.12%)

Total Size: 33.8 MB

Filename Size Change
./bundle/chunk-CJQF4Z4P.js 0 B -49.2 kB (removed) 🏆
./bundle/chunk-HCNND337.js 0 B -669 kB (removed) 🏆
./bundle/chunk-JTXZMAH6.js 0 B -2.73 MB (removed) 🏆
./bundle/chunk-KXQ7D3UP.js 0 B -3.43 kB (removed) 🏆
./bundle/chunk-OJPMPJ6A.js 0 B -14.6 MB (removed) 🏆
./bundle/chunk-Z3HQYKEX.js 0 B -3.8 kB (removed) 🏆
./bundle/core-EMXUA7GA.js 0 B -47.7 kB (removed) 🏆
./bundle/devtoolsService-XE3OICRT.js 0 B -27.8 kB (removed) 🏆
./bundle/gemini-ESVRSJ42.js 0 B -578 kB (removed) 🏆
./bundle/interactiveCli-4VLJILXH.js 0 B -1.29 MB (removed) 🏆
./bundle/liteRtServerManager-4HSYFP3G.js 0 B -2.08 kB (removed) 🏆
./bundle/oauth2-provider-PWQ4XMU3.js 0 B -9.16 kB (removed) 🏆
./bundle/chunk-DN3EMB6X.js 3.43 kB +3.43 kB (new file) 🆕
./bundle/chunk-FD2S3KYG.js 14.6 MB +14.6 MB (new file) 🆕
./bundle/chunk-JUIFMB2M.js 3.8 kB +3.8 kB (new file) 🆕
./bundle/chunk-K3ZYGA7J.js 672 kB +672 kB (new file) 🆕
./bundle/chunk-MT623GTR.js 49.2 kB +49.2 kB (new file) 🆕
./bundle/chunk-X4MR4PEB.js 2.73 MB +2.73 MB (new file) 🆕
./bundle/core-IULKZ22Q.js 48 kB +48 kB (new file) 🆕
./bundle/devtoolsService-KBE5Y6NF.js 27.8 kB +27.8 kB (new file) 🆕
./bundle/gemini-553UOQFL.js 573 kB +573 kB (new file) 🆕
./bundle/interactiveCli-GTAJV72M.js 1.31 MB +1.31 MB (new file) 🆕
./bundle/liteRtServerManager-D3NUXXIV.js 2.08 kB +2.08 kB (new file) 🆕
./bundle/oauth2-provider-OLJEB2AH.js 9.16 kB +9.16 kB (new file) 🆕
ℹ️ View Unchanged
Filename Size Change
./bundle/bundled/third_party/index.js 8 MB 0 B
./bundle/chunk-34MYV7JD.js 2.45 kB 0 B
./bundle/chunk-5AUYMPVF.js 858 B 0 B
./bundle/chunk-5PS3AYFU.js 1.18 kB 0 B
./bundle/chunk-664ZODQF.js 124 kB 0 B
./bundle/chunk-DAHVX5MI.js 206 kB 0 B
./bundle/chunk-IUUIT4SU.js 56.5 kB 0 B
./bundle/chunk-MTD736U4.js 1.97 MB 0 B
./bundle/chunk-RJTRUG2J.js 39.8 kB 0 B
./bundle/cleanup-3RKECZLL.js 0 B -932 B (removed) 🏆
./bundle/devtools-36NN55EP.js 696 kB 0 B
./bundle/dist-T73EYRDX.js 356 B 0 B
./bundle/events-XB7DADIJ.js 418 B 0 B
./bundle/examples/hooks/scripts/on-start.js 188 B 0 B
./bundle/examples/mcp-server/example.js 1.43 kB 0 B
./bundle/gemini.js 4.97 kB 0 B
./bundle/getMachineId-bsd-TXG52NKR.js 1.55 kB 0 B
./bundle/getMachineId-darwin-7OE4DDZ6.js 1.55 kB 0 B
./bundle/getMachineId-linux-SHIFKOOX.js 1.34 kB 0 B
./bundle/getMachineId-unsupported-5U5DOEYY.js 1.06 kB 0 B
./bundle/getMachineId-win-6KLLGOI4.js 1.72 kB 0 B
./bundle/memoryDiscovery-NSOLCG4U.js 980 B 0 B
./bundle/multipart-parser-KPBZEGQU.js 11.7 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js 222 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js 229 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js 13.4 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js 132 B 0 B
./bundle/sandbox-macos-permissive-open.sb 890 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B
./bundle/sandbox-macos-strict-open.sb 4.82 kB 0 B
./bundle/sandbox-macos-strict-proxied.sb 5.02 kB 0 B
./bundle/src-QVCVGIUX.js 47 kB 0 B
./bundle/start-YKG77TL6.js 0 B -622 B (removed) 🏆
./bundle/tree-sitter-7U6MW5PS.js 274 kB 0 B
./bundle/tree-sitter-bash-34ZGLXVX.js 1.84 MB 0 B
./bundle/cleanup-XNHBMPY3.js 932 B +932 B (new file) 🆕
./bundle/start-7BUEMFYN.js 622 B +622 B (new file) 🆕

compressed-size-action

Comment thread packages/core/src/utils/binaryCheck.ts Fixed
Comment thread packages/core/src/utils/binaryCheck.ts Fixed
@Abhijit-2592 Abhijit-2592 marked this pull request as draft March 30, 2026 00:38
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a real-time Voice Mode for the Gemini CLI, supporting both cloud-based transcription via the Gemini Live API and local-first transcription using Whisper. Key features include push-to-talk functionality, new slash commands for voice control, and a management system for downloading Whisper models. The review feedback correctly identifies that the API key requirement should be conditional on the selected backend to allow local mode to function independently. Additionally, a critical security vulnerability was found in the binary check utility, which is susceptible to command injection.
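
One common way to avoid command injection in a binary check is to never pass the name to a shell at all, and instead scan the PATH directories directly. The sketch below illustrates that approach under stated assumptions: `findBinaryOnPath` and `SAFE_BINARY_NAME` are hypothetical names, and this is not necessarily how binaryCheck.ts was fixed in the PR.

```typescript
import { accessSync, constants, statSync } from 'node:fs';
import { delimiter, join } from 'node:path';

// Only plain program names are accepted; anything containing spaces,
// slashes, or shell metacharacters is rejected outright.
const SAFE_BINARY_NAME = /^[A-Za-z0-9._-]+$/;

// Look for an executable file called `name` in the PATH directories.
// No shell is spawned, so the name is never interpreted as a command.
function findBinaryOnPath(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string | null {
  if (!SAFE_BINARY_NAME.test(name)) {
    return null;
  }
  const dirs = (env.PATH ?? '').split(delimiter).filter(Boolean);
  // On Windows, executables carry an extension listed in PATHEXT.
  const exts =
    process.platform === 'win32'
      ? (env.PATHEXT ?? '.EXE;.CMD;.BAT').split(';')
      : [''];
  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = join(dir, name + ext);
      try {
        if (statSync(candidate).isFile()) {
          accessSync(candidate, constants.X_OK);
          return candidate;
        }
      } catch {
        // Missing or not executable: try the next candidate.
      }
    }
  }
  return null;
}
```

A caller would check `findBinaryOnPath('sox')` or `findBinaryOnPath('whisper-stream')` before enabling the matching backend and show an installation hint when the result is `null`.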

Comment thread packages/cli/src/ui/components/InputPrompt.tsx Outdated
Comment thread packages/core/src/utils/binaryCheck.ts
@gemini-cli gemini-cli Bot added the area/core Issues related to User Interface, OS Support, Core Functionality label Mar 30, 2026
@Abhijit-2592 Abhijit-2592 force-pushed the abhijit-2592/voice-mode-2 branch from 0da31ca to c6c038f Compare March 31, 2026 02:29
Contributor

gemini-cli Bot commented Apr 13, 2026

Hi there! Thank you for your interest in contributing to Gemini CLI.

To ensure we maintain high code quality and focus on our prioritized roadmap, we have updated our contribution policy (see Discussion #17383).

We only guarantee review and consideration of pull requests for issues that are explicitly labeled as 'help wanted'. All other community pull requests are subject to closure after 14 days if they do not align with our current focus areas. For this reason, we strongly recommend that contributors only submit pull requests against issues explicitly labeled as 'help-wanted'.

This pull request is being closed as it has been open for 14 days without a 'help wanted' designation. We encourage you to find and contribute to existing 'help wanted' issues in our backlog! Thank you for your understanding and for being part of our community!

@gemini-cli gemini-cli Bot closed this Apr 14, 2026
@Abhijit-2592 Abhijit-2592 reopened this Apr 15, 2026
@Abhijit-2592 Abhijit-2592 self-assigned this Apr 15, 2026
@Abhijit-2592 Abhijit-2592 force-pushed the abhijit-2592/voice-mode-2 branch from 7561876 to aab45dc Compare April 15, 2026 21:48
@Abhijit-2592 Abhijit-2592 marked this pull request as ready for review April 15, 2026 21:49
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an experimental voice mode to the Gemini CLI, allowing for both cloud-based transcription via the Gemini Live API and local transcription using Whisper. Key additions include a new voice configuration schema, a VoiceModelDialog for managing backends and models, and an AudioRecorder service. The feedback highlights critical issues regarding resource management, specifically a microphone access conflict when using the Whisper backend and a memory leak in the model download logic. Additionally, the current transcription handling in the UI is identified as a source of potential data loss during interleaved manual typing.

Comment thread packages/cli/src/ui/components/InputPrompt.tsx Outdated
Comment thread packages/cli/src/ui/components/InputPrompt.tsx Outdated
Comment thread packages/cli/src/ui/components/VoiceModelDialog.tsx Outdated
@Abhijit-2592 Abhijit-2592 force-pushed the abhijit-2592/voice-mode-2 branch 3 times, most recently from 9460f47 to aee45c1 Compare April 16, 2026 00:02
@Abhijit-2592 Abhijit-2592 requested a review from a team as a code owner April 16, 2026 00:02
@Abhijit-2592 Abhijit-2592 requested a review from a team as a code owner April 16, 2026 00:39
@gemini-cli gemini-cli Bot closed this Apr 16, 2026
- Modified InputPrompt to allow toggling recording even when the buffer is non-empty, enabling seamless voice mode resumption.
- Implemented a "moving baseline" strategy that updates the transcription baseline on every 'turnComplete' event. This ensures that new sentences append to previous ones rather than overwriting them, supporting both incremental (Gemini Live) and cumulative (Whisper) providers.
- Fixed a race condition by capturing the current buffer text as the baseline immediately upon the user pressing the spacebar to start recording.
- Ensured voice mode hints remain visible when the mode is enabled, regardless of whether the buffer contains text.
- Added a comprehensive suite of 6 unit tests in InputPrompt.test.tsx covering basic toggle, multi-turn transcription, and session resumption scenarios.
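
The "moving baseline" idea described above can be sketched as a small state holder. This is a hedged illustration only: `TranscriptBuffer`, `ProviderMode`, and the method names are hypothetical, and it assumes that a cumulative provider restarts its running text after each turn completes, which the commit message does not spell out.

```typescript
// 'incremental': each event carries only the newly recognized text.
// 'cumulative': each event carries the full text of the current turn.
type ProviderMode = 'incremental' | 'cumulative';

class TranscriptBuffer {
  private baseline: string;
  private turnText = '';
  private readonly mode: ProviderMode;

  // The baseline is captured from the input buffer at the moment
  // recording starts, so typed text is kept rather than overwritten.
  constructor(bufferTextAtStart: string, mode: ProviderMode) {
    this.baseline = bufferTextAtStart;
    this.mode = mode;
  }

  // Apply one transcription event and return the text to display.
  onTranscript(text: string): string {
    this.turnText =
      this.mode === 'incremental' ? this.turnText + text : text;
    return this.render();
  }

  // On turn completion the rendered text becomes the new baseline,
  // so the next sentence appends instead of replacing.
  onTurnComplete(): string {
    this.baseline = this.render();
    this.turnText = '';
    return this.baseline;
  }

  private render(): string {
    const turn = this.turnText.trim();
    if (this.baseline && turn) {
      return `${this.baseline} ${turn}`;
    }
    return this.baseline || turn;
  }
}
```

Because only the current turn's text is ever replaced, both provider styles reduce to the same append-on-turn-complete behavior.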

This commit addresses two critical issues preventing the `sandbox:docker` E2E integration test image from building successfully in the CI environment (specifically after the GitHub Actions runner update to Ubuntu 24.04):

1. **Permission Denied Error (EACCES)**: The `Dockerfile` was copying the `cli` and `core` .tgz packages as `root:root` but executing the subsequent `npm install` as the less privileged `node` user. This resulted in an EACCES permission denied error. The fix updates the `COPY` commands to use `--chown=node:node` to explicitly set ownership during the copy.

2. **Missing Dependencies**: The `gemini --version` sanity check at the end of the Docker build was failing with `ERR_MODULE_NOT_FOUND` because the newly added code for voice mode and file utilities was missing its required dependencies (`command-exists` and `isbinaryfile`) in the `@google/gemini-cli-core` package's `package.json`. These have been added.

- Refactor AudioRecorder for re-entrancy and better error reporting
- Improve GeminiLiveTranscriptionProvider WebSocket handling and safety
- Add type-safe events and path-traversal protection to WhisperModelManager
- Extract voice logic from InputPrompt into useVoiceMode hook
- Relocate and gate voice settings under experimental group
- Refactor voice commands: move /voice-model to /voice model subcommand
- Replace custom CircularProgress with standard CliSpinner
- Update documentation and keyboard shortcuts reference
- Fix InputPrompt unit tests and settings mocks
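
Path-traversal protection for model files, as mentioned in the list above, usually comes down to resolving the requested file against the models directory and rejecting anything that lands outside it. The following is a minimal sketch of that check, not the WhisperModelManager code: `resolveModelPath` is a hypothetical helper, and only the `~/.gemini/whisper_models/` location comes from the PR description.

```typescript
import { homedir } from 'node:os';
import { join, resolve, sep } from 'node:path';

// Default location taken from the PR description.
const DEFAULT_MODELS_DIR = join(homedir(), '.gemini', 'whisper_models');

// Resolve a model file name inside the models directory, refusing any
// name that would escape it ('../x', absolute paths, empty names).
function resolveModelPath(
  fileName: string,
  modelsDir: string = DEFAULT_MODELS_DIR,
): string {
  const root = resolve(modelsDir);
  const candidate = resolve(root, fileName);
  // The candidate must be strictly below the root directory.
  if (!candidate.startsWith(root + sep)) {
    throw new Error(`Refusing model path outside ${root}: ${fileName}`);
  }
  return candidate;
}
```

Comparing against `root + sep` rather than `root` alone also rejects sibling directories that merely share a prefix, such as `whisper_models_old`.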
@Abhijit-2592 Abhijit-2592 force-pushed the abhijit-2592/voice-mode-2 branch from 07bbe96 to 5098f69 Compare April 24, 2026 21:00
@Abhijit-2592 Abhijit-2592 added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit 2e0641c Apr 24, 2026
26 of 27 checks passed
@Abhijit-2592 Abhijit-2592 deleted the abhijit-2592/voice-mode-2 branch April 24, 2026 21:41
kimjune01 pushed a commit to kimjune01/gemini-cli-claude that referenced this pull request May 6, 2026

Labels

area/core Issues related to User Interface, OS Support, Core Functionality

Development

Successfully merging this pull request may close these issues.

Epic: [Voice Mode] Refinement and Polish

6 participants