feat(voice): implement real-time voice mode with cloud and local backends #24174
Abhijit-2592 merged 27 commits into main
Hi @Abhijit-2592, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this. We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines. Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed. Thank you for your understanding and for being a part of our community! |
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive voice interaction system for the Gemini CLI. By supporting both cloud-based and local transcription backends, it provides users with flexible options for dictating prompts. The changes include significant updates to the UI to support voice recording states, new configuration options for managing voice settings, and the necessary backend infrastructure to handle audio streaming and transcription processing.
Size Change: +40 kB (+0.12%) Total Size: 33.8 MB
Code Review
This pull request implements a real-time Voice Mode for the Gemini CLI, supporting both cloud-based transcription via the Gemini Live API and local-first transcription using Whisper. Key features include push-to-talk functionality, new slash commands for voice control, and a management system for downloading Whisper models. The review feedback correctly identifies that the API key requirement should be conditional on the selected backend to allow local mode to function independently. Additionally, a critical security vulnerability was found in the binary check utility, which is susceptible to command injection.
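The command injection concern raised in the review can be illustrated with a minimal sketch. This is not the PR's actual fix: the function names and the allow-list pattern are hypothetical. The idea is to validate the binary name and pass it as an argument array to `execFile`, so no shell ever parses it.

```typescript
import { execFile } from "node:child_process";

// Hypothetical allow-list: plain binary names only. No paths, no spaces,
// no shell metacharacters, and no leading dash (which could be read as a flag).
const SAFE_BINARY_NAME = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;

function isSafeBinaryName(name: string): boolean {
  return SAFE_BINARY_NAME.test(name);
}

// Resolves to true when `name` is found on PATH.
// The name is passed as an element of the argument array, so it is never
// interpolated into a shell command string.
function binaryExists(name: string): Promise<boolean> {
  if (!isSafeBinaryName(name)) {
    return Promise.resolve(false);
  }
  const lookup = process.platform === "win32" ? "where" : "which";
  return new Promise((resolve) => {
    execFile(lookup, [name], (error) => resolve(error === null));
  });
}
```

A caller could then check `await binaryExists("sox")` before starting a recording, while a value such as `sox; rm -rf /` is rejected before any process is spawned.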
0da31ca to c6c038f
Hi there! Thank you for your interest in contributing to Gemini CLI. To ensure we maintain high code quality and focus on our prioritized roadmap, we have updated our contribution policy (see Discussion #17383). We only guarantee review and consideration of pull requests for issues that are explicitly labeled as 'help wanted'. All other community pull requests are subject to closure after 14 days if they do not align with our current focus areas. For this reason, we strongly recommend that contributors only submit pull requests against issues explicitly labeled as 'help-wanted'. This pull request is being closed as it has been open for 14 days without a 'help wanted' designation. We encourage you to find and contribute to existing 'help wanted' issues in our backlog! Thank you for your understanding and for being part of our community! |
7561876 to aab45dc
Code Review
This pull request introduces an experimental voice mode to the Gemini CLI, allowing for both cloud-based transcription via the Gemini Live API and local transcription using Whisper. Key additions include a new voice configuration schema, a VoiceModelDialog for managing backends and models, and an AudioRecorder service. The feedback highlights critical issues regarding resource management, specifically a microphone access conflict when using the Whisper backend and a memory leak in the model download logic. Additionally, the current transcription handling in the UI is identified as a source of potential data loss during interleaved manual typing.
9460f47 to aee45c1
- Modified InputPrompt to allow toggling recording even when the buffer is non-empty, enabling seamless voice mode resumption.
- Implemented a "moving baseline" strategy that updates the transcription baseline on every 'turnComplete' event. This ensures that new sentences append to previous ones rather than overwriting them, supporting both incremental (Gemini Live) and cumulative (Whisper) providers.
- Fixed a race condition by capturing the current buffer text as the baseline immediately upon the user pressing the spacebar to start recording.
- Ensured voice mode hints remain visible when the mode is enabled, regardless of whether the buffer contains text.
- Added a comprehensive suite of 6 unit tests in InputPrompt.test.tsx covering basic toggle, multi-turn transcription, and session resumption scenarios.
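The "moving baseline" idea from this commit message can be sketched roughly as follows. The event shape and class name are hypothetical, and the sketch assumes each partial event carries the full transcript of the current turn so far. The PR's actual hook may normalize incremental and cumulative providers differently.

```typescript
// Hypothetical event shape; the real provider events in the PR may differ.
type TranscriptEvent =
  | { type: "partial"; text: string } // transcript of the current turn so far
  | { type: "turnComplete" };

class MovingBaseline {
  private baseline: string;
  private current: string;

  // Capture the buffer text at the moment recording starts, so text the
  // user typed earlier is preserved.
  constructor(bufferAtStart: string) {
    this.baseline = bufferAtStart;
    this.current = bufferAtStart;
  }

  // Returns the text the input buffer should show after this event.
  apply(event: TranscriptEvent): string {
    if (event.type === "partial") {
      const needsSpace =
        this.baseline.length > 0 &&
        event.text.length > 0 &&
        !/\s$/.test(this.baseline);
      // Partials replace only the text after the baseline.
      this.current = this.baseline + (needsSpace ? " " : "") + event.text;
    } else {
      // turnComplete: the finished turn becomes the new baseline, so the
      // next turn appends instead of overwriting.
      this.baseline = this.current;
    }
    return this.current;
  }
}
```

With a starting buffer of `fix the bug`, a turn transcribed as `in the parser` followed by a second turn `and add tests` yields `fix the bug in the parser and add tests`.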
This reverts commit 9833d36.
This commit addresses two critical issues preventing the `sandbox:docker` E2E integration test image from building successfully in the CI environment (specifically after the GitHub Actions runner update to Ubuntu 24.04):
1. **Permission Denied Error (EACCES)**: The `Dockerfile` was copying the `cli` and `core` .tgz packages as `root:root` but executing the subsequent `npm install` as the less privileged `node` user. This resulted in an EACCES permission denied error. The fix updates the `COPY` commands to use `--chown=node:node` to explicitly set ownership during the copy.
2. **Missing Dependencies**: The `gemini --version` sanity check at the end of the Docker build was failing with `ERR_MODULE_NOT_FOUND` because the newly added code for voice mode and file utilities was missing its required dependencies (`command-exists` and `isbinaryfile`) in the `@google/gemini-cli-core` package's `package.json`. These have been added.
- Refactor AudioRecorder for re-entrancy and better error reporting
- Improve GeminiLiveTranscriptionProvider WebSocket handling and safety
- Add type-safe events and path-traversal protection to WhisperModelManager
- Extract voice logic from InputPrompt into useVoiceMode hook
- Relocate and gate voice settings under experimental group
- Refactor voice commands: move /voice-model to /voice model subcommand
- Replace custom CircularProgress with standard CliSpinner
- Update documentation and keyboard shortcuts reference
- Fix InputPrompt unit tests and settings mocks
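As an illustration of the path-traversal protection mentioned for WhisperModelManager, a guard might look like the sketch below. The function name and error message are hypothetical and not taken from the PR. The approach resolves the requested model name against the models directory and rejects any result that falls outside it.

```typescript
import * as path from "node:path";

// Resolve a model file name inside modelsDir, rejecting anything that
// would escape that directory (for example "../../etc/passwd").
function resolveModelPath(modelsDir: string, modelName: string): string {
  const root = path.resolve(modelsDir);
  const candidate = path.resolve(root, modelName);
  const relative = path.relative(root, candidate);
  const escapes =
    relative === "" || // resolves to the directory itself, not a file in it
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Invalid model name: ${modelName}`);
  }
  return candidate;
}
```

A plain file name resolves to a path under the models directory, while relative or absolute names that point elsewhere throw before any file is read or written.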
…anscription drain
07bbe96 to 5098f69
Summary
This PR implements a real-time Voice Mode for Gemini CLI, allowing users to dictate prompts directly into the terminal. It supports both cloud-based transcription via the Gemini Live API and local-first transcription via Whisper (using `whisper.cpp`).
Fixes #24175
Details
- `~/.gemini/whisper_models/`.
- `/voice` slash command to toggle voice mode and switch backends.
- `/voice model` to manage local Whisper models.
- `space` to record, release to stop and submit.
- `sox` (`rec`) for cross-platform audio capture.
Installation Requirements
To use Voice Mode, you must install the following dependencies:
1. SoX (Sound eXchange)
Required for capturing audio from your microphone.
- `brew install sox`
- `sudo apt install sox libsox-fmt-all`
- `sox.exe` is in your `PATH`.
2. whisper-stream (for Local Transcription)
Required only if using the Whisper (Local) backend.
- `brew install whisper-cpp` (The package provides the `whisper-stream` binary).
- `stream` example: `make stream`.
- `stream` binary to `whisper-stream` and move it to a directory in your `PATH`.
Testing
- `/voice on` in the CLI or toggle it in `/settings`.
- `space`, speak a prompt, and release. The text should appear in the input and be submitted.
- `/settings` -> Voice -> Transcription Backend -> Gemini Live.
- `/settings` -> Voice -> Transcription Backend -> Whisper.
- `npm test packages/core/src/voice/liveTranscriptionService.test.ts`
- `npm test integration-tests/voice-mode.test.ts`
Checklist
- `npm run preflight` and all checks passed.