A privacy-focused speech-to-text keyboard (IME) for Android. Speech recognition runs entirely on-device - no internet needed after the initial model download, no account, no data leaving your phone.
It uses NVIDIA's Parakeet-TDT v3 automatic speech recognition model, quantized to INT8 and run via ONNX Runtime for efficient on-device inference. Voice activity detection uses Silero VAD v4 (also ONNX, also fully on-device) to suppress silence before it ever reaches the ASR model.
- Fully offline after setup - audio is never transmitted anywhere
- Real-time transcription - progressive partial results while you speak
- Works in any app - injects text via Android's standard `InputConnection` API
- Parakeet-TDT 0.6B v3 - INT8 quantized, ~700 MB, runs on mid-range hardware
- Voice Activity Detection - Silero VAD v4 neural network (ONNX) filters silence before it reaches the ASR model; falls back to energy-threshold VAD if the model can't load
- Configurable trigger modes - hold-to-talk or tap-to-toggle
- Word correction bar - optional suggestion bar that appears after dictation, offering up to 5 on-device correction candidates for the word under the cursor. Uses downloadable language packs (dictionary + bigram language model) for Dutch, English, French, German, Italian, Polish, and Spanish. Language packs are fetched on demand from minburg/outspoke-data — the only external source used at runtime besides the one-time ASR model download. All correction runs entirely on-device once files are downloaded; the feature is opt-in and off by default.
- No Google Play Services, no telemetry, no analytics
| Requirement | Minimum |
|---|---|
| Android version | 11 (API 30) |
| RAM | 4 GB recommended |
| Free storage | ~750 MB (for ASR model files) + up to ~8 MB per language for optional word-correction packs |
| Permissions | RECORD_AUDIO, INTERNET (model download only), POST_NOTIFICATIONS |
The `INTERNET` permission is used for the one-time ASR model download from Hugging Face, and optionally for downloading word-correction language packs (~8 MB per language, only if you enable the feature). After both are downloaded, the keyboard works fully offline.
- Install the APK from Releases or build from source (see below).
- Open the Outspoke app and follow the three setup steps:
  - Enable Outspoke in System Settings → Keyboard / Input Methods
  - Grant the microphone permission
  - Download the model (~700 MB, Wi-Fi recommended)
- Switch to the Outspoke keyboard in any text field and tap the mic button.
Outspoke is structured as a clean layered pipeline. The SpeechEngine interface decouples all inference code from the service and audio layers - adding a new backend means implementing that one interface and nothing else.
```
┌─────────────────────────────────┐
│           Active App            │
│          (Text Field)           │
└──────────────┬──────────────────┘
               │  InputConnection API
┌──────────────▼──────────────────┐
│   OutspokeInputMethodService    │  ← Android IME service
│  (LifecycleOwner + Compose UI)  │
│  ┌───────────────────────────┐  │
│  │     KeyboardViewModel     │  │  ← UI state + capture lifecycle
│  └───────────────────────────┘  │
└──────────────┬──────────────────┘
               │  binds to
┌──────────────▼──────────────────┐
│        InferenceService         │  ← Foreground service (keeps engine alive)
│  ┌───────────────────────────┐  │
│  │    InferenceRepository    │  │  ← Sliding-window buffer (30 s max)
│  │  ┌─────────────────────┐  │  │
│  │  │    SpeechEngine     │  │  │  ← Interface (swap models here)
│  │  │  (ParakeetEngine)   │  │  │
│  │  └─────────────────────┘  │  │
│  └───────────────────────────┘  │
└──────────────┬──────────────────┘
               │  Flow<AudioChunk>
┌──────────────▼──────────────────┐
│       AudioCaptureManager       │  ← 16 kHz / 16-bit / mono PCM
│         SileroVadFilter         │  ← Neural VAD (Silero v4, ONNX)
│          RMSVadFilter           │  ← Energy-threshold fallback
└─────────────────────────────────┘
```
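As an illustration of the energy-threshold fallback in the bottom layer, the sketch below classifies one chunk of 16-bit PCM as speech when its RMS level exceeds a threshold. The function names `rmsLevel` and `isSpeech` and the threshold value are assumptions made for this example; they are not taken from `RMSVadFilter`, which may also apply hangover logic that is not shown here.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: root-mean-square level of a 16-bit PCM chunk, scaled to [0, 1].
fun rmsLevel(samples: ShortArray): Double {
    if (samples.isEmpty()) return 0.0
    var sumSquares = 0.0
    for (s in samples) {
        val x = s / 32768.0 // scale to roughly [-1, 1]
        sumSquares += x * x
    }
    return sqrt(sumSquares / samples.size)
}

// Assumed threshold for illustration only; a real filter would tune this value.
fun isSpeech(samples: ShortArray, threshold: Double = 0.02): Boolean =
    rmsLevel(samples) > threshold
```

At 16 kHz, a 40 ms chunk holds 640 samples, so `isSpeech(ShortArray(640))` evaluates an all-zero (silent) chunk and returns `false`.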
| Package | Class | Role |
|---|---|---|
| `inference` | `SpeechEngine` | Interface - model-agnostic contract for loading, transcribing, and closing any ASR engine |
| `inference` | `ParakeetEngine` | Implements `SpeechEngine` using three ONNX sessions (preprocessor → encoder → decoder/joint) |
| `inference` | `InferenceService` | `LifecycleService` that owns the engine and exposes `InferenceRepository` to bound clients |
| `inference` | `InferenceRepository` | Sliding-window inference driver: buffers audio chunks, waits for ≥ 2 s of context, then fires a partial inference every 1 s up to a 30 s hard ceiling; tracks the last 3 partials and performs stable-chunk trims when a common leading-word prefix is confirmed, emitting `TranscriptResult.WindowTrimmed` to `TextInjector`; force-trims on divergence loops (> 12 s) and silence runs (2 consecutive blank strides); applies a post-processing pipeline to every raw transcript (filler-word removal, stutter collapse ≥ 3×, phrase-loop deduplication, leading-dot / leading-punct stripping, trailing-dot normalisation, missing sentence-space repair, sentence-boundary capitalisation) |
| `audio` | `AudioCaptureManager` | Opens `AudioRecord`, emits 40 ms `AudioChunk`s as a cold `Flow`; drains hardware buffer and VAD hangover on stop |
| `audio` | `VadFilter` | Interface - common contract for VAD implementations (`process`, `flush`, `isSpeechActive`) |
| `audio` | `SileroVadFilter` | Neural VAD using Silero v4 (ONNX); preserves RNN state across chunks; primary filter when model is available |
| `audio` | `RMSVadFilter` | Energy-threshold VAD; used as fallback when Silero ONNX model can't load |
| `ime` | `OutspokeInputMethodService` | Core IME; wires Compose view tree, binds `InferenceService`, drives capture lifecycle |
| `ime` | `TextInjector` | Writes partial/final text into the focused field via `InputConnection`; keeps the last 6 words as a mutable composing span (underlined) and permanently freezes earlier words; delegates new-content discovery to `TranscriptAligner.findNewContent`; on `WindowTrimmed` performs a three-step reset (commit composing minus last 2 uncertain tail words, clear `lastPartial`, re-anchor `committedWords` from the actual field content); two-layer alignment recovery (field-scan → composing-commit fallback) prevents silent word drops on complete divergence |
| `ime` | `TranscriptAligner` | Stateless alignment utilities (`normalizeWord`, `splitToWords`, `findNewContent`); `findNewContent` uses a three-layer overlap search - (1) full prefix match, (2) suffix-prefix overlap ≥ 2 words, (3) interior scan ≥ 2 words - to locate genuinely new content in a fresh partial relative to already-committed words, tolerating Parakeet attention drift and post-trim leading garbage tokens |
| `ui` | `KeyboardViewModel` | Bridges IME lifecycle, audio capture, and inference results into `KeyboardUiState`; owns `captureJob` |
| `settings` | `ModelDownloadManager` | Downloads model files from Hugging Face over OkHttp with SHA-256 verification |
| `settings` | `ModelStorageManager` | Manages model file paths inside `filesDir` (no external storage permission needed) |
| `ime/correction` | `WordSuggestionProvider` | Public façade for on-device word correction; loads only user-selected languages and delivers results on the main thread |
| `ime/correction` | `WordCorrector` | Orchestrates the correction pipeline: phonetic candidate generation → bigram language model re-ranking |
| `ime/correction` | `SuggestionFileDownloader` | Downloads dictionary + ARPA LM files for a given language from minburg/outspoke-data; supports resumable downloads and SHA-256 verification |
| `ui/keyboard/components` | `SuggestionBar` | Animated chip row that appears after dictation commits, showing up to 5 correction candidates |
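To make the second layer of `findNewContent` more concrete, here is a minimal sketch of a suffix-prefix overlap search: it looks for the longest run of words that both ends the committed text and starts the fresh partial, then returns whatever follows. The function name `newWordsAfterOverlap`, its signature, and the lowercase-only normalisation are assumptions for illustration; the actual `TranscriptAligner` may work differently.

```kotlin
// Hypothetical sketch of a suffix-prefix overlap search.
// Returns the words of `partial` that follow the overlap, or null when the
// overlap is shorter than `minOverlap` words.
fun newWordsAfterOverlap(
    committed: List<String>,
    partial: List<String>,
    minOverlap: Int = 2,
): List<String>? {
    val a = committed.map { it.lowercase() }
    val b = partial.map { it.lowercase() }
    val maxLen = minOf(a.size, b.size)
    // Try the longest candidate overlap first so the match is as specific as possible.
    for (len in maxLen downTo minOverlap) {
        if (a.subList(a.size - len, a.size) == b.subList(0, len)) {
            return partial.drop(len)
        }
    }
    return null
}
```

Requiring at least two overlapping words keeps a single common word such as "the" from being mistaken for an alignment point; a `null` result is where a further fallback (such as an interior scan) would take over.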
- Raw PCM (16-bit signed) is normalised to `float32 [-1, 1]`
- `nemo128.onnx` - computes 128-dim log-mel spectrogram features
- `encoder-model.int8.onnx` - FastConformer encoder → `[B, 1024, T_enc]`
- `decoder_joint-model.int8.onnx` - greedy TDT decoding with LSTM state carry-over
- Token IDs are mapped to text via `vocab.txt`
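The first step above, normalising 16-bit PCM to floats, can be sketched in a few lines. The helper name `pcm16ToFloat` and the divisor 32768 are assumptions (dividing by 32768 is a common convention); the project's own scaling may differ.

```kotlin
// Hypothetical helper: convert 16-bit signed PCM samples to float32.
// Dividing by 32768 maps Short.MIN_VALUE to exactly -1.0f and
// Short.MAX_VALUE to just under 1.0f.
fun pcm16ToFloat(samples: ShortArray): FloatArray =
    FloatArray(samples.size) { i -> samples[i] / 32768f }
```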
Partial results are emitted every ~1 s once ≥ 2 s of audio is in the rolling window; the window grows up to a hard 30 s ceiling. After every partial the last 3 results are compared - if their leading words form a stable common prefix, the corresponding audio is trimmed from the front of the window (retaining 4 s of tail context) and TranscriptResult.WindowTrimmed is emitted so TextInjector can re-anchor its alignment state. Silence runs (2 consecutive blank strides) and divergence loops (window > 12 s with no common prefix) trigger unconditional force-trims. Every raw transcript passes through an 8-step post-processing pipeline before emission: filler-word removal → stutter collapse (≥ 3× repeats) → phrase-loop deduplication → leading-dot strip → leading-punct strip → multi-dot normalisation → missing sentence-space repair → sentence-boundary capitalisation. A final inference pass runs over the entire remaining window when recording stops; clips shorter than 1.25 s are zero-padded to give the encoder sufficient frames.
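The stable-prefix check described above can be illustrated with a small function that returns the leading words shared by all recent partials. This is a sketch under assumptions: the name `stableLeadingWords`, whitespace tokenisation, and case-insensitive comparison are illustrative choices, and mapping the confirmed words back to an audio offset for trimming is not shown.

```kotlin
// Hypothetical sketch: longest run of leading words shared by all recent partials.
fun stableLeadingWords(recentPartials: List<String>): List<String> {
    if (recentPartials.isEmpty()) return emptyList()
    val tokenised = recentPartials.map { p ->
        p.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    }
    val shortest = tokenised.minOf { it.size }
    val prefix = mutableListOf<String>()
    for (i in 0 until shortest) {
        val word = tokenised[0][i]
        // Stop at the first position where any partial disagrees (case-insensitive).
        if (tokenised.any { !it[i].equals(word, ignoreCase = true) }) break
        prefix += word
    }
    return prefix
}
```

Passing the last 3 partials gives the words that have stopped changing between strides; an empty result means no prefix is stable yet.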
To add a new model backend, implement the SpeechEngine interface:
```kotlin
interface SpeechEngine {
    val isLoaded: Boolean
    fun load(modelDir: File)
    fun transcribe(chunk: AudioChunk): TranscriptResult
    fun close()
}
```

To add, for example, a Whisper or Moonshine backend:

- Create a new class implementing `SpeechEngine` (e.g. `WhisperEngine`).
- Add a `ModelId` enum value and a `ModelInfo` entry in `ModelRegistry` - this covers display name, download URLs, file list, and size estimate.
- Add a branch in `SpeechEngineFactory` to instantiate the new engine for that `ModelId`.
The repository and IME layers don't need to change.
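As a rough sketch of those steps, the skeleton below shows what a new backend and its factory branch might look like. Everything except the `SpeechEngine` interface itself is an assumption: `AudioChunk`, `TranscriptResult`, and `ModelId` are simplified stand-ins so the example compiles on its own, `createEngine` is a hypothetical stand-in for `SpeechEngineFactory`, and `WhisperEngine` performs no real inference.

```kotlin
import java.io.File

// Stand-in types so the sketch compiles on its own; the project's real
// AudioChunk, TranscriptResult, and ModelId definitions will differ.
class AudioChunk(val samples: ShortArray)
data class TranscriptResult(val text: String)
enum class ModelId { PARAKEET, WHISPER }

interface SpeechEngine {
    val isLoaded: Boolean
    fun load(modelDir: File)
    fun transcribe(chunk: AudioChunk): TranscriptResult
    fun close()
}

// Hypothetical backend: a real one would create its inference sessions in load().
class WhisperEngine : SpeechEngine {
    override var isLoaded: Boolean = false
        private set

    override fun load(modelDir: File) {
        require(modelDir.isDirectory) { "Model directory not found: $modelDir" }
        isLoaded = true
    }

    override fun transcribe(chunk: AudioChunk): TranscriptResult {
        check(isLoaded) { "Call load() before transcribe()" }
        return TranscriptResult(text = "") // placeholder: no inference is run here
    }

    override fun close() {
        isLoaded = false
    }
}

// Assumed shape of the factory branch; the real SpeechEngineFactory may differ.
fun createEngine(id: ModelId): SpeechEngine = when (id) {
    ModelId.WHISPER -> WhisperEngine()
    ModelId.PARAKEET -> error("Parakeet backend omitted from this sketch")
}
```

Callers only ever see the `SpeechEngine` type returned by the factory, which is why the layers above it can stay unchanged.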
```bash
git clone https://github.com/minburg/outspoke.git
cd outspoke
./gradlew assembleRelease
```

Requirements: Android Studio Meerkat / Gradle 8+, JDK 11, Android SDK 30-36.
A debug build for sideloading:
```bash
./gradlew assembleDebug
# APK: app/build/outputs/apk/debug/app-debug.apk
```

| Permission | Why |
|---|---|
| `RECORD_AUDIO` | Capturing microphone input for speech recognition |
| `INTERNET` | One-time ASR model download from Hugging Face (~700 MB); optional word-correction language pack downloads from minburg/outspoke-data (~8 MB per language) |
| `FOREGROUND_SERVICE` + `FOREGROUND_SERVICE_MICROPHONE` | Keeping the inference engine alive while the keyboard is in use |
| `POST_NOTIFICATIONS` | Showing the required foreground service notification |
No permission is used for any purpose beyond what is listed above.
- Audio stays on your device - all recognition runs locally via ONNX Runtime.
- No analytics, crash reporters, or third-party SDKs are included.
- No accounts or sign-in of any kind.
- The only network access is the one-time ASR model download from Hugging Face, and optional word-correction language pack downloads from minburg/outspoke-data. Both can be done manually if preferred (see manual model installation).
- Word-correction language packs are downloaded only when explicitly enabled by the user and only from the project-owned repository above. No data is sent anywhere — correction runs fully on-device.
Bug reports and pull requests are welcome. Please open an issue first for significant changes so we can discuss the approach.
- Follow the existing Kotlin code style (`kotlin.code.style=official`)
- Keep the `SpeechEngine` interface stable - new engines should be additive
- Unit tests for business logic live in `app/src/test/`
This project is licensed under the GNU General Public License v3.0. See LICENSE for the full text.
The Parakeet-TDT model weights are distributed separately under CC-BY-4.0 by NVIDIA.






