Outspoke

A privacy-focused speech-to-text keyboard (IME) for Android. Speech recognition runs entirely on-device: no internet needed after the initial model download, no account, no data leaving your phone.

It uses NVIDIA's Parakeet-TDT v3 automatic speech recognition model, quantized to INT8 and run via ONNX Runtime for efficient on-device inference. Voice activity detection uses Silero VAD v4 (also ONNX, also fully on-device) to suppress silence before it ever reaches the ASR model.

Features

- Fully offline after setup - audio is never transmitted anywhere
- Real-time transcription - progressive partial results while you speak
- Works in any app - injects text via Android's standard InputConnection API
- Parakeet-TDT 0.6B v3 - INT8 quantized, ~700 MB, runs on mid-range hardware
- Voice Activity Detection - Silero VAD v4 neural network (ONNX) filters silence before it reaches the ASR model; falls back to energy-threshold VAD if the model can't load
- Configurable trigger modes - hold-to-talk or tap-to-toggle
- Word correction bar - optional suggestion bar that appears after dictation, offering up to 5 on-device correction candidates for the word under the cursor. Uses downloadable language packs (dictionary + bigram language model) for Dutch, English, French, German, Italian, Polish, and Spanish. Language packs are fetched on demand from minburg/outspoke-data, the only external source used at runtime besides the one-time ASR model download. All correction runs entirely on-device once files are downloaded; the feature is opt-in and off by default.
- No Google Play Services, no telemetry, no analytics

Requirements

| Requirement | Minimum |
| --- | --- |
| Android version | 11 (API 30) |
| RAM | 4 GB recommended |
| Free storage | ~750 MB (for ASR model files) + up to ~8 MB per language for optional word-correction packs |
| Permissions | `RECORD_AUDIO`, `INTERNET` (model and optional language-pack downloads only), `POST_NOTIFICATIONS` |

The INTERNET permission is used for the one-time ASR model download from Hugging Face, and optionally for downloading word-correction language packs (~8 MB per language, only if you enable the feature). After both are downloaded, the keyboard works fully offline.

Getting Started

1. Install the APK from Releases or build from source (see below).
2. Open the Outspoke app and follow the three setup steps:
   - Enable Outspoke in System Settings → Keyboard / Input Methods
   - Grant the microphone permission
   - Download the model (~700 MB, Wi-Fi recommended)
3. Switch to the Outspoke keyboard in any text field and tap the mic button.

Architecture

Outspoke is structured as a clean layered pipeline. The SpeechEngine interface decouples all inference code from the service and audio layers - adding a new backend means implementing that one interface and nothing else.

┌─────────────────────────────────┐
│         Active App              │
│         (Text Field)            │
└──────────────┬──────────────────┘
               │  InputConnection API
┌──────────────▼──────────────────┐
│  OutspokeInputMethodService     │  ← Android IME service
│  (LifecycleOwner + Compose UI)  │
│  ┌───────────────────────────┐  │
│  │    KeyboardViewModel      │  │  ← UI state + capture lifecycle
│  └───────────────────────────┘  │
└──────────────┬──────────────────┘
               │  binds to
┌──────────────▼──────────────────┐
│       InferenceService          │  ← Foreground service (keeps engine alive)
│  ┌───────────────────────────┐  │
│  │    InferenceRepository    │  │  ← Sliding-window buffer (30 s max)
│  │  ┌─────────────────────┐  │  │
│  │  │   SpeechEngine      │  │  │  ← Interface (swap models here)
│  │  │   (ParakeetEngine)  │  │  │
│  │  └─────────────────────┘  │  │
│  └───────────────────────────┘  │
└──────────────┬──────────────────┘
               │  Flow<AudioChunk>
┌──────────────▼──────────────────┐
│    AudioCaptureManager          │  ← 16 kHz / 16-bit / mono PCM
│    SileroVadFilter              │  ← Neural VAD (Silero v4, ONNX)
│    RMSVadFilter                 │  ← Energy-threshold fallback
└─────────────────────────────────┘
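
The energy-threshold fallback at the bottom of the diagram can be illustrated with a short Kotlin sketch. This is not the project's `RMSVadFilter`: the class name `EnergyVad`, the threshold of 0.01, and the hangover length are assumptions chosen for illustration. It only shows the general idea of gating chunks on RMS energy and keeping the gate open for a few chunks after speech ends.

```kotlin
import kotlin.math.sqrt

/** RMS energy of one chunk of 16-bit PCM samples, scaled to the range [0, 1]. */
fun rmsEnergy(samples: ShortArray): Double {
    if (samples.isEmpty()) return 0.0
    var sumSquares = 0.0
    for (s in samples) {
        val x = s / 32768.0
        sumSquares += x * x
    }
    return sqrt(sumSquares / samples.size)
}

/**
 * Hypothetical energy gate with hangover. A chunk counts as speech when its
 * RMS energy reaches [threshold]; the next [hangoverChunks] quiet chunks are
 * still passed through so word endings are not clipped.
 */
class EnergyVad(
    private val threshold: Double = 0.01, // assumed value, not taken from the project
    private val hangoverChunks: Int = 10  // assumed: 10 chunks of 40 ms = 400 ms
) {
    private var remainingHangover = 0

    /** Returns true if the chunk should be forwarded to the recogniser. */
    fun isSpeech(samples: ShortArray): Boolean {
        if (rmsEnergy(samples) >= threshold) {
            remainingHangover = hangoverChunks
            return true
        }
        if (remainingHangover > 0) {
            remainingHangover--
            return true
        }
        return false
    }
}
```

At 16 kHz, a 40 ms chunk holds 640 samples, so each call to `isSpeech` would receive a `ShortArray` of that size.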

Key components

| Package | Class | Role |
| --- | --- | --- |
| `inference` | `SpeechEngine` | Interface - model-agnostic contract for loading, transcribing, and closing any ASR engine |
| `inference` | `ParakeetEngine` | Implements `SpeechEngine` using three ONNX sessions (preprocessor → encoder → decoder/joint) |
| `inference` | `InferenceService` | `LifecycleService` that owns the engine and exposes `InferenceRepository` to bound clients |
| `inference` | `InferenceRepository` | Sliding-window inference driver: buffers audio chunks, waits for ≥ 2 s of context, then fires a partial inference every 1 s up to a 30 s hard ceiling; tracks the last 3 partials and performs stable-chunk trims when a common leading-word prefix is confirmed, emitting `TranscriptResult.WindowTrimmed` to `TextInjector`; force-trims on divergence loops (> 12 s) and silence runs (2 consecutive blank strides); applies a post-processing pipeline to every raw transcript (filler-word removal, stutter collapse ≥ 3×, phrase-loop deduplication, leading-dot / leading-punct stripping, trailing-dot normalisation, missing sentence-space repair, sentence-boundary capitalisation) |
| `audio` | `AudioCaptureManager` | Opens `AudioRecord`, emits 40 ms `AudioChunk`s as a cold `Flow`; drains hardware buffer and VAD hangover on stop |
| `audio` | `VadFilter` | Interface - common contract for VAD implementations (process, flush, isSpeechActive) |
| `audio` | `SileroVadFilter` | Neural VAD using Silero v4 (ONNX); preserves RNN state across chunks; primary filter when model is available |
| `audio` | `RMSVadFilter` | Energy-threshold VAD; used as fallback when Silero ONNX model can't load |
| `ime` | `OutspokeInputMethodService` | Core IME; wires Compose view tree, binds `InferenceService`, drives capture lifecycle |
| `ime` | `TextInjector` | Writes partial/final text into the focused field via `InputConnection`; keeps the last 6 words as a mutable composing span (underlined) and permanently freezes earlier words; delegates new-content discovery to `TranscriptAligner.findNewContent`; on `WindowTrimmed` performs a three-step reset (commit composing minus last 2 uncertain tail words, clear `lastPartial`, re-anchor `committedWords` from the actual field content); two-layer alignment recovery (field-scan → composing-commit fallback) prevents silent word drops on complete divergence |
| `ime` | `TranscriptAligner` | Stateless alignment utilities (`normalizeWord`, `splitToWords`, `findNewContent`); `findNewContent` uses a three-layer overlap search - (1) full prefix match, (2) suffix-prefix overlap ≥ 2 words, (3) interior scan ≥ 2 words - to locate genuinely new content in a fresh partial relative to already-committed words, tolerating Parakeet attention drift and post-trim leading garbage tokens |
| `ui` | `KeyboardViewModel` | Bridges IME lifecycle, audio capture, and inference results into `KeyboardUiState`; owns `captureJob` |
| `settings` | `ModelDownloadManager` | Downloads model files from Hugging Face over OkHttp with SHA-256 verification |
| `settings` | `ModelStorageManager` | Manages model file paths inside `filesDir` (no external storage permission needed) |
| `ime/correction` | `WordSuggestionProvider` | Public façade for on-device word correction; loads only user-selected languages and delivers results on the main thread |
| `ime/correction` | `WordCorrector` | Orchestrates the correction pipeline: phonetic candidate generation → bigram language model re-ranking |
| `ime/correction` | `SuggestionFileDownloader` | Downloads dictionary + ARPA LM files for a given language from minburg/outspoke-data; supports resumable downloads and SHA-256 verification |
| `ui/keyboard/components` | `SuggestionBar` | Animated chip row that appears after dictation commits, showing up to 5 correction candidates |
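
To make the second layer of the `findNewContent` search (suffix-prefix overlap of at least 2 words) more concrete, here is a minimal Kotlin sketch. It is not the project's implementation: the function names `normalize` and `newWordsAfterOverlap` and the normalisation rule (lower-casing and dropping punctuation) are assumptions, and the other two layers are left out.

```kotlin
/** Lower-cases a word and drops non-alphanumeric characters for comparison. */
fun normalize(word: String): String =
    word.lowercase().filter { it.isLetterOrDigit() }

/**
 * Looks for the longest run of words that ends [committed] and also starts
 * [partial]. Returns the words of [partial] after that run, or null when the
 * overlap is shorter than [minOverlap] words.
 */
fun newWordsAfterOverlap(
    committed: List<String>,
    partial: List<String>,
    minOverlap: Int = 2
): List<String>? {
    val maxLen = minOf(committed.size, partial.size)
    // Try the longest possible overlap first, then shrink.
    for (len in maxLen downTo minOverlap) {
        val tail = committed.takeLast(len).map(::normalize)
        val head = partial.take(len).map(::normalize)
        if (tail == head) return partial.drop(len)
    }
    return null
}
```

A `null` result is where a caller would fall through to another strategy, such as the interior scan described in the table.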

Inference pipeline (Parakeet-TDT v3)

1. Raw PCM (16-bit signed) is normalised to float32 [-1, 1]
2. nemo128.onnx - computes 128-dim log-mel spectrogram features
3. encoder-model.int8.onnx - FastConformer encoder → [B, 1024, T_enc]
4. decoder_joint-model.int8.onnx - greedy TDT decoding with LSTM state carry-over
5. Token IDs are mapped to text via vocab.txt
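
The first step, PCM normalisation, could look roughly like the Kotlin sketch below. The function names are hypothetical, and the divisor of 32768 and the little-endian byte order are assumptions (they are common choices for 16-bit PCM); the project may scale or decode differently.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/** Scales 16-bit signed samples to floats; -32768 maps to -1.0. */
fun pcm16ToFloat(samples: ShortArray): FloatArray =
    FloatArray(samples.size) { i -> samples[i] / 32768f }

/** Same conversion for a raw byte buffer holding little-endian 16-bit PCM. */
fun pcm16BytesToFloat(bytes: ByteArray): FloatArray {
    val shorts = ByteBuffer.wrap(bytes)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer()
    // A trailing odd byte, if any, is ignored by asShortBuffer().
    return FloatArray(shorts.remaining()) { i -> shorts.get(i) / 32768f }
}
```

With this divisor the largest positive sample (32767) lands just below 1.0, so the output stays inside the [-1, 1] range stated above.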

Partial results are emitted every ~1 s once ≥ 2 s of audio is in the rolling window; the window grows up to a hard 30 s ceiling. After every partial the last 3 results are compared: if their leading words form a stable common prefix, the corresponding audio is trimmed from the front of the window (retaining 4 s of tail context) and TranscriptResult.WindowTrimmed is emitted so TextInjector can re-anchor its alignment state. Silence runs (2 consecutive blank strides) and divergence loops (window > 12 s with no common prefix) trigger unconditional force-trims.

Every raw transcript passes through an 8-step post-processing pipeline before emission: filler-word removal → stutter collapse (≥ 3× repeats) → phrase-loop deduplication → leading-dot strip → leading-punct strip → multi-dot normalisation → missing sentence-space repair → sentence-boundary capitalisation.

A final inference pass runs over the entire remaining window when recording stops; clips shorter than 1.25 s are zero-padded to give the encoder sufficient frames.
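
The stable-prefix check on the last 3 partials can be sketched in Kotlin as follows. This is an illustration only: `words` and `stablePrefix` are hypothetical names, the case-insensitive comparison is an assumption, and mapping the prefix back to an audio offset (including the 4 s of retained tail context) is not shown.

```kotlin
/** Splits a transcript into whitespace-separated words. */
fun words(text: String): List<String> =
    text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }

/**
 * Returns the leading words shared by every transcript in [partials],
 * compared case-insensitively. An empty result means no stable prefix.
 */
fun stablePrefix(partials: List<String>): List<String> {
    if (partials.isEmpty()) return emptyList()
    val tokenized = partials.map(::words)
    val shortest = tokenized.minOf { it.size }
    val prefix = mutableListOf<String>()
    for (i in 0 until shortest) {
        val candidate = tokenized[0][i]
        // Stop at the first position where any partial disagrees.
        if (!tokenized.all { it[i].equals(candidate, ignoreCase = true) }) break
        prefix += candidate
    }
    return prefix
}
```

A caller would pass the three most recent partials and treat a non-empty result as the words that are safe to commit before trimming the window.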

Adding a New Model

To add a new model backend, implement the SpeechEngine interface:

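import java.io.File

interface SpeechEngine {
    val isLoaded: Boolean                                // whether a model is currently loaded
    fun load(modelDir: File)                             // load model files from modelDir
    fun transcribe(chunk: AudioChunk): TranscriptResult  // run ASR on one chunk of audio
    fun close()                                          // release the engine
}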

To add, for example, a Whisper or Moonshine backend:

1. Create a new class implementing SpeechEngine (e.g. WhisperEngine).
2. Add a ModelId enum value and a ModelInfo entry in ModelRegistry - this covers display name, download URLs, file list, and size estimate.
3. Add a branch in SpeechEngineFactory to instantiate the new engine for that ModelId.

The repository and IME layers don't need to change.
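
The steps above can be sketched as a self-contained Kotlin skeleton. Everything except the `SpeechEngine` interface shape is an assumption: `AudioChunk`, `TranscriptResult.Partial`, the enum constants, and the factory's `create` method are stand-ins so the example compiles on its own, and the engine returns a placeholder string instead of running a model. The `ModelInfo` / `ModelRegistry` entry is omitted.

```kotlin
import java.io.File

// Stand-in types; the real project definitions will differ.
class AudioChunk(val samples: FloatArray)

sealed interface TranscriptResult {
    data class Partial(val text: String) : TranscriptResult
}

interface SpeechEngine {
    val isLoaded: Boolean
    fun load(modelDir: File)
    fun transcribe(chunk: AudioChunk): TranscriptResult
    fun close()
}

// Hypothetical engine that only tracks its loaded state.
class WhisperEngine : SpeechEngine {
    override var isLoaded: Boolean = false
        private set

    override fun load(modelDir: File) {
        require(modelDir.isDirectory) { "Model directory not found: $modelDir" }
        // A real engine would open its inference sessions here.
        isLoaded = true
    }

    override fun transcribe(chunk: AudioChunk): TranscriptResult {
        check(isLoaded) { "load() must be called before transcribe()" }
        return TranscriptResult.Partial("[${chunk.samples.size} samples]")
    }

    override fun close() {
        // A real engine would release native resources here.
        isLoaded = false
    }
}

// Assumed enum constants and factory signature.
enum class ModelId { PARAKEET_TDT_V3, WHISPER }

object SpeechEngineFactory {
    fun create(id: ModelId): SpeechEngine = when (id) {
        ModelId.WHISPER -> WhisperEngine()
        ModelId.PARAKEET_TDT_V3 -> error("Omitted from this sketch")
    }
}
```

Because callers only see `SpeechEngine`, code that obtains an engine through the factory would not need to know which backend it received.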

Building from Source

git clone https://github.com/minburg/outspoke.git
cd outspoke
./gradlew assembleRelease

Requirements: Android Studio Meerkat / Gradle 8+, JDK 11, Android SDK 30-36.

A debug build for sideloading:

./gradlew assembleDebug
# APK: app/build/outputs/apk/debug/app-debug.apk

Permissions

| Permission | Why |
| --- | --- |
| `RECORD_AUDIO` | Capturing microphone input for speech recognition |
| `INTERNET` | One-time ASR model download from Hugging Face (~700 MB); optional word-correction language pack downloads from minburg/outspoke-data (~8 MB per language) |
| `FOREGROUND_SERVICE` + `FOREGROUND_SERVICE_MICROPHONE` | Keeping the inference engine alive while the keyboard is in use |
| `POST_NOTIFICATIONS` | Showing the required foreground service notification |

No permission is used for any purpose beyond what is listed above.

Privacy

- Audio stays on your device - all recognition runs locally via ONNX Runtime.
- No analytics, crash reporters, or third-party SDKs are included.
- No accounts or sign-in of any kind.
- The only network access is the one-time ASR model download from Hugging Face, and optional word-correction language pack downloads from minburg/outspoke-data. Both can be done manually if preferred (see manual model installation).
- Word-correction language packs are downloaded only when explicitly enabled by the user and only from the project-owned repository above. No data is sent anywhere; correction runs fully on-device.

Contributing

Bug reports and pull requests are welcome. Please open an issue first for significant changes so we can discuss the approach.

- Follow the existing Kotlin code style (kotlin.code.style=official)
- Keep the SpeechEngine interface stable - new engines should be additive
- Unit tests for business logic live in app/src/test/

License

This project is licensed under the GNU General Public License v3.0. See LICENSE for the full text.

The Parakeet-TDT model weights are distributed separately under CC-BY-4.0 by NVIDIA.
