Portable Windows build of VoxCPM2 — multilingual TTS with Voice Design, Cloning & end-to-end LoRA fine-tuning (video/audio → dataset → training).
🚀 One-click cross-platform install via Pinokio:
Works on Windows / Linux (x64 & aarch64) / macOS. No install.bat required: Pinokio manages CUDA, Python 3.12, PyTorch, Flash-Attn wheels, and FFmpeg for you.
Launcher repo: timoncool/VoxCPM2_portable-pinokio
Generate natural multilingual speech, design brand-new voices from text descriptions, clone any voice from a reference clip, and train your own LoRA straight from a video or audio file — upload 8 minutes of a podcast and the app slices it into clips, auto-transcribes, picks the optimal training parameters and starts the run. 100% local, no cloud, no API keys. One-click install on Windows, runs on any NVIDIA GPU with 8+ GB VRAM.
Built on VoxCPM2 by OpenBMB — a tokenizer-free 2B-parameter diffusion autoregressive TTS model trained on 2M+ hours of speech.
- Free forever — no API keys, no credits, no usage limits
- Private — your voice data never leaves your machine
- Portable — everything in one folder, copy to USB, delete = uninstall
- One-click: install.bat → run.bat → generate speech
- 30 languages: Russian, English, Chinese, French, German, Japanese, Korean and more
- Auto-dataset from video/audio — drop a long file, the app does ffmpeg → ASR → VAD → sentence-aware splitting → optimal params → training, all on its own
- 30 languages — RU/EN/ZH (+9 Chinese dialects)/AR/FR/DE/HI/IT/JA/KO/PT/ES + more — auto language detection
- 48 kHz studio output via AudioVAE V2 super-resolution (16→48 kHz)
- Natural prosody — tokenizer-free diffusion autoregressive architecture
- Output formats: MP3 (default), WAV, FLAC, OGG
- Live-streaming playback — audio starts playing during generation (8 sec prebuffer + 2 sec progressive chunks), no need to wait for full synthesis
- Create voices from text description — gender, age, tone, emotion, pace, accent
- Zero-shot — no reference audio needed
- 6 ready-made examples (EN+RU) with one-click fill
- Clone any voice from 5-50 seconds of reference audio
- Voice pack bundled (~100 voices, RU/EN/FR/DE/JP/KR/AR)
- Extra 743 Russian voices on demand from Slait/russia_voices
- Style control: slightly faster, cheerful tone / whispering, intimate / slow and dramatic
- Ultimate mode: fill in the transcript field and the model uses prompt_wav_path + prompt_text + reference_wav_path for max fidelity
- Optional ZipEnhancer denoise for noisy references
Train a LoRA on a whole video or podcast in a single click:
- Drop a video/audio file (mp4/mkv/webm/mov/mp3/wav/flac/…)
- ffmpeg extracts audio → 16 kHz mono WAV
- Parakeet TDT 0.6B v3 INT8 ONNX (NVIDIA NeMo, ~670 MB, 25 European languages incl. Russian) + Silero VAD transcribe the audio with word-level timestamps and handle input of any length
- Words are grouped into whole sentences by punctuation, short ones get merged, long ones split at commas/pauses — never mid-word
- Clips saved to train_data/<name>/audio/clip_NNNN.wav + transcripts.txt
- Auto-tune picks r / alpha / steps / lr from minutes of clean speech (official OpenBMB table)
- Training starts in the same click (if the checkbox is on)
Typical result for 8 min of speech: ~69 clips, 86 % extraction rate, ~14 min of training on RTX 4090.
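The sentence-aware splitting step can be illustrated with a minimal sketch. This is not the app's actual code: the `Word` type, the function names, and the 3 s / 15 s thresholds (borrowed from the clip length recommended for manual datasets) are assumptions made for illustration. It groups timestamped words into sentences at terminal punctuation, breaks over-long sentences at commas, and merges pieces that are too short into the following one, so a cut never lands mid-word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str     # token as returned by the ASR, punctuation attached
    start: float  # seconds
    end: float    # seconds


def split_on(words: list[Word], marks: tuple[str, ...]) -> list[list[Word]]:
    """Split a word list after every word that ends in one of `marks`."""
    groups, current = [], []
    for w in words:
        current.append(w)
        if w.text.endswith(marks):
            groups.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        groups.append(current)
    return groups


def duration(group: list[Word]) -> float:
    return group[-1].end - group[0].start


def segment(words: list[Word], min_sec: float = 3.0, max_sec: float = 15.0) -> list[list[Word]]:
    """Return clips as word lists; thresholds are illustrative assumptions."""
    clips, pending = [], []
    for sentence in split_on(words, (".", "!", "?")):
        # Over-long sentences are broken at commas/semicolons, never mid-word.
        parts = split_on(sentence, (",", ";")) if duration(sentence) > max_sec else [sentence]
        for part in parts:
            pending.extend(part)
            # Pieces shorter than min_sec keep accumulating (merge with the next one).
            if duration(pending) >= min_sec:
                clips.append(pending)
                pending = []
    if pending:  # short leftover: attach to the last clip if there is one
        if clips:
            clips[-1].extend(pending)
        else:
            clips.append(pending)
    return clips


def clip_text(clip: list[Word]) -> str:
    return " ".join(w.text for w in clip)
```

Each returned clip carries its own start and end time (`clip[0].start`, `clip[-1].end`), which is what a cutter would need to write clip_NNNN.wav files. The sketch does not strictly enforce the upper bound when a sentence contains no commas.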
For pre-cleaned datasets:
- Upload 5-50 WAV/MP3 clips (3-15 sec each) + transcripts in filename.wav|text format
- Defaults match the official OpenBMB YAML (voxcpm_finetune_lora.yaml): r=32, alpha=32, steps=1000, lr=1e-4
- Live training log, auto-cleanup of old checkpoints on re-runs
- Hot-swap LoRAs across all modes without restart
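For reference, the filename.wav|text transcript format can be parsed as in the sketch below. The function name and error handling are hypothetical and not taken from the app. It splits each line on the first `|` only, so the spoken text itself may contain that character.

```python
def parse_transcripts(text: str) -> dict[str, str]:
    """Map clip filename -> transcript from 'filename.wav|text' lines."""
    pairs: dict[str, str] = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        name, sep, transcript = line.partition("|")
        name, transcript = name.strip(), transcript.strip()
        if not sep or not name or not transcript:
            raise ValueError(f"line {line_no}: expected 'filename.wav|text', got {raw!r}")
        pairs[name] = transcript
    return pairs
```

Validating the pasted transcripts before training starts makes a missing separator or an empty transcript fail early with a line number instead of surfacing mid-run.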
CFG Scale · Inference Steps · Min/Max length · Retry-on-bad-case · Retry max attempts · Retry ratio threshold · Text normalization (wetext) · Denoise reference · Streaming (live progress) · Seed + Lock
- i18n RU/EN — RU/EN buttons in the header for instant switch
- Dark theme with gradient header
- Bundled FFmpeg portable (for MP3/OGG encoding)
- Auto-download — VoxCPM2 model (~4-5 GB) + voice pack + ASR model (~670 MB, lazy) on first use
- Auto-port, auto-browser: opens on localhost automatically
| GPU | Flash Attention 2 | SDPA flash | bfloat16 | AMP (training) | ONNX CUDA (ASR) |
|---|---|---|---|---|---|
| RTX 30xx / 40xx / 50xx | ✅ | ✅ | ✅ | ✅ | ✅ |
| GTX 10xx / RTX 20xx | ❌ | ✅ | ✅ | ✅ | ✅ |
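The split in the table follows GPU generation: Flash Attention 2 needs Ampere or newer (CUDA compute capability 8.0+), while older cards fall back to PyTorch SDPA. A minimal sketch of that decision follows. The function name and the returned labels are hypothetical, and this is not the app's actual selection code.

```python
def pick_attention_backend(compute_capability: tuple[int, int]) -> str:
    """Choose an attention backend from a CUDA compute capability (major, minor).

    The returned strings are illustrative labels, not identifiers from the app.
    """
    major, _minor = compute_capability
    # RTX 30xx/40xx/50xx report major >= 8; GTX 10xx is 6.x and RTX 20xx is 7.5.
    return "flash_attention_2" if major >= 8 else "sdpa"
```

With PyTorch installed, the input tuple can be obtained from `torch.cuda.get_device_capability()`.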
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8 GB | 12+ GB |
| RAM | 16 GB | 32 GB |
| Disk | 15 GB | 30 GB (with voice pack & LoRA) |
| OS | Windows 10/11 | Windows 11 |
| GPU | RTX 2080 / RTX 3060 | RTX 4070+ |
CPU-only mode supported but very slow (minutes per phrase).
git clone https://github.com/timoncool/VoxCPM2_portable.git
cd VoxCPM2_portable
install.bat
Select your GPU type (6 options). Installs portable Python 3.12 + PyTorch 2.7.1 + voxcpm + Flash Attention 2 + FFmpeg + onnx-asr + default voice pack. Nothing system-wide.
run.bat
Browser opens automatically. Model downloads on first run (~4-5 GB to models/). Parakeet ASR (~670 MB) is fetched only when you click Auto-prepare dataset.
| Script | Description |
|---|---|
| install.bat | One-click installer: Python + PyTorch + voxcpm + accelerators + FFmpeg + onnx-asr + voice pack |
| run.bat | Launch Gradio UI with full environment isolation |
| update.bat | Update portable wrapper + voxcpm package |
- Open the LoRA tab
- Enter the Dataset / LoRA name (used for both train_data/<name>/ and lora/<name>/)
- Go to the 🎬 Auto sub-tab
- Drop your video or audio file (mp4, mkv, webm, mp3, wav, flac, m4a, ogg, opus, …)
- Leave 🤖 Auto-tune on (picks r / α / steps / lr from minutes of clean speech)
- Check Start training after dataset is ready if you want it to train without a second click
- Hit 🎬 Auto-prepare
The app streams a log with each stage: ffmpeg extract → Parakeet transcription → segmentation → clip save → training. Final LoRA checkpoint lands in lora/<name>/step_XXXX/ and is immediately available in the LoRA dropdown across all tabs.
- Prepare your clips (5-50 WAV/MP3, 3-15 sec each) + transcripts in filename.wav|text format, one per line
- Switch to the 🎓 Manual sub-tab in the LoRA tab
- Upload files and paste transcripts
- Optionally tweak r / α / steps / lr sliders (defaults are the official OpenBMB values)
- Hit 🎓 Start training
The app computes steps = target_epochs × n_clips / effective_batch, rounded to 50, and picks grad_accum adaptively (4 for < 200 clips, 8 for < 500, 16 for 500+). Table of target epochs:
| Clean speech | target epochs | r / α | lr |
|---|---|---|---|
| < 2 min | 25 (⚠ below OpenBMB minimum) | 32 / 32 | 1e-4 |
| 2-5 min | 20 | 32 / 32 | 1e-4 |
| 5-10 min (sweet spot) | 15 | 32 / 32 | 1e-4 |
| 10-20 min | 12 | 32 / 32 | 1e-4 |
| 20-60 min | 8 | 32 / 32 | 1e-4 |
| 60-120 min | 5 | 64 / 64 | 5e-5 |
| 120+ min | 3 | 64 / 64 | 5e-5 |
For 69 clips / 7.6 min (typical 8 min podcast) → 250 steps → ≈3-4 min of training on RTX 4090.
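The arithmetic behind that figure can be reproduced with a short sketch. Several details are assumptions rather than documented behaviour: the per-device batch size is taken as 1 (so effective_batch equals grad_accum), "rounded to 50" is read as rounding to the nearest multiple of 50 with a floor of 50, and each table range is treated as excluding its upper bound.

```python
def target_epochs(minutes: float) -> int:
    """Target epochs by minutes of clean speech, following the table above."""
    table = [(2, 25), (5, 20), (10, 15), (20, 12), (60, 8), (120, 5)]
    for upper, epochs in table:
        if minutes < upper:  # assumption: upper bounds are exclusive
            return epochs
    return 3  # 120+ min


def grad_accum(n_clips: int) -> int:
    """4 for < 200 clips, 8 for < 500, 16 for 500+."""
    if n_clips < 200:
        return 4
    if n_clips < 500:
        return 8
    return 16


def training_steps(minutes: float, n_clips: int, batch_size: int = 1) -> int:
    """steps = target_epochs * n_clips / effective_batch, rounded to 50."""
    effective_batch = batch_size * grad_accum(n_clips)  # batch_size=1 is assumed
    raw = target_epochs(minutes) * n_clips / effective_batch
    # Nearest multiple of 50, never below 50. Python's round() sends exact
    # halves to the even neighbour.
    return max(50, round(raw / 50) * 50)


print(training_steps(7.6, 69))  # 15 * 69 / 4 = 258.75 -> 250
```

For the 69-clip example this gives 15 epochs and grad_accum 4, which rounds to the 250 steps quoted above.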
- In any tab (TTS / Voice Design / Voice Cloning) expand Advanced settings
- Pick your LoRA in the 🧬 LoRA dropdown
- First activation takes ~30-60 sec (model reloads with the LoRA r/α structure)
- Subsequent switches are instant hot-swap
VoxCPM2_portable/
├── app.py # Gradio UI (4 tabs: TTS / Voice Design / Cloning / LoRA)
├── install.bat # GPU selector + installer
├── run.bat # Launcher with env isolation
├── update.bat # Updater
├── requirements.txt # Python dependencies
├── training/
│ ├── scripts/ # Official OpenBMB train & inference scripts (bundled)
│ └── conf/ # YAML config templates
├── python/ # Portable Python 3.12 (created by install.bat)
├── models/ # HuggingFace cache (VoxCPM2 ~4-5 GB, Parakeet ~670 MB, Silero VAD, …)
├── voices/ # Voice pack (bundled default ~100 voices + user downloads)
├── lora/ # Trained LoRA checkpoints (lora/<name>/step_XXXX/)
├── train_data/ # User LoRA datasets (audio + transcripts)
├── ffmpeg/ # Portable FFmpeg (for MP3/OGG encoding)
├── output/ # Generated audio files with timestamps
├── cache/ / temp/ # General cache / tempdir
update.bat
- OpenBMB / VoxCPM — original project
- VoxCPM2 model card — weights
- Demo page with audio samples
- Fine-tuning Guide
- Parakeet TDT 0.6B v3 (multilingual ASR)
- onnx-asr — Python wrapper used for Parakeet
| Project | Description |
|---|---|
| ACE-Step Studio | Local AI music generation studio |
| Foundation Music Lab | Music generation + timeline editor |
| VibeVoice ASR | Speech recognition (ASR) |
| LavaSR | Audio quality enhancement |
| Qwen3-TTS | Text-to-speech by Qwen |
| SuperCaption Qwen3-VL | Image captioning |
| VideoSOS | AI video production |
| RC Stable Audio Tools | Music and audio generation |
Built by Nerual Dreming — founder of ArtGeneration.me, tech-blogger and neuro-evangelist. Channel Нейро-Софт — portable builds of useful AI tools.
- OpenBMB / VoxCPM team — open source VoxCPM2 model
- NVIDIA NeMo / Parakeet TDT — multilingual ASR
- istupakov/onnx-asr — ONNX wrapper for Parakeet + Silero VAD
- Slait/russia_voices — 743 Russian voice presets
- lldacing/flash-attention-windows-wheel — Windows Flash Attention 2 wheels
- Gradio — UI framework
- FFmpeg — audio encoding
Hi! I'm Ilya (Nerual Dreming), and I build AI tools that anyone can run locally — for free, without cloud, without subscriptions. Your donation lets me focus on research and building new open-source projects instead of surviving. Thank you!
All methods (Card / PayPal / Apple Pay) | Monthly on Boosty
- BTC: 1E7dHL22RpyhJGVpcvKdbyZgksSYkYeEBC
- ETH (ERC20): 0xb5db65adf478983186d4897ba92fe2c25c594a0c
- USDT (TRC20): TQST9Lp2TjK6FiVkn4fwfGUee7NmkxEE7C