VoxCPM2 Portable

Portable Windows build of VoxCPM2 — multilingual TTS with Voice Design, Cloning & end-to-end LoRA fine-tuning (video/audio → dataset → training).

Russian version

🚀 One-click cross-platform install via Pinokio:

Works on Windows / Linux (x64 & aarch64) / macOS. No install.bat required — Pinokio manages CUDA, Python 3.12, PyTorch, Flash-Attn wheels, and FFmpeg for you.

Launcher repo: timoncool/VoxCPM2_portable-pinokio

Generate natural multilingual speech, design brand-new voices from text descriptions, clone a voice from a reference clip, and train your own LoRA straight from a video or audio file. Upload 8 minutes of a podcast and the app slices it into clips, transcribes them, picks the training parameters, and starts the run. Everything runs locally: no cloud, no API keys. One-click install on Windows; runs on any NVIDIA GPU with 8+ GB VRAM.

Built on VoxCPM2 by OpenBMB — a tokenizer-free 2B-parameter diffusion autoregressive TTS model trained on 2M+ hours of speech.

Why VoxCPM2 Portable?

Free forever — no API keys, no credits, no usage limits
Private — your voice data never leaves your machine
Portable — everything in one folder, copy to USB, delete = uninstall
One-click — install.bat → run.bat → generate speech
30 languages — Russian, English, Chinese, French, German, Japanese, Korean and more
Auto-dataset from video/audio — drop a long file, the app does ffmpeg → ASR → VAD → sentence-aware splitting → optimal params → training, all on its own

Features

Text-to-Speech

30 languages — RU/EN/ZH (+9 Chinese dialects)/AR/FR/DE/HI/IT/JA/KO/PT/ES + more — auto language detection
48 kHz studio output via AudioVAE V2 super-resolution (16→48 kHz)
Natural prosody — tokenizer-free diffusion autoregressive architecture
Output formats: MP3 (default), WAV, FLAC, OGG
Live-streaming playback — audio starts playing during generation (8 sec prebuffer + 2 sec progressive chunks), no need to wait for full synthesis
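
The prebuffer-then-chunks behavior can be pictured with a small generator. This is only a sketch under assumptions: the function name `rechunk_for_playback` is hypothetical, the model is assumed to yield audio frames of arbitrary length as lists of samples, and the 8 s / 2 s figures are taken from the line above. It is not the code used in app.py.

```python
from typing import Iterable, Iterator, List


def rechunk_for_playback(
    frames: Iterable[List[float]],
    sample_rate: int = 48_000,
    prebuffer_sec: float = 8.0,
    chunk_sec: float = 2.0,
) -> Iterator[List[float]]:
    """Yield one prebuffer-sized chunk first, then fixed-size progressive chunks."""
    prebuffer_len = int(prebuffer_sec * sample_rate)
    chunk_len = int(chunk_sec * sample_rate)
    pending: List[float] = []
    target = prebuffer_len  # size of the next chunk to emit
    for frame in frames:
        pending.extend(frame)
        while len(pending) >= target:
            yield pending[:target]
            pending = pending[target:]
            target = chunk_len  # after the prebuffer, switch to small chunks
    if pending:
        yield pending  # flush the tail once generation ends
```

A clip shorter than the prebuffer is simply emitted once, as a single chunk, when generation finishes.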

Voice Design

Create voices from text description — gender, age, tone, emotion, pace, accent
Zero-shot — no reference audio needed
6 ready-made examples (EN+RU) with one-click fill

Voice Cloning (with optional Ultimate mode)

Clone any voice from 5-50 seconds of reference audio
Voice pack bundled (~100 voices, RU/EN/FR/DE/JP/KR/AR)
Extra 743 Russian voices on-demand from Slait/russia_voices
Style control: slightly faster, cheerful tone / whispering, intimate / slow and dramatic
Ultimate mode — fill the transcript field → model uses prompt_wav_path + prompt_text + reference_wav_path for max fidelity
Optional ZipEnhancer denoise for noisy references
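
As a rough illustration of how the transcript field might switch modes, the sketch below assembles keyword arguments using the three parameter names quoted above. The helper `build_clone_kwargs` is hypothetical, and two details are assumptions rather than documented behavior: that plain cloning passes only `reference_wav_path`, and that Ultimate mode reuses the same clip as the prompt audio.

```python
from typing import Dict, Optional


def build_clone_kwargs(reference_wav: str, transcript: Optional[str] = None) -> Dict[str, str]:
    """Assemble keyword arguments for a cloning call (hypothetical helper)."""
    # Assumption: plain cloning only needs the reference clip.
    kwargs = {"reference_wav_path": reference_wav}
    text = (transcript or "").strip()
    if text:
        # Assumption: in Ultimate mode the same clip doubles as the prompt,
        # paired with its transcript.
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = text
    return kwargs
```

A whitespace-only transcript is treated as empty, so the call falls back to plain cloning.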

LoRA Fine-Tuning — Auto Pipeline 🎬

Train a LoRA on a whole video or podcast in a single click:

Drop a video/audio file (mp4/mkv/webm/mov/mp3/wav/flac/…)
ffmpeg extracts audio → 16 kHz mono WAV
Parakeet TDT 0.6B v3 (INT8 ONNX build of the NVIDIA NeMo model, ~670 MB, 25 European languages incl. Russian) together with Silero VAD transcribes the audio with word-level timestamps and handles input of any length
Words are grouped into whole sentences by punctuation, short ones get merged, long ones split at commas/pauses — never mid-word
Clips saved to train_data/<name>/audio/clip_NNNN.wav + transcripts.txt
Auto-tune picks r / alpha / steps / lr from minutes of clean speech (official OpenBMB table)
Training starts in the same click (if the checkbox is on)

Typical result for 8 min of speech: ~69 clips, 86 % extraction rate, ~14 min of training on RTX 4090.
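
The sentence-aware splitting step can be sketched as follows. This is an illustration under assumptions, not the app's implementation: the `Word` type and all function names are hypothetical, the 3 s / 15 s limits are borrowed from the clip lengths recommended for manual datasets, and long sentences are split at commas only (the pause-based splitting mentioned above is left out).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def _duration(words: List[Word]) -> float:
    return words[-1].end - words[0].start


def split_into_sentences(words: List[Word]) -> List[List[Word]]:
    """Close a sentence after every word that ends with terminal punctuation."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith((".", "!", "?", "…")):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_long(sentence: List[Word], max_sec: float) -> List[List[Word]]:
    """Split an over-long sentence at the comma closest to its midpoint."""
    if _duration(sentence) <= max_sec:
        return [sentence]
    cuts = [i + 1 for i, w in enumerate(sentence[:-1]) if w.text.endswith(",")]
    if not cuts:
        return [sentence]  # no word boundary with a comma: keep it whole
    mid = (sentence[0].start + sentence[-1].end) / 2
    cut = min(cuts, key=lambda i: abs(sentence[i - 1].end - mid))
    return split_long(sentence[:cut], max_sec) + split_long(sentence[cut:], max_sec)


def merge_short(pieces: List[List[Word]], min_sec: float, max_sec: float) -> List[List[Word]]:
    """Append a piece to the previous one while that one is still too short."""
    merged: List[List[Word]] = []
    for piece in pieces:
        if merged and _duration(merged[-1]) < min_sec and _duration(merged[-1] + piece) <= max_sec:
            merged[-1] = merged[-1] + piece
        else:
            merged.append(piece)
    return merged


def words_to_clips(words: List[Word], min_sec: float = 3.0, max_sec: float = 15.0) -> List[List[Word]]:
    pieces: List[List[Word]] = []
    for sentence in split_into_sentences(words):
        pieces.extend(split_long(sentence, max_sec))
    return merge_short(pieces, min_sec, max_sec)
```

Because every cut falls between two `Word` entries, a clip can never start or end mid-word. Each clip's audio range is the `start` of its first word to the `end` of its last.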

LoRA Fine-Tuning — Manual Mode

For pre-cleaned datasets:

Upload 5-50 WAV/MP3 clips (3-15 sec each) + transcripts in filename.wav|text format
Defaults match the official OpenBMB YAML (voxcpm_finetune_lora.yaml): r=32, alpha=32, steps=1000, lr=1e-4
Live training log, auto-cleanup of old checkpoints on re-runs
Hot-swap LoRAs across all modes without restart

All model parameters exposed

CFG Scale · Inference Steps · Min/Max length · Retry-on-bad-case · Retry max attempts · Retry ratio threshold · Text normalization (wetext) · Denoise reference · Streaming (live progress) · Seed + Lock

Interface

i18n RU/EN — RU/EN buttons in the header for instant switch
Dark theme with gradient header
Bundled FFmpeg portable (for MP3/OGG encoding)
Auto-download — VoxCPM2 model (~4-5 GB) + voice pack + ASR model (~670 MB, lazy) on first use
Auto-port, auto-browser — opens on localhost automatically
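
One common way to implement automatic port selection is to let the operating system pick an unused port by binding to port 0. The sketch below shows that technique; whether app.py does it this way is not stated here, and `find_free_port` is a hypothetical name.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The returned port could then be passed to Gradio's `launch()` through its `server_port` argument, with `inbrowser=True` to open the browser tab. There is a small window in which another process could claim the port between this check and the actual launch.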

GPU Accelerators (out of the box)

| GPU | Flash Attention 2 | SDPA flash | bfloat16 | AMP (training) | ONNX CUDA (ASR) |
|---|---|---|---|---|---|
| RTX 30xx / 40xx / 50xx | ✅ | ✅ | ✅ | ✅ | ✅ |
| GTX 10xx / RTX 20xx | ❌ | ✅ | ✅ | ✅ | ✅ |
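
The split in the table follows the CUDA compute capability: Flash Attention 2 targets Ampere-class GPUs and newer (capability 8.0 and up), which is where RTX 30xx starts. A minimal sketch of that decision, with a hypothetical function name and hypothetical return labels:

```python
from typing import Tuple


def pick_attention_backend(compute_capability: Tuple[int, int]) -> str:
    """Choose an attention backend label from a (major, minor) capability tuple."""
    major, _minor = compute_capability
    # Flash Attention 2 needs Ampere (8.x) or newer; older cards fall back to SDPA.
    return "flash_attention_2" if major >= 8 else "sdpa"
```

In PyTorch the tuple can be read with `torch.cuda.get_device_capability()`. For example, an RTX 3060 reports `(8, 6)` and an RTX 2080 reports `(7, 5)`.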

System Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8 GB | 12+ GB |
| RAM | 16 GB | 32 GB |
| Disk | 15 GB | 30 GB (with voice pack & LoRA) |
| OS | Windows 10/11 | Windows 11 |
| GPU | RTX 2080 / RTX 3060 | RTX 4070+ |

CPU-only mode is supported but very slow (minutes per phrase).

Quick Start

1. Clone

git clone https://github.com/timoncool/VoxCPM2_portable.git
cd VoxCPM2_portable

2. Install

install.bat

Select your GPU type (6 options). Installs portable Python 3.12 + PyTorch 2.7.1 + voxcpm + Flash Attention 2 + FFmpeg + onnx-asr + default voice pack. Nothing system-wide.

3. Run

run.bat

Browser opens automatically. Model downloads on first run (~4-5 GB to models/). Parakeet ASR (~670 MB) is fetched only when you click Auto-prepare dataset.

Launchers

| Script | Description |
|---|---|
| `install.bat` | One-click installer: Python + PyTorch + voxcpm + accelerators + FFmpeg + onnx-asr + voice pack |
| `run.bat` | Launch Gradio UI with full environment isolation |
| `update.bat` | Update portable wrapper + voxcpm package |

LoRA Training — Full Guide

Option A — Auto from video/audio (recommended)

Open the LoRA tab
Enter the Dataset / LoRA name (used for both train_data/<name>/ and lora/<name>/)
Go to the 🎬 Auto sub-tab
Drop your video or audio file (mp4, mkv, webm, mp3, wav, flac, m4a, ogg, opus, …)
Leave 🤖 Auto-tune on (picks r / α / steps / lr from minutes of clean speech)
Check Start training after dataset is ready if you want it to train without a second click
Hit 🎬 Auto-prepare

The app streams a log with each stage: ffmpeg extract → Parakeet transcription → segmentation → clip save → training. Final LoRA checkpoint lands in lora/<name>/step_XXXX/ and is immediately available in the LoRA dropdown across all tabs.
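
The first stage, extracting 16 kHz mono WAV with ffmpeg, can be reproduced by hand. The sketch below only builds the command line; the function name is hypothetical, and the 16-bit PCM codec (`pcm_s16le`) is an assumption, since the text above specifies only the sample rate and channel count.

```python
from pathlib import Path
from typing import List


def build_extract_command(source: Path, target: Path, ffmpeg: str = "ffmpeg") -> List[str]:
    """Build an ffmpeg command that converts any media file to 16 kHz mono WAV."""
    return [
        ffmpeg,
        "-y",                  # overwrite the target if it already exists
        "-i", str(source),     # input video or audio file
        "-vn",                 # drop any video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # assumption: 16-bit PCM samples
        str(target),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. With the portable build, `ffmpeg` would point at the executable inside the bundled `ffmpeg/` folder.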

Option B — Manual

Prepare your clips (5-50 WAV/MP3, 3-15 sec each) + transcripts in filename.wav|text format, one per line
Switch to the 🎓 Manual sub-tab in the LoRA tab
Upload files and paste transcripts
Optionally tweak r / α / steps / lr sliders (defaults are the official OpenBMB values)
Hit 🎓 Start training
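
Before pasting transcripts, it can help to check that every line follows the `filename.wav|text` format. A minimal validator sketch (the function name is hypothetical, and the app's own parsing rules may differ, for example in how it treats blank lines):

```python
from typing import Dict


def parse_transcripts(raw: str) -> Dict[str, str]:
    """Parse 'filename.wav|text' lines into a {filename: text} mapping."""
    entries: Dict[str, str] = {}
    for line_no, line in enumerate(raw.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '|' only, so the text itself may contain '|'.
        filename, sep, text = line.partition("|")
        filename, text = filename.strip(), text.strip()
        if not sep or not filename or not text:
            raise ValueError(f"line {line_no}: expected 'filename.wav|text', got {line!r}")
        entries[filename] = text
    return entries
```

Comparing the returned keys with the names of the uploaded files catches missing or misspelled entries before training starts.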

Recommended settings — auto-tuned from minutes of clean speech

The app computes steps = target_epochs × n_clips / effective_batch, rounded to 50, and picks grad_accum adaptively (4 for < 200 clips, 8 for < 500, 16 for 500+). Table of target epochs:

| Clean speech | Target epochs | r / α | lr |
|---|---|---|---|
| < 2 min | 25 (⚠ below OpenBMB minimum) | 32 / 32 | 1e-4 |
| 2-5 min | 20 | 32 / 32 | 1e-4 |
| 5-10 min (sweet spot) | 15 | 32 / 32 | 1e-4 |
| 10-20 min | 12 | 32 / 32 | 1e-4 |
| 20-60 min | 8 | 32 / 32 | 1e-4 |
| 60-120 min | 5 | 64 / 64 | 5e-5 |
| 120+ min | 3 | 64 / 64 | 5e-5 |

For 69 clips / 7.6 min (typical 8 min podcast) → 250 steps → ≈3-4 min of training on RTX 4090.
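
The arithmetic can be checked with a short sketch. Several details are assumptions chosen to reproduce the worked example rather than facts about the app: a per-device batch size of 1 (so the effective batch equals `grad_accum`), rounding to the nearest multiple of 50, a floor of 50 steps, and half-open duration ranges (exactly 5 min falls in the 5-10 row). The names `LoraPlan` and `plan_lora` are hypothetical.

```python
from typing import NamedTuple


class LoraPlan(NamedTuple):
    rank: int
    alpha: int
    lr: float
    grad_accum: int
    steps: int


# (upper bound in minutes, target epochs, rank and alpha, learning rate)
_TABLE = [
    (2, 25, 32, 1e-4),
    (5, 20, 32, 1e-4),
    (10, 15, 32, 1e-4),
    (20, 12, 32, 1e-4),
    (60, 8, 32, 1e-4),
    (120, 5, 64, 5e-5),
    (float("inf"), 3, 64, 5e-5),
]


def plan_lora(minutes: float, n_clips: int, batch_size: int = 1) -> LoraPlan:
    """Pick r / alpha / lr / steps from minutes of clean speech and clip count."""
    for upper, epochs, rank, lr in _TABLE:
        if minutes < upper:
            break  # first row whose upper bound exceeds the duration
    if n_clips < 200:
        grad_accum = 4
    elif n_clips < 500:
        grad_accum = 8
    else:
        grad_accum = 16
    effective_batch = batch_size * grad_accum
    raw_steps = epochs * n_clips / effective_batch
    steps = max(50, round(raw_steps / 50) * 50)  # nearest multiple of 50
    return LoraPlan(rank, rank, lr, grad_accum, steps)
```

For 69 clips and 7.6 minutes this gives 15 × 69 / 4 = 258.75, which rounds to 250 steps, matching the figure above.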

Activating a trained LoRA

In any tab (TTS / Voice Design / Voice Cloning) expand Advanced settings
Pick your LoRA in the 🧬 LoRA dropdown
First activation takes ~30-60 sec (model reloads with the LoRA r/α structure)
Subsequent switches are instant hot-swap
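
Trained LoRAs can also be located from a script by scanning the `lora/<name>/step_XXXX/` layout. The sketch below returns the directory with the highest step number. The function name is hypothetical, and it assumes the step folders are named `step_` followed by digits only.

```python
import re
from pathlib import Path
from typing import Optional

_STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(lora_root: Path, name: str) -> Optional[Path]:
    """Return the step_XXXX directory with the highest step, or None if absent."""
    run_dir = lora_root / name
    if not run_dir.is_dir():
        return None
    best = None  # (step number, directory) of the newest checkpoint seen so far
    for child in run_dir.iterdir():
        match = _STEP_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None
```

Comparing the step as an integer rather than as a string keeps the ordering correct even if the folder names are not zero-padded.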

Architecture

VoxCPM2_portable/
├── app.py              # Gradio UI (4 tabs: TTS / Voice Design / Cloning / LoRA)
├── install.bat         # GPU selector + installer
├── run.bat             # Launcher with env isolation
├── update.bat          # Updater
├── requirements.txt    # Python dependencies
├── training/
│   ├── scripts/        # Official OpenBMB train & inference scripts (bundled)
│   └── conf/           # YAML config templates
├── python/             # Portable Python 3.12 (created by install.bat)
├── models/             # HuggingFace cache (VoxCPM2 ~4-5 GB, Parakeet ~670 MB, Silero VAD, …)
├── voices/             # Voice pack (bundled default ~100 voices + user downloads)
├── lora/               # Trained LoRA checkpoints (lora/<name>/step_XXXX/)
├── train_data/         # User LoRA datasets (audio + transcripts)
├── ffmpeg/             # Portable FFmpeg (for MP3/OGG encoding)
├── output/             # Generated audio files with timestamps
└── cache/ / temp/      # General cache / tempdir

Updating

update.bat

Links

OpenBMB / VoxCPM — original project
VoxCPM2 model card — weights
Demo page with audio samples
Fine-tuning Guide
Parakeet TDT 0.6B v3 (multilingual ASR)
onnx-asr — Python wrapper used for Parakeet

Other Portable Neural Networks

| Project | Description |
|---|---|
| ACE-Step Studio | Local AI music generation studio |
| Foundation Music Lab | Music generation + timeline editor |
| VibeVoice ASR | Speech recognition (ASR) |
| LavaSR | Audio quality enhancement |
| Qwen3-TTS | Text-to-speech by Qwen |
| SuperCaption Qwen3-VL | Image captioning |
| VideoSOS | AI video production |
| RC Stable Audio Tools | Music and audio generation |

Authors

Built by Nerual Dreming, founder of ArtGeneration.me, tech blogger and neuro-evangelist. His channel, Нейро-Софт, publishes portable builds of useful AI tools.

Acknowledgments

OpenBMB / VoxCPM team — open source VoxCPM2 model
NVIDIA NeMo / Parakeet TDT — multilingual ASR
istupakov/onnx-asr — ONNX wrapper for Parakeet + Silero VAD
Slait/russia_voices — 743 Russian voice presets
lldacing/flash-attention-windows-wheel — Windows Flash Attention 2 wheels
Gradio — UI framework
FFmpeg — audio encoding

Support Open-Source AI

Hi! I'm Ilya (Nerual Dreming), and I build AI tools that anyone can run locally — for free, without cloud, without subscriptions. Your donation lets me focus on research and building new open-source projects instead of surviving. Thank you!

All methods (Card / PayPal / Apple Pay) | Monthly on Boosty

BTC: 1E7dHL22RpyhJGVpcvKdbyZgksSYkYeEBC
ETH (ERC20): 0xb5db65adf478983186d4897ba92fe2c25c594a0c
USDT (TRC20): TQST9Lp2TjK6FiVkn4fwfGUee7NmkxEE7C
