Summary
Add aprender::voice module for voice-specific ML processing, enabling ElevenLabs-style functionality in pure Rust.
Module Structure
aprender/src/voice/
├── mod.rs
├── embedding.rs # Speaker embeddings (d-vector, x-vector, ECAPA-TDNN)
├── style.rs # Voice style transfer (OpenVoice-style)
├── clone.rs # Voice cloning from reference samples
├── conversion.rs # Voice-to-voice conversion
└── isolation.rs # Voice isolation / noise removal
Capabilities
Speaker Embeddings
- Extract speaker identity vector from audio
- Support multiple embedding types: d-vector, x-vector, ECAPA-TDNN, Resemblyzer
- Cosine similarity for speaker verification
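As a rough illustration of the verification step, the sketch below computes cosine similarity over two raw embedding vectors and applies a threshold. The names `cosine_similarity` and `same_speaker` and the threshold parameter are assumptions made for this example; the issue itself only proposes `speaker_similarity` and does not specify how degenerate inputs should be handled.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns `None` when the lengths differ, the input is empty,
/// or either vector has zero norm (the similarity is undefined there).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical verification helper: treats two embeddings as the same
/// speaker when their similarity reaches `threshold`. The threshold is
/// model-dependent and would need tuning on real data.
pub fn same_speaker(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

Returning `Option` rather than a bare `f32` is one possible design; it makes mismatched embedding dimensions (for example, comparing a d-vector against an x-vector) visible to the caller instead of silently producing a number.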
Voice Style Transfer
- Transfer prosody/style from one voice to another
- Preserve content while changing voice characteristics
- OpenVoice-style architecture
Voice Cloning
- Create voice clone from short reference samples (3-10 seconds)
- Zero-shot voice cloning
- Multi-speaker support
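The 3-10 second reference window could be enforced before any model runs. The following is a minimal sketch, assuming mono PCM samples and a caller-supplied sample rate; the function and constant names are hypothetical and not part of the proposed API.

```rust
/// Bounds for a usable reference clip, taken from the 3-10 second
/// range mentioned in the issue.
pub const MIN_REFERENCE_SECS: f32 = 3.0;
pub const MAX_REFERENCE_SECS: f32 = 10.0;

/// Duration in seconds of a mono sample buffer at `sample_rate` Hz.
/// A zero sample rate yields 0.0 rather than dividing by zero.
pub fn duration_secs(samples: &[f32], sample_rate: u32) -> f32 {
    if sample_rate == 0 {
        return 0.0;
    }
    samples.len() as f32 / sample_rate as f32
}

/// Returns true when the clip length falls inside the reference window.
pub fn is_valid_reference(samples: &[f32], sample_rate: u32) -> bool {
    let d = duration_secs(samples, sample_rate);
    (MIN_REFERENCE_SECS..=MAX_REFERENCE_SECS).contains(&d)
}
```

A real implementation might trim or pad clips instead of rejecting them; this sketch only shows the length check.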
Voice Conversion
- Real-time voice conversion (voice changer)
- RVC-style retrieval-based conversion
Voice Isolation
- Separate voice from background noise
- Remove music while preserving speech
- demucs/spleeter-style source separation
API Surface
pub mod embedding {
    pub struct SpeakerEmbedding { pub vector: Vec<f32>, pub model_type: EmbeddingModel }
    pub enum EmbeddingModel { DVector, XVector, ECAPA, Resemblyzer }
    pub fn extract_embedding(samples: &[f32], model: &SpeakerEncoder) -> Result<SpeakerEmbedding>;
    pub fn speaker_similarity(a: &SpeakerEmbedding, b: &SpeakerEmbedding) -> f32;
}
pub mod style {
    pub fn transfer_style(content: &[f32], style: &SpeakerEmbedding, model: &StyleTransferModel) -> Result<Vec<f32>>;
}
pub mod clone {
    pub struct VoiceClone { pub embedding: SpeakerEmbedding, pub style_params: StyleParams }
    pub fn create_clone(reference_samples: &[&[f32]], model: &VoiceCloningModel) -> Result<VoiceClone>;
}
pub mod conversion {
    pub fn convert_voice(source: &[f32], target: &SpeakerEmbedding, model: &VoiceConversionModel) -> Result<Vec<f32>>;
}
pub mod isolation {
    pub fn isolate_voice(audio: &[f32], model: &IsolationModel) -> Result<Vec<f32>>;
    pub fn remove_music(audio: &[f32], model: &IsolationModel) -> Result<Vec<f32>>;
}
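Since `create_clone` accepts several reference samples but yields a single `SpeakerEmbedding`, the per-sample embeddings have to be combined somehow. The issue does not say how. One common approach, shown here purely as an assumption, is to average the embeddings and L2-normalize the result. The helper name `mean_embedding` is hypothetical.

```rust
/// Averages a set of equal-length embeddings and L2-normalizes the mean.
/// Returns `None` for an empty set, zero-dimensional or mismatched
/// embeddings, or a mean with zero norm.
pub fn mean_embedding(embeddings: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = embeddings.first()?.len();
    if dim == 0 || embeddings.iter().any(|e| e.len() != dim) {
        return None;
    }

    // Accumulate the element-wise sum, then divide by the count.
    let mut mean = vec![0.0f32; dim];
    for e in embeddings {
        for (m, x) in mean.iter_mut().zip(e) {
            *m += *x;
        }
    }
    let n = embeddings.len() as f32;
    for m in mean.iter_mut() {
        *m /= n;
    }

    // Scale to unit length so cosine similarity reduces to a dot product.
    let norm = mean.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return None;
    }
    for m in mean.iter_mut() {
        *m /= norm;
    }
    Some(mean)
}
```

The resulting vector could populate `SpeakerEmbedding::vector`; whether the actual cloning model pools this way is left open by the issue.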
Dependencies
aprender::audio::mel - mel spectrogram extraction
aprender::audio::resample - sample rate conversion
trueno - SIMD tensor operations
Implementation Priority
- Medium: embedding.rs - foundation for all voice features
- Low: style.rs - voice style transfer
- Low: clone.rs - voice cloning
- Low: conversion.rs - voice conversion
- Low: isolation.rs - voice isolation
Labels
enhancement, voice, ml