feat(voice): Voice processing module - embeddings, style transfer, cloning, isolation #132

@noahgift

Summary

Add aprender::voice module for voice-specific ML processing, enabling ElevenLabs-style functionality in pure Rust.

Module Structure

aprender/src/voice/
├── mod.rs
├── embedding.rs    # Speaker embeddings (d-vector, x-vector, ECAPA-TDNN)
├── style.rs        # Voice style transfer (OpenVoice-style)
├── clone.rs        # Voice cloning from reference samples
├── conversion.rs   # Voice-to-voice conversion
└── isolation.rs    # Voice isolation / noise removal

Capabilities

Speaker Embeddings

  • Extract speaker identity vector from audio
  • Support multiple embedding types: d-vector, x-vector, ECAPA-TDNN, Resemblyzer
  • Cosine similarity for speaker verification
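
A minimal sketch of the cosine-similarity check mentioned above, operating on raw embedding slices. The names `cosine_similarity` and `is_same_speaker`, the `Option` return, and the threshold parameter are assumptions for illustration only; the proposed `speaker_similarity` returns a plain `f32` and may handle mismatched or zero vectors differently.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns `None` when the lengths differ, the input is empty,
/// or either vector has zero norm (similarity is undefined there).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical verification helper: treats two embeddings as the same
/// speaker when their similarity reaches `threshold`. A suitable threshold
/// depends on the embedding model and has to be tuned on real data.
pub fn is_same_speaker(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

Returning `Option` here makes the undefined cases explicit; an `f32`-returning API would need to pick a convention (for example 0.0) for them instead.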

Voice Style Transfer

  • Transfer prosody/style from one voice to another
  • Preserve content while changing voice characteristics
  • OpenVoice-style architecture

Voice Cloning

  • Create voice clone from short reference samples (3-10 seconds)
  • Zero-shot voice cloning
  • Multi-speaker support
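
Since a clone is built from several reference samples but holds a single speaker embedding, the per-sample embeddings need to be combined somehow. The sketch below shows one common option, averaging followed by L2 normalization. This is an assumption, not something the issue specifies: the helper name `average_embeddings` is hypothetical, and the actual cloning model may aggregate references differently.

```rust
/// Hypothetical helper: combines per-sample speaker embeddings into one
/// vector by taking the element-wise mean and then L2-normalizing it.
/// Returns `None` for empty input, mismatched dimensions, or a zero mean.
pub fn average_embeddings(embeddings: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = embeddings.first()?.len();
    if dim == 0 || embeddings.iter().any(|e| e.len() != dim) {
        return None;
    }

    // Element-wise sum, then divide by the number of samples.
    let mut mean = vec![0.0f32; dim];
    for e in embeddings {
        for (m, x) in mean.iter_mut().zip(e) {
            *m += *x;
        }
    }
    let n = embeddings.len() as f32;
    for m in mean.iter_mut() {
        *m /= n;
    }

    // L2-normalize so a cosine comparison depends only on direction.
    let norm = mean.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return None;
    }
    for m in mean.iter_mut() {
        *m /= norm;
    }
    Some(mean)
}
```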

Voice Conversion

  • Real-time voice conversion (voice changer)
  • RVC-style retrieval-based conversion

Voice Isolation

  • Separate voice from background noise
  • Remove music while preserving speech
  • demucs/spleeter-style source separation

API Surface

pub mod embedding {
    pub struct SpeakerEmbedding { pub vector: Vec<f32>, pub model_type: EmbeddingModel }
    pub enum EmbeddingModel { DVector, XVector, ECAPA, Resemblyzer }
    pub fn extract_embedding(samples: &[f32], model: &SpeakerEncoder) -> Result<SpeakerEmbedding>;
    pub fn speaker_similarity(a: &SpeakerEmbedding, b: &SpeakerEmbedding) -> f32;
}

pub mod style {
    pub fn transfer_style(content: &[f32], style: &SpeakerEmbedding, model: &StyleTransferModel) -> Result<Vec<f32>>;
}

pub mod clone {
    pub struct VoiceClone { pub embedding: SpeakerEmbedding, pub style_params: StyleParams }
    pub fn create_clone(reference_samples: &[&[f32]], model: &VoiceCloningModel) -> Result<VoiceClone>;
}

pub mod conversion {
    pub fn convert_voice(source: &[f32], target: &SpeakerEmbedding, model: &VoiceConversionModel) -> Result<Vec<f32>>;
}

pub mod isolation {
    pub fn isolate_voice(audio: &[f32], model: &IsolationModel) -> Result<Vec<f32>>;
    pub fn remove_music(audio: &[f32], model: &IsolationModel) -> Result<Vec<f32>>;
}

Dependencies

  • aprender::audio::mel - mel spectrogram extraction
  • aprender::audio::resample - sample rate conversion
  • trueno - SIMD tensor operations

Implementation Priority

  1. Medium: embedding.rs - foundation for all voice features
  2. Low: style.rs - voice style transfer
  3. Low: clone.rs - voice cloning
  4. Low: conversion.rs - voice conversion
  5. Low: isolation.rs - voice isolation

Labels

enhancement, voice, ml
