
ADVANCED Audio: Voice Cloning, Transcription, Speech-to-Speech, Diarization and Turn Detection

Features:

  • Detailed fine-tuning and inference scripts and guides for audio models.
  • Scripts pair with videos on the Trelis YouTube channel.
  • Save substantial time when fine-tuning and running inference on audio models.

Repo Content:

  • Speech-to-Text Transcription:
    • Fine-tuning and Serving Whisper Turbo (now with Unsloth!):
      • Transcription of audio files (mp3, wav, m4a) and YouTube audio
      • Fine-tuning, incl. dataset preparation and automated cleaning (useful for adding support for uncommon/new words or phrases, improving performance on specific accents, improving performance on specific languages)
      • Server Setup using a Faster Whisper Server
    • Serving Moshi Speech to Text Streaming Service
  • Text-to-Speech / Voice Cloning:
    • Advanced data cleaning and chunking techniques!
    • Kokoro Server Setup (and local setup)
    • Fine-tuning, Inferencing, Voice Cloning of CSM-1B
    • Fine-tuning, Inferencing, Voice Cloning and *vLLM Serving* of Orpheus
    • Fine-tuning StyleTTS2 [Purchase individually here]
    • Create realistic voice-overs via voice cloning.
    • Replicate difficult accents via fine-tuning.
    • PLUS: run inference with Melo TTS on a remote GPU (with TCP or low-latency UDP ports).
  • Moshi – Full Duplex Audio Input + Output Model:
    • Inference scripts for Mac M2/3/4 or on GPU (see “omni” branch)
    • Data prep and fine-tuning scripts – coming soon
  • Qwen2.5 Omni – Video, Image, Audio, Text Input & Audio + Text Output
    • Inference Scripts
    • Fine-tuning scripts not included (although available from LlamaFactory)
  • Diarization and Turn-Detection (for real-time voice) [Buy only these scripts]
  • Multi-modal Audio + Text Models (Qwen 2 Audio):
    • Dataset preparation and fine-tuning
    • Continuously batched inference using vLLM
  • Speech-to-Speech
    • Run HuggingFace’s speech-to-speech model on Mac
    • Run speech-to-speech on a remote GPU
  • Text-to-Speech / Voice Cloning [Unsloth now recommended, see first bullet above]

Purchase Options

Audio Repo Access (Join 277+ members)

LIFETIME access, plus support via GitHub Issues, for one individual:

For TEAM access (i.e. repo used by multiple engineers), kindly post a comment below. It will be submitted privately and I’ll respond by email.

Trelis Multi-Repo Bundle

LIFETIME access to all SEVEN Trelis repos (Voice, Vision, Fine-tuning (LLMs), Inference, Evals, Time-Series, Robotics) + support via GitHub Issues PLUS Trelis Discord access for one individual.

Want lifetime access for a team? Purchase lifetime access here.


Testimonials from Youtube Comments

“Awesome content. Thank you! Purchased your gated repo and learning loads from it.”

“Thank you! It’s a great investment :)”

“I had Whisper&CoLab running a few months ago, but it broke. Your video and notebooks showed me why, and taught me several new tricks! Keep it up please.”


Repo Screenshots


Frequently Asked Questions

Q: Can I get a formal invoice with my tax number so I can reimburse this expense?

Yes, purchases come with an invoice and receipt of payment. You can include your tax/VAT number (not applicable for personal purchases or those using the LEARN discount). If you require modifications to the invoice, just respond to your email receipt.

Q: Is support included in my repo purchase?

Yes, if you purchase repo access, you will be able to post Issues in the corresponding Github repo.

Q: Can I buy individual scripts for one video?

Yes, if you only wish to purchase scripts for one video, you can post a comment and I will add a purchase link. Purchases of individual scripts do not provide GitHub repo access or the ability to post issues to get support.

Q: Is there a student discount available?

Yes. Use LEARN2025 at checkout if EITHER of the following applies:
a) You are currently a registered student at an academic institution.
b) You will use these materials only for non-commercial purposes.

After your purchase, reply to your receipt with a quick note indicating whether a or b applies. In the case of a, kindly provide a link to your LinkedIn page. Purchases using a company email/tax number/Github are not eligible for this discount. This discount applies to repos or the repo bundle, but not to individual scripts.


Video Tutorials

Whisper Data Preparation and Fine-tuning with Unsloth

Kokoro Text to Speech Server

Streaming Text to Speech Models

Professional Quality Voice Cloning – Open Source vs ElevenLabs

Voice Detection, Turn Detection and Diarization

Multi modal Audio + Text Fine tuning and Inference with Qwen

Whisper Turbo Fine-tuning and API Setup (Speech to Text)

Speech to Speech (HuggingFace approach)

Text to Speech (Fine-tuning StyleTTS2)

Speech to Text (Fine-tuning Whisper)

61 thoughts on “ADVANCED Audio: Voice Cloning, Transcription, Speech-to-Speech, Diarization and Turn Detection”

  1. Hi my friend!

    My name is Pedro Henrique, and I am a Brazil-based radiologist interested in enhancing the efficiency of imaging diagnostics through advanced technology. I came across your work with Whisper and became excited about the possibility of applying a customized model in my daily practice. The aim is to use automated transcription to improve the documentation of diagnoses without any commercial intent.

    I possess basic to intermediate knowledge in Python and JavaScript, primarily used to develop personal scripts that facilitate my professional activities. Your expertise in customizing Whisper, especially in fine-tuning, could be precisely what I need to integrate this innovation into my work.

    Could you provide details about how your product can be adapted to a radiologist’s needs? Given my limited experience with fine tuning, is it feasible for me to apply your customized scripts and achieve an effective model for transcription in radiology?

    Furthermore, I am interested in learning more about post-purchase support. How are inquiries and technical assistance managed after buying the product? This information will be crucial to ensure a smooth and efficient transition to using this technology in my work environment.

    I am confident that your solution could represent a significant advancement in my medical practice. I look forward to your response and am available to discuss any additional details.

    Sincerely,

    Pedro Henrique
    Radiologist

    1. Hi Pedro,

      I suppose you could fine-tune a Whisper model with a recording of some key radiologist terminology. You would need to record yourself (or get a recording) and then follow the YouTube video demo along with the scripts in the repository.

      If you have issues after purchase, you can post an “Issue” in the GitHub repository – I typically reply back within a few days with my suggestions.

    2. Hello, I have a question: after I fine-tune Whisper and upload it to a Hugging Face model card, can I host the output model files on Azure so that I consume my Microsoft credits directly?

      1. In principle, if you have a rented GPU on Azure, you can run the model, yes.

        The problem with Azure is often securing access to a model.

        You also need to set up the server ports if you want to expose an API, and you will need to look at the Azure docs for that.

  2. Hi there,
    I don’t have enough money to purchase your repo; I am a student and keen to learn fine-tuning STT models.

    Kindly make an exception for me, please🥺🥺

    1. Howdy, my recommendation is to go through the youtube.com/@trelisresearch channel. By following along and using the free materials I link, you’ll be able to learn. This will take more time than purchasing the repo, but you’ll get a deeper understanding and that will be great for you as a student. I’m wishing you the best!

  3. I have a few questions about speech to text, please help me answer them.
    1. I used the original Whisper for the Azerbaijani language, but in most cases it does not transcribe some words correctly. Can it be improved by fine-tuning with the code in your repository?
    2. If I buy the repository, can I easily develop it on my laptop using my own audio recordings and your code? I am new to this field and not fully informed, which is why I have such questions.

      1. NVIDIA GeForce RTX 4060 Laptop GPU (driver version 528.66), 8 GB VRAM, 1 TB SSD, 32 GB memory. Will it work? Can you show me how to produce the transcript in .vtt format from the audio? I am willing to buy it if I find some flaws in your video. Maybe my questions are very simple; I ask these questions because I don’t know much.

  4. Hi, I purchased the script and had a few questions on the documentation. For my native language, I used the fine-tuning model “openai/whisper-small” as you show in your script, but it didn’t work as you said. At the end of the run, the validation transcription is very messy and the accuracy is low. What else can you recommend?
    I tested 5, 6 and 7 epochs, but the results are not good: WER is over 0.70.

    1. My apologies, I thought I had responded earlier.

      The best place to post is in the GitHub repo, since you purchased access. You can create an issue there and I tend to respond quickly.
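      For reference, the WER figure mentioned above is the word-level edit distance between the reference and the hypothesis transcript, divided by the number of reference words. A minimal self-contained sketch (libraries such as jiwer are the usual choice in practice; this is not the repo’s evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

      A WER above 0.70 means roughly 70 errors per 100 reference words, which often points to data preparation or tokenisation problems rather than to the number of epochs.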

  5. Hey, please, I have not received a response yet. Should I fine-tune “openai/whisper-medium” even though it demands more GPU, or should I train the “openai/whisper-small” model?

  6. Hi Ronan, great job on the repo! Two questions:
    i) What’s the difference between this repo and the free one you mentioned on Hugging Face: https://huggingface.co/blog/fine-tune-whisper
    ii) I watched your demo on YouTube where you used a single file to fine-tune. How would I change this to feed in an entire folder of 10 hours (or eventually 100s of hours) of audio files with their respective transcriptions?

    1. Howdy:
      i) This repo includes more detail and options around specifying the LoRA, but the biggest difference is that I have put together scripts (shown in the video) for data preparation.
      ii) The script splits audio (and transcripts) into very short chunks for training. You would adjust the code to loop through this for each of your input files.
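      To make (ii) concrete, here is a hedged sketch of that loop – the audio/transcripts directory layout and the match-by-filename-stem convention are assumptions for illustration, not the repo’s actual structure:

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".m4a"}

def pair_files(audio_dir: str, transcript_dir: str):
    """Pair each audio file with the transcript sharing its filename stem.

    Returns a sorted list of (audio_path, transcript_path) tuples and
    silently skips audio files that have no matching .txt transcript.
    """
    audio_dir, transcript_dir = Path(audio_dir), Path(transcript_dir)
    pairs = []
    for audio in sorted(audio_dir.iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue  # ignore non-audio files
        transcript = transcript_dir / (audio.stem + ".txt")
        if transcript.exists():
            pairs.append((audio, transcript))
    return pairs
```

      Each pair can then be fed through the same chunking step that the single-file demo uses.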

  7. Hello, just watched your Text to Speech Fine-tuning Tutorial and I have a few questions.

    1. Does Purchase Transcription Repo Access include both StyleTTS2 and Fine-tuning Whisper?
    2. Do you use the ADVANCED-inference, ADVANCED-fine-tuning and ADVANCED-vision repositories in the video I mentioned? If not, which video showcases these repos?
    3. I have a custom transcription dataset I want to fine-tune on (TTS), but the audio files are in 0.8 s to 10 s chunks. Can I use the Colab notebooks (dataset creation and fine-tuning) as-is, or do I have to make many modifications to the code to make it work?
    4. I am guessing the fine-tuning works on any language for StyleTTS2 and Whisper?

    1. Howdy!
      1. Yes!
      2. No. You can navigate to those pages from the Trelis.com home page and then scroll down to see the related YouTube videos.
      3. Are the chunks related or all independent? If they are related, it would be best to combine them into one piece of audio and then place that in the audio folder – you can write a script to do that with ChatGPT or find a free website online. If they are independent, then you can just add them all to the audio folder (if in wav format). If the segmentation script that I showed doesn’t work, you can run a commented-out, simpler script in the notebook that will just do one segment per input file row.
      4. Works for main languages, probably not so well for minority languages.
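      If the related chunks are uncompressed .wav files, the Python standard library’s wave module is enough to combine them – a sketch, assuming every chunk shares the same channel count, sample width and sample rate:

```python
import wave

def concat_wavs(paths, out_path):
    """Concatenate several .wav files that share identical audio parameters."""
    with wave.open(str(out_path), "wb") as out:
        params = None
        for path in paths:
            with wave.open(str(path), "rb") as src:
                if params is None:
                    # First file defines channels, sample width and rate.
                    params = src.getparams()
                    out.setparams(params)
                elif src.getparams()[:3] != params[:3]:
                    raise ValueError(f"{path}: sample format mismatch")
                out.writeframes(src.readframes(src.getnframes()))
```

      The concatenated file can then be dropped into the audio folder like any other input.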

  8. Hi,
    I am working on text-to-speech for converting audio files into different languages with the original speaker’s voice. I have the needed transcripts.
    Can it be useful? How do I approach it? Thank you

    1. I recommend trying things out with no fine-tuning, exactly as you want the text to speech to work. This gives a baseline. You can then try one language at a time, fine-tuning with your dataset. Note that it can be difficult to add new language capabilities and probably requires significant amounts of data. Perhaps 10k – 100k hours worth of audio. That is not an easy task.

  9. Hi,

    My team and I have developed an app called Storyscroll that helps children improve their phonics skills. Through the teacher portal, educators can select specific words for students to focus on, and the child then reads a story to the app. The app Storyscroll is designed to detect phonics-related errors and provide this feedback to both teachers and parents. It is currently available on the App Store.

    Our developer on Fiverr is uncertain about fine-tuning the model to detect phonics-specific errors. While the app can identify basic word recognition errors, it currently lacks the ability to pinpoint phonics skills that need improvement.

    After watching your video on YouTube, I thought you might be able to offer some guidance or assistance.

    1. Howdy, this isn’t trivial to do, but you could potentially start with a model like Whisper and do some fine-tuning with a classification head applied on top of the model. You would need to try classifying good and bad examples of speech. Potentially, you could use related-but-wrong words to generate the incorrect sound/voice.

      Just a few tips, hope that helps you get a good direction.

  10. Hello.
    I am interested in buying the Transcription Repo Access. I want to fine-tune Whisper in order to transcribe new words. I already tried several things, like this one:
    https://github.com/vasistalodagala/whisper-finetune
    I could never make it work; there was always something wrong.
    Things like:
    “ERROR: Could not find a version that satisfies the requirement pkg_resources==0.0.0 (from versions: none)”

    Each time I found suggested ways to solve the problems, but in the end I couldn’t solve them all. Sometimes no solution worked.
    Can you please confirm that your project installs easily? Does it come with an installer, or do I need to take many steps that could all fail?
    What would happen if I can’t make it work on my computer?
    (I have an RTX 4090 with 24 GB of VRAM, a Core i7-13700 and 128 GB of DDR4 under Windows 11.)

    1. Howdy, yes installation should work on your GPU. If you buy this repo and have an issue you can post in Issues in the repo and I’ll help there. Worst case if the script doesn’t run I’m happy to refund.

    1. Howdy! You’ll be charged the full amount because the annual subscription provides access to the full archive of content plus new content for a period of one year. If you only need specific scripts, you can purchase them individually OR post here and I’ll create a payment link if there isn’t yet a specific script for a video.

  11. Hi Ronan,
    thank you for the very descriptive video. I would like to buy access but have one question concerning my use case. I want to record speech from the user to enter entries in an agriculture database, like “I harrowed 2 hours the field ‘woolmer lodge'”, and automatically enter duration: 2h, field: woolmer lodge, work type: harrow. I implemented something with an older VOSK implementation (a purely statistical model) which I fine-tuned with work-type names and field names. I do post-processing in Postgres with pg_trgm, but overall the result is poor. I want to switch to Whisper. Would I use the same approach as you showed in your video and do post-processing? The tricky bit is the field names, which change all the time and are weird word constructs. Thanks for your help, Volkmar

    1. Howdy Volkmar, yes, moving to Whisper is probably a good idea. For those tricky fields, you may want to do fine-tuning, as those may be hard to get right. As a first pass, you may find that raw Whisper with some find-and-replace works. BTW, it’s probably best to transcribe and then have an LLM organise the text into structured output.
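      As a toy illustration of that transcribe-then-structure idea (the regex patterns and the WORK_TYPES list are invented for this example – a real pipeline would hand the transcript to an LLM with a JSON schema):

```python
import re

# Invented vocabulary for the example; not a real work-type list.
WORK_TYPES = ["harrowed", "ploughed", "seeded"]

def parse_entry(text: str) -> dict:
    """Pull duration, field name and work type out of a field-work sentence."""
    duration = re.search(r"(\d+(?:\.\d+)?)\s*hours?", text)
    field = re.search(r"field\s+'([^']+)'", text)
    work = next((w for w in WORK_TYPES if w in text.lower()), None)
    return {
        "duration_h": float(duration.group(1)) if duration else None,
        "field": field.group(1) if field else None,
        "work_type": work,
    }
```

      Running parse_entry("I harrowed 2 hours the field 'woolmer lodge'") yields the duration, field and work type – but regexes break on ever-changing field names, which is exactly why the LLM step is the more robust route.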

      1. Ok good, one more question: for the LLM to organise into structured output, would I need access to the other repos to quick-start? And btw, the annual subscription shows: “Advanced Transcription + Voice-cloning Repo – Annual Subscription abonnieren 74,99 € pro Monat”, which is per month. The other repos show per year as expected. Maybe it’s just because of the German translation?

        1. Thanks Volkmar, I’ve fixed the link now so that that is annual. That was my mistake.

          For structured outputs, there are some scripts in ADVANCED-inference, yes – and there is a video on youtube. Although you may be able to figure out things yourself.

  12. Hi, I just want the script for Speech detection and end of turn detection. Can you please send me the cost and the link to buy

  13. Hi Ronan,
    thank you for your excellent work. I’m about to start Whisper fine-tuning for all the tasks mentioned in your video, namely adding new vocabulary, improving performance on accents and on uncommon languages. Concerning my use case, the audio files I wish to transcribe will involve jargon-heavy conversation (aviation domain). In this context, speakers often mix two languages together: there are many sentences in which a topic is discussed in Italian with some extra English terms inside.
    How can this mixture of jargon, Italian and English be handled with fine-tuning?

    1. Hi Nicola, if you have repo access, it’s best to post an issue there to get support. The short answer is that whisper transcribes each section starting with a language token. This will throw things off if speech is mixed. So you may need to train a new token, or else just keep the same English token for your data, even if your data is mixed.
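      For background, Whisper primes its decoder with special tokens – <|startoftranscript|>, a language token such as <|en|> or <|it|>, then a task token – which is why mixed-language audio throws it off. A sketch of the prompt’s string form (in practice you would build token ids with the tokenizer, e.g. get_decoder_prompt_ids in transformers):

```python
def decoder_prompt(language: str = "en", task: str = "transcribe") -> str:
    """String form of the special tokens that prime Whisper's decoder."""
    return f"<|startoftranscript|><|{language}|><|{task}|>"
```

      Pinning language="en" for all of your mixed Italian/English data corresponds to the “keep the same English token” option mentioned above.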

  14. Hi Ronan, I am working as an IT consultant and trying to upskill my Gen AI skillset. I would like to purchase the repo to upgrade my core LLM skills so that I can get a new job as an LLM data scientist. Can you please let me know if I would be eligible for a discount?

  15. Hi Ronan,

    I am thinking of purchasing your ADVANCED-transcription Repo. I just have a few questions before I make the purchase.

    1. Is the pipeline built in a modular fashion (e.g. each stage as a separate function/class), so that I can insert my own cleaning/normalization component?

    2. Do you support configuring external post-processing tools (spaCy, LanguageTool, custom term-correction lists) via a config file or plugin system, or would I need to modify the core code directly?

    3. Are there any built-in abstractions for loading extra Python packages as pipeline steps, and how do you recommend managing those dependencies?

    4. If I add heavier post-processing (e.g. rule-based corrections or speaker labels), is there a recommended way to batch or parallelize those steps within your serving framework?

    5. How does the repo handle failures in downstream stages? Is there a standardized way to catch/log errors from custom post-processing so they show up in your dashboards or logs?

    6. When you release updates (e.g. newer Whisper models or bugfixes), what’s the recommended workflow to merge those without losing my custom pipeline additions?

    1. Howdy Shane, the short answer is that the scripts are as they are shown in the videos (which you can find at the bottom of this page here). I recommend at least quickly going through the most relevant videos for you so you know what to expect.

      1. Basically there are data prep scripts and then fine-tuning scripts/notebooks.
      2. Modify directly
      3. Nothing in-built like that
      4. That’s a very broad question; take a look at a video and then ask me if you have Qs, because there are a lot of applications in this repo.
      5. Very broad question – too broad to answer without you being precise about which video you’re referring to.
      6. For new releases, I will typically create a new folder for those models. If there’s a bug, I’ll just fix it and often archive the old script.

    1. Yes, you can. Start off by seeing how good performance is with no fine-tuning. Test also its performance in Hindi only. These tests will give you a sense of how good it might be when fine-tuned.

  16. Hi,
    Is the ‘Trelis_StyleTTS2_Finetune’ module available for separate purchase? If so, what’s the cost? I can’t afford the full subscription.

  17. Hello Ronan! I’m an LLM engineer, with only basic knowledge of NLP and some fundamental AI concepts.
    Is the content of the audio pack still suitable for me to learn from directly? Do I need to specifically study audio basics first?

  18. Hello,
    I want to purchase the individual fine-tuning code for StyleTTS2. What code will you provide with it? Will you provide the dataset_curation.ipynb file too?

  19. Hi,

    First, thank you for your insightful videos. I recently watched Run Speech‑to‑Speech Models on Mac or GPU and I’m very interested in purchasing your product to explore it further. However, before I proceed, I’d like to know: how can I enable or support the Arabic language? I read in the Hugging Face documentation that only a few languages are currently supported.

    1. Thanks! You’ll need to just run the model with Arabic on some quick tests and see how good it is. That’s really the only way to get a feel for it. Try the Hugging Face speech-to-speech repo for some quick tests before buying this, if you need to.

  20. Hello, I would like to buy a repo that makes it easy to fine-tune Orpheus and includes a guide on how to prepare the dataset.
    For the past week I’ve been bashing my head against the wall trying to figure out how to fine-tune Orpheus. It would be really nice if you had something that just lets me input my audio files and transcripts, and then my fine-tuned Orpheus would be ready without any extra headaches.
    Thanks

  21. Hello,
    By any chance, are you working on fine-tuning speech-to-speech models like liquid fm, mini omni or Moshi? If so, when can we expect that, and will it be included in this repo, or will you build another paid repo for it?
