A machine learning project by PInT (Public Interest in Tech) at Olin College of Engineering focused on improving automatic speech recognition (ASR) systems for people who stutter by fine-tuning state-of-the-art models and applying specialized preprocessing techniques.
This project addresses the challenge of accurately recognizing speech from people who stutter (PWS). Traditional ASR systems often struggle with stuttered speech due to disfluencies such as repetitions, prolongations, and blocks.
Our goal is to improve accessibility by fine-tuning state-of-the-art ASR models (Whisper, Wav2Vec, etc.) on stuttered speech datasets, applying specialized preprocessing and evaluation tailored to this population.
Stuttering affects approximately 70 million people worldwide, yet most commercial ASR systems are not optimized for recognizing stuttered speech patterns. This creates accessibility barriers for people who stutter when interacting with voice-controlled technologies.
Common stuttering disfluencies:
- Repetitions – repeating sounds, syllables, or words
- Prolongations – stretching out sounds
- Blocks – involuntary pauses or inability to produce sounds
Key features:
- Specialized preprocessing for stuttered speech audio
- Model architectures adapted for disfluency patterns
- Data augmentation techniques for limited stuttered datasets (see the sketch after this list)
- Evaluation metrics tailored for stuttered speech recognition (e.g., WER analysis)
- Comparative analysis against standard ASR models
- Support for multiple languages (English and Mandarin)
- Fine-tuning implementations for Whisper and Wav2Vec models
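On the augmentation point above: stuttered-speech corpora are small, so augmentation helps stretch them. Below is a minimal sketch of one standard technique, speed perturbation via librosa (already a project dependency); the file path and rates are illustrative, not the project's exact recipe.

```python
# Illustrative speed-perturbation augmentation for scarce stuttered-speech data.
# The input path and rates are assumptions, not the project's exact recipe.
import librosa
import soundfile as sf

audio, sr = librosa.load("examples/sample.wav", sr=16000)  # hypothetical file

for rate in (0.9, 1.1):  # slightly slower and faster copies of the utterance
    stretched = librosa.effects.time_stretch(audio, rate=rate)
    sf.write(f"examples/sample_x{rate}.wav", stretched, sr)
```

Note that tempo changes interact with timing-based disfluencies such as prolongations and blocks, so augmentations like this should be validated on stuttered test data rather than applied blindly.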
Repository structure:

```
.
├── FineTuneWhisper1.ipynb                    # Initial Whisper fine-tuning
├── FineTune_Whisper_English.ipynb            # English Whisper fine-tuning
├── FineTune_Whisper_Mandarin.ipynb           # Mandarin Whisper fine-tuning
├── FineTune_Wav2Vec_English.ipynb            # English Wav2Vec fine-tuning
├── FineTune_Wav2Vec_Mandarin.ipynb           # Mandarin Wav2Vec fine-tuning
├── multi_lingual_speech_recognition.ipynb    # Multilingual ASR modeling
├── wer.ipynb                                 # Word Error Rate analysis
├── SplitSet.ipynb                            # Dataset splitting utilities
├── processStutterGT.py                       # Data preprocessing script
├── requirements.txt                          # Python dependencies
├── train_data.csv                            # Training dataset
├── test_data.csv                             # Test dataset
├── examples/                                 # Example audio files and usage
├── analysis/                                 # Analysis results and visualizations
├── asr_processing/                           # ASR processing utilities
├── asr_processing_test/                      # Testing utilities
├── libristutter_result/                      # Results on LibriStutter dataset
├── librispeech_result/                       # Results on LibriSpeech dataset
├── StutterGTData/                            # StutterGT dataset files
├── wer_scores_csv/                           # WER evaluation results
└── [various merged/filtered CSV files]       # Processed datasets
```
Prerequisites:
- Python 3.8+
- Jupyter Notebook / JupyterLab
- CUDA-compatible GPU (recommended for training; Google Colab was used for this project)
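To confirm PyTorch can see a GPU before training (e.g. on a Colab GPU runtime):

```python
import torch

# Prints True when a CUDA GPU is available to PyTorch.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```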
Installation:
- Clone the repository:

```bash
git clone https://github.com/dongim04/stuttered-speech-asr.git
cd stuttered-speech-asr
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Preprocess your stuttered speech dataset using the provided utilities:
```bash
# Using the preprocessing script
python processStutterGT.py

# Or using the Jupyter notebook
jupyter notebook SplitSet.ipynb
```
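The exact preprocessing logic lives in `processStutterGT.py`; the one step any such pipeline needs is conversion to 16 kHz mono, the input format both Whisper and Wav2Vec 2.0 expect. A minimal sketch, assuming a CSV with an `audio_path` column (the repo's actual column names may differ):

```python
# Minimal resampling sketch; column names and output naming are assumptions.
import pandas as pd
import librosa
import soundfile as sf

df = pd.read_csv("train_data.csv")  # assumed to hold an "audio_path" column

for path in df["audio_path"]:
    audio, sr = librosa.load(path, sr=16000, mono=True)  # decode + resample
    sf.write(path.replace(".wav", "_16k.wav"), audio, sr)
```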
Fine-tune Whisper models:
- General: `FineTuneWhisper1.ipynb`
- English: `FineTune_Whisper_English.ipynb`
- Mandarin: `FineTune_Whisper_Mandarin.ipynb`

Fine-tune Wav2Vec models:
- English: `FineTune_Wav2Vec_English.ipynb`
- Mandarin: `FineTune_Wav2Vec_Mandarin.ipynb`
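The Whisper notebooks follow the standard Hugging Face `Seq2SeqTrainer` recipe; below is a compressed sketch of the core setup, where the checkpoint size, hyperparameters, and dataset fields are illustrative rather than the notebooks' exact choices.

```python
# Hedged sketch of Whisper fine-tuning; checkpoint and hyperparameters are
# illustrative, not necessarily what the notebooks use.
from transformers import (WhisperForConditionalGeneration, WhisperProcessor,
                          Seq2SeqTrainingArguments)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare(batch):
    # Log-Mel features from 16 kHz audio; token ids from the transcript.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcript"]).input_ids
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-stutter-en",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=2000,
    fp16=True,
)
# A Seq2SeqTrainer would then be built from model, args, the mapped dataset,
# and a data collator that pads input_features and labels per batch.
```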
For cross-lingual experiments and multilingual model training:
```bash
jupyter notebook multi_lingual_speech_recognition.ipynb
```

Analyze model performance using Word Error Rate (WER) metrics:

```bash
jupyter notebook wer.ipynb
```

Results are automatically saved in:
- `wer_scores_csv/` – detailed WER analysis
- `libristutter_result/` – results on the LibriStutter dataset
- `librispeech_result/` – results on the LibriSpeech dataset
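`wer.ipynb` computes these scores with `jiwer`. A toy example (not from the repo) showing why verbatim disfluencies inflate WER even when every content word is recognized correctly:

```python
import jiwer

reference  = "please call stella and ask her to bring these things"
hypothesis = "please please c- call stella and ask her to bring these things"

# Two inserted tokens against ten reference words -> WER = 0.2,
# even though the content is fully recognized.
print(jiwer.wer(reference, hypothesis))
```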
A typical end-to-end workflow:
- Preprocess your data:

```bash
python processStutterGT.py
```

- Fine-tune a Whisper model for English:

```bash
jupyter notebook FineTune_Whisper_English.ipynb
```

- Evaluate model performance:

```bash
jupyter notebook wer.ipynb
```

The notebooks are designed to work with various stuttered speech datasets; modify the data loading sections in the notebooks to match your specific dataset format.
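When adapting the data-loading cells, one common pattern is to build a Hugging Face `Dataset` straight from a CSV of paths and transcripts; the column names below are assumptions about your layout:

```python
# Sketch of loading a custom (audio_path, transcript) CSV into a Dataset.
# Column names are assumptions; match them to your own files.
import pandas as pd
from datasets import Audio, Dataset

df = pd.read_csv("train_data.csv")
ds = Dataset.from_pandas(df)
# Decode and resample each file to 16 kHz lazily, on access.
ds = ds.cast_column("audio_path", Audio(sampling_rate=16000))
print(ds[0]["audio_path"]["array"].shape)
```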
See `requirements.txt` for the full dependency list. Key packages include:
- `torch` – PyTorch deep learning framework
- `transformers` – Hugging Face Transformers library
- `jiwer` – Word Error Rate calculation
- `pandas`, `numpy` – data manipulation
- `librosa`, `soundfile` – audio processing
- `matplotlib`, `seaborn` – visualization
- `datasets` – Hugging Face Datasets library
- `accelerate` – training acceleration
This project works with several stuttered speech datasets:
- LibriStutter - Stuttered version of LibriSpeech
- StutterGT - Ground truth stuttered speech dataset
- LibriSpeech - Used for comparison with fluent speech
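LibriStutter and StutterGT generally have to be obtained and unpacked manually, while the LibriSpeech comparison data can also be pulled from the Hugging Face hub, for example:

```python
from datasets import load_dataset

# Stream test-clean to avoid downloading the full corpus
# (newer datasets versions may additionally need trust_remote_code=True).
libri = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
sample = next(iter(libri))
print(sample["text"])
```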
Note: Ensure you have proper permissions and follow ethical guidelines when working with speech data.
The project includes comprehensive evaluation across multiple models and languages. Results are stored in:
- `wer_scores_csv/` – detailed WER comparisons
- `analysis/` – performance analysis and visualizations
- `*_result/` directories – model-specific results
Supported models:
- Whisper (OpenAI)
  - Multilingual support
  - Various model sizes (tiny, base, small, medium, large)
  - Robust to background noise
- Wav2Vec 2.0 (Meta)
  - Self-supervised learning approach
  - Strong performance on limited data
  - Language-specific fine-tuning (see the decoding sketch after this list)
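The two families also decode differently: Whisper generates text autoregressively with an encoder-decoder, while Wav2Vec 2.0 emits per-frame logits decoded greedily under CTC. A minimal CTC inference sketch using a public English checkpoint as a stand-in for the project's fine-tuned weights (the audio path is hypothetical):

```python
# Greedy CTC decoding with Wav2Vec 2.0; checkpoint and path are placeholders.
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, _ = librosa.load("examples/sample.wav", sr=16000)  # hypothetical file

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocab)
pred_ids = torch.argmax(logits, dim=-1)          # greedy CTC path
print(processor.batch_decode(pred_ids)[0])
```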
Contributions, improvements, and feedback are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Make your changes
- Test thoroughly
- Submit a pull request
Areas where contributions are especially welcome:
- Additional model architectures
- New evaluation metrics
- Dataset preprocessing improvements
- Documentation enhancements
- Performance optimizations
This project is developed with respect for the stuttering community:
- Focuses on improving accessibility rather than "correcting" speech
- Maintains privacy and consent for all data usage
- Avoids perpetuating negative stereotypes about stuttering
- Emphasizes empowerment and inclusion
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
```bibtex
@misc{stuttered-speech-asr,
  author    = {dongim04},
  title     = {Stuttered Speech ASR: Fine-tuning ASR Models for People Who Stutter},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/dongim04/stuttered-speech-asr}
}
```

Acknowledgments:
- The stuttering research community for their valuable insights
- OpenAI for the Whisper model
- Meta for the Wav2Vec model
- Hugging Face for the transformers library
- Dataset creators and contributors
For questions, suggestions, or collaboration opportunities, please open an issue on GitHub.
This project aims to make voice technology more accessible for people who stutter while respecting the diversity and dignity of the stuttering community.