bongovaad is a Python package for transcribing Bengali audio from YouTube videos. It uses the Hugging Face Inference API with Whisper models for high-quality speech recognition.
- Cloud-based Transcription: Uses Hugging Face Inference API with state-of-the-art Whisper models.
- SRT Subtitle Generation: Automatically creates SRT subtitle files for video players.
- Efficient Audio Processing: Handles audio segmentation for longer videos with progress tracking.
- Concurrent Processing: Uses asynchronous requests for faster transcription of multiple segments.
- Temporary File Management: Uses temporary directories for clean processing.
- Robust Error Handling: Comprehensive error handling and logging.
- Command-line Interface: Easy-to-use CLI with multiple options.
- Python 3.8 or higher
- Hugging Face API key
- ffmpeg
Ubuntu/Debian:
sudo apt install ffmpeg -ymacOS (using Homebrew):
brew install ffmpegWindows (using Chocolatey):
choco install ffmpegpip install bongovaad- Create an account on Hugging Face
- Generate an API key at https://huggingface.co/settings/tokens
- Set it as an environment variable:
export HF_API_KEY="your_api_key_here"bongovaad --url "https://www.youtube.com/watch?v=VIDEO_ID"bongovaad --url "https://www.youtube.com/watch?v=VIDEO_ID" \
--segment-length 10 \
--output-format both \
--model-id "openai/whisper-large-v3-turbo" \
--api-key "your_api_key_here" \
--verbose| Argument | Description | Default |
|---|---|---|
--url |
YouTube URL for audio transcription | (Required) |
--segment-length |
Length of each audio segment in seconds | 8 |
--output-format |
Output format (txt, srt, or both) | both |
--api-key |
Hugging Face API key | HF_API_KEY env var |
--model-id |
Model ID to use for transcription | openai/whisper-large-v3-turbo |
--verbose |
Enable verbose logging | False |
The tool generates two types of output files:
- Text File (
VIDEO_ID.txt): Contains the full transcription text. - SRT File (
VIDEO_ID.srt): Contains time-coded subtitles compatible with video players.
You can also use bongovaad as a Python library:
import os
from bongovaad import BongoVaadTranscriber
# Get API key from environment variable or set it directly
api_key = os.environ.get("HF_API_KEY", "your_api_key_here")
# Initialize the transcriber
transcriber = BongoVaadTranscriber(
api_key=api_key,
model_id="openai/whisper-large-v3-turbo"
)
# Transcribe a YouTube video
output_files = transcriber.transcribe(
url="https://www.youtube.com/watch?v=VIDEO_ID",
segment_length_seconds=10,
output_format="both"
)
# Print output file paths
print(f"Text file: {output_files['txt']}")
print(f"SRT file: {output_files['srt']}")- Processing time depends on video length, internet connection, and Hugging Face API response times.
- Longer segment lengths may improve speed but could reduce accuracy for complex audio.
- The API has rate limits, so be mindful of how many requests you make.
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Here's how you can contribute:
- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Commit your changes:
git commit -am 'Add some feature' - Push to the branch:
git push origin feature-name - Submit a pull request
- Hugging Face for providing the Inference API
- Whisper for the state-of-the-art ASR model
- OpenAI for developing the Whisper model