A free, open-source YouTube video summarizer that runs locally on your machine. Automatically detects and utilizes your available hardware (CPU, NVIDIA GPU, or Apple Silicon MPS). Feed it a YouTube link and get a written summary generated locally.
```
YouTube URL
     │
     ▼
┌─────────────────────────────────────────────────┐
│ Gradio Web UI (port 7860)                       │
│ - Takes YouTube URL + language from the user    │
│ - Calls the API with stream=true                │
│ - Renders tokens progressively as they arrive   │
└───────────────────┬─────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ FastAPI Backend (port 8000)                     │
│                                                 │
│ 1. Fetch transcript via youtube-transcript-api  │
│ 2. Compress long transcripts (extractive NLP)   │
│ 3. Split into chunks (≤4000 chars each)         │
│ 4. Run each chunk through the LLM               │
│ 5. Stream tokens back as NDJSON events          │
│ 6. Parse hashtags generated by the LLM          │
└───────────────────┬─────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ Hugging Face Transformers                       │
│                                                 │
│ Model: Qwen/Qwen2.5-1.5B-Instruct (default)     │
│ Backend: auto-detected (CUDA › MPS › CPU)       │
│ Streaming: TextIteratorStreamer (token-by-      │
│            token, runs in a background thread)  │
└─────────────────────────────────────────────────┘
```
- Transcript fetch: `youtube-transcript-api` extracts the video's captions (auto-generated or manual). No audio download or speech-to-text is involved.
- Extractive compression: If the transcript exceeds 3000 characters, the most informative sentences are selected using TF-IDF-style word-frequency scoring, reducing the text by roughly half before it is passed to the LLM.
- Chunking: The compressed text is split into sentence-boundary-aware chunks of at most 4000 characters so that each chunk fits within the model's context window after tokenization.
- LLM summarization: Each chunk is passed to the language model, which generates a summary and 5 relevant hashtags in the requested language.
- Streaming: When `stream=true`, generation uses `TextIteratorStreamer` from the `transformers` library. The model runs in a background thread and tokens are yielded to the HTTP response as NDJSON events in real time. The UI appends each token to the display as it arrives.
- Hashtag parsing: Once generation is complete, `#hashtags` produced by the LLM are extracted from the accumulated output via regex and returned as a separate NDJSON event, rendered as colored badges in the UI.
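The sentence-boundary-aware chunking step can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name and the exact sentence-splitting regex are assumptions:

```python
import re

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking only at sentence boundaries so each chunk stays coherent."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is then summarized independently, which is why keeping sentences intact matters: a chunk cut mid-sentence would hand the model incomplete context.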
This project is designed to run efficiently on a variety of hardware configurations:
- NVIDIA GPUs (CUDA): Automatically detected for accelerated inference using `bfloat16`.
- Apple Silicon (MPS): Native acceleration for M-series chips via Metal Performance Shaders. Includes automatic fallback for operations not yet implemented in MPS.
- CPU: Standard fallback for machines without dedicated GPU resources.
Design choices for performance:
- Dynamic Precision: Automatically selects `bfloat16` (CUDA), `float16` (MPS), or `float32` (CPU) based on your hardware.
- Extractive pre-compression: Reduces the text the LLM must process by ~50%, cutting inference time significantly regardless of hardware.
- Smart Chunking: Splits text by sentence boundaries to ensure context remains intact for the model.
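The backend and precision selection described above can be sketched like this. The function name is illustrative and the priority order simply mirrors the CUDA › MPS › CPU order stated earlier; the project's actual detection logic may differ:

```python
import torch

def detect_device_and_dtype() -> tuple[str, torch.dtype]:
    """Pick the best available backend and a matching precision."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16   # NVIDIA GPUs
    if torch.backends.mps.is_available():
        return "mps", torch.float16     # Apple Silicon (Metal)
    return "cpu", torch.float32         # portable fallback
```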
```bash
cd TubeTrim
uv sync
```

`uv sync` creates a `.venv` virtual environment and installs all dependencies.
The language model itself (~3.5 GB) is not downloaded during `uv sync`; it is downloaded automatically on the first request to the API.
```bash
cp .env.example .env
```

Edit `.env` to customise the model or generation parameters. All values have sensible defaults.
| Variable | Default | Description |
|---|---|---|
| `HF_MODEL` | `Qwen/Qwen2.5-1.5B-Instruct` | Hugging Face model ID |
| `MODEL_TEMPERATURE` | `0.2` | Sampling temperature; low values produce focused, deterministic output |
| `MODEL_MAX_TOKENS` | `512` | Maximum tokens to generate per chunk |
| `MODEL_TOP_P` | `0.9` | Top-p (nucleus) sampling threshold |
| `MODEL_REPETITION_PENALTY` | `1.0` | Penalty for repeating tokens (`1.0` = no penalty) |
| `API_HOST` | `0.0.0.0` | FastAPI bind address |
| `API_PORT` | `8000` | FastAPI port |
| `GRADIO_SERVER_PORT` | `7860` | Gradio UI port |
Two terminals are required — one for the API and one for the UI.
Terminal 1 — API server:
```bash
uv run yt-summarizer-api
```

On first launch the model is downloaded from Hugging Face Hub and cached locally (usually in `~/.cache/huggingface/hub`). Subsequent starts load directly from the cache.
The API is available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.
Terminal 2 — Web UI:
```bash
uv run yt-summarizer-ui
```

Open http://localhost:7860 in your browser.
Edit `HF_MODEL` in `.env` and restart the API server. The new model is downloaded automatically.

```env
HF_MODEL=Qwen/Qwen2.5-32B-Instruct
```

```bash
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English"}'
```

Response:
```json
{
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "language": "English",
  "summary": "...",
  "hashtags": ["#music", "#love", "#dance"],
  "backend": "cpu"
}
```

```bash
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English", "stream": true}'
```

The response is NDJSON, one JSON object per line:
| Event type | Description |
|---|---|
| `{"type": "status", "content": "..."}` | Progress message (loading, compressing, chunking) |
| `{"type": "token", "content": "..."}` | A generated token; append to previous tokens to build the summary |
| `{"type": "hashtags", "content": [...]}` | Extracted hashtags, sent once at the end |
| `{"type": "error", "content": "..."}` | Error message |
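A streaming client can consume these events with nothing but the standard library. This is a sketch, not shipped code: the function names are made up, and it assumes the API is running on the default local port shown above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/summarize"  # default from this README

def collect_events(lines):
    """Fold NDJSON event lines into (summary_text, hashtags)."""
    tokens, hashtags = [], []
    for line in lines:
        event = json.loads(line)
        if event["type"] == "token":
            tokens.append(event["content"])   # build the summary token by token
        elif event["type"] == "hashtags":
            hashtags = event["content"]       # sent once, at the end
        elif event["type"] == "error":
            raise RuntimeError(event["content"])
    return "".join(tokens), hashtags

def stream_summary(youtube_url: str, language: str = "English"):
    """POST with stream=true and consume the NDJSON response line by line."""
    body = json.dumps({"youtube_url": youtube_url,
                       "language": language, "stream": True}).encode()
    req = urllib.request.Request(API_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return collect_events(resp)  # file-like: iterates one line at a time
```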
```bash
curl http://localhost:8000/health
```

Building a reliable YouTube summarizer involves a surprising number of moving parts. Each supported model architecture, hardware backend, and transcript format requires its own handling to load, process, and generate output correctly.
We welcome contributions of all kinds, especially in areas like:
- Adding support for new LLM architectures (e.g. encoder-decoder models like BART or T5)
- Improving transcript compression and chunking strategies
- Expanding hardware support and performance optimizations
- Enhancing the Gradio UI or API interface
- Adding fallback audio transcription for YouTube videos without captions (currently these return an error)
- Adding support for gated models (e.g. `google/gemma-2-2b`, Meta LLaMA) with smoother `HF_TOKEN` handling and license prompting
- Writing tests, documentation, or usage examples
Whether you're experienced with NLP pipelines or just getting started with local LLMs, your contributions are appreciated.
TubeTrim is released under the MIT License.

