A free, open-source YouTube video summarizer that runs locally on your machine. Automatically detects and utilizes your available hardware (CPU, NVIDIA GPU, or Apple Silicon MPS). Feed it a YouTube link and get a written summary generated locally.
```
YouTube URL
     │
     ▼
┌─────────────────────────────────────────────────┐
│ Gradio Web UI (port 7860)                       │
│ - Takes YouTube URL + language from the user    │
│ - Calls the API with stream=true                │
│ - Renders tokens progressively as they arrive   │
└───────────────────┬─────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ FastAPI Backend (port 8000)                     │
│                                                 │
│ 1. Fetch transcript via youtube-transcript-api  │
│ 2. Compress long transcripts (extractive NLP)   │
│ 3. Split into chunks (≤4000 chars each)         │
│ 4. Run each chunk through the LLM               │
│ 5. Stream tokens back as NDJSON events          │
│ 6. Parse hashtags generated by the LLM          │
└───────────────────┬─────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│ Hugging Face Transformers                       │
│                                                 │
│ Model: Qwen/Qwen2.5-1.5B-Instruct (default)     │
│ Backend: auto-detected (CUDA › MPS › CPU)       │
│ Streaming: TextIteratorStreamer (token-by-      │
│            token, runs in a background thread)  │
└─────────────────────────────────────────────────┘
```
- Transcript fetch: `youtube-transcript-api` extracts the video's captions (auto-generated or manual). No audio download or speech-to-text is involved.
- Extractive compression: If the transcript exceeds 3000 characters, the most informative sentences are selected using TF-IDF-style word-frequency scoring, reducing the text by roughly half before it is passed to the LLM.
- Chunking: The compressed text is split into sentence-boundary-aware chunks of at most 4000 characters so that each chunk fits within the model's context window after tokenization.
- LLM summarization: Each chunk is passed to the language model, which generates a summary and 5 relevant hashtags in the requested language.
- Streaming: When `stream=true`, generation uses `TextIteratorStreamer` from the `transformers` library. The model runs in a background thread and tokens are yielded to the HTTP response as NDJSON events in real time. The UI appends each token to the display as it arrives.
- Hashtag parsing: Once generation is complete, `#hashtags` produced by the LLM are extracted from the accumulated output via regex and returned as a separate NDJSON event, rendered as colored badges in the UI.
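The sentence-boundary-aware chunking step can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name and the exact sentence-splitting regex are assumptions:

```python
import re

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    breaking only at sentence boundaries so each chunk stays coherent."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is then summarized independently, which is why keeping sentences intact matters: a chunk cut mid-sentence would hand the model incomplete context.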
This project is designed to run efficiently on a variety of hardware configurations:
- NVIDIA GPUs (CUDA): Automatically detected for accelerated inference using `bfloat16`.
- Apple Silicon (MPS): Native acceleration for M-series chips via Metal Performance Shaders. Includes automatic fallback for operations not yet implemented in MPS.
- CPU: Standard fallback for machines without dedicated GPU resources.
Design choices for performance:
- Dynamic Precision: Automatically selects `bfloat16` (CUDA), `float16` (MPS), or `float32` (CPU) based on your hardware.
- Extractive pre-compression: Reduces the text the LLM must process by ~50%, cutting inference time significantly regardless of hardware.
- Smart Chunking: Splits text by sentence boundaries to ensure context remains intact for the model.
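The backend and precision selection described above can be sketched like this. The function name is illustrative and the priority order simply mirrors the CUDA › MPS › CPU order stated earlier; the project's actual detection logic may differ:

```python
import torch

def detect_device_and_dtype() -> tuple[str, torch.dtype]:
    """Pick the best available backend and a matching precision."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16   # NVIDIA GPUs
    if torch.backends.mps.is_available():
        return "mps", torch.float16     # Apple Silicon (Metal)
    return "cpu", torch.float32         # portable fallback
```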
```bash
cd TubeTrim
uv sync
```

`uv sync` creates a `.venv` virtual environment and installs all dependencies.
The language model itself (~3.5 GB) is not downloaded during `uv sync`; it is downloaded automatically on the first request to the API.
```bash
cp .env.example .env
```

Edit `.env` to customise the model or generation parameters. All values have sensible defaults.
| Variable | Default | Description |
|---|---|---|
| `HF_MODEL` | `Qwen/Qwen2.5-1.5B-Instruct` | Hugging Face model ID |
| `MODEL_TEMPERATURE` | `0.2` | Sampling temperature; low values produce focused, deterministic output |
| `MODEL_MAX_TOKENS` | `512` | Maximum tokens to generate per chunk |
| `MODEL_TOP_P` | `0.9` | Top-p (nucleus) sampling threshold |
| `MODEL_REPETITION_PENALTY` | `1.0` | Penalty for repeating tokens (`1.0` = no penalty) |
| `API_HOST` | `0.0.0.0` | FastAPI bind address |
| `API_PORT` | `8000` | FastAPI port |
| `GRADIO_SERVER_PORT` | `7860` | Gradio UI port |
Two terminals are required — one for the API and one for the UI.
Terminal 1 — API server:
```bash
uv run yt-summarizer-api
```

On first launch the model is downloaded from Hugging Face Hub and cached locally (usually in `~/.cache/huggingface/hub`). Subsequent starts load directly from the cache.
The API is available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.
Terminal 2 — Web UI:
```bash
uv run yt-summarizer-ui
```

Open http://localhost:7860 in your browser.
Edit `HF_MODEL` in `.env` and restart the API server. The new model is downloaded automatically.

```env
HF_MODEL=Qwen/Qwen2.5-32B-Instruct
```

```bash
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English"}'
```

Response:
```json
{
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "language": "English",
  "summary": "...",
  "hashtags": ["#music", "#love", "#dance"],
  "backend": "cpu"
}
```

```bash
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English", "stream": true}'
```

The response is NDJSON, one JSON object per line:
| Event type | Description |
|---|---|
| `{"type": "status", "content": "..."}` | Progress message (loading, compressing, chunking) |
| `{"type": "token", "content": "..."}` | A generated token; append to previous tokens to build the summary |
| `{"type": "hashtags", "content": [...]}` | Extracted hashtags, sent once at the end |
| `{"type": "error", "content": "..."}` | Error message |
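A streaming client can consume these events with nothing but the standard library. This is a sketch, not shipped code: the function names are made up, and it assumes the API is running on the default local port shown above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/summarize"  # default from this README

def collect_events(lines):
    """Fold NDJSON event lines into (summary_text, hashtags)."""
    tokens, hashtags = [], []
    for line in lines:
        event = json.loads(line)
        if event["type"] == "token":
            tokens.append(event["content"])   # build the summary token by token
        elif event["type"] == "hashtags":
            hashtags = event["content"]       # sent once, at the end
        elif event["type"] == "error":
            raise RuntimeError(event["content"])
    return "".join(tokens), hashtags

def stream_summary(youtube_url: str, language: str = "English"):
    """POST with stream=true and consume the NDJSON response line by line."""
    body = json.dumps({"youtube_url": youtube_url,
                       "language": language, "stream": True}).encode()
    req = urllib.request.Request(API_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return collect_events(resp)  # file-like: iterates one line at a time
```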
```bash
curl http://localhost:8000/health
```

Building a reliable YouTube summarizer involves a surprising number of moving parts. Each supported model architecture, hardware backend, and transcript format requires its own handling to load, process, and generate output correctly.
We welcome contributions of all kinds, especially in areas like:
- Adding support for new LLM architectures (e.g. encoder-decoder models like BART or T5)
- Improving transcript compression and chunking strategies
- Expanding hardware support and performance optimizations
- Enhancing the Gradio UI or API interface
- Adding fallback audio transcription for YouTube videos without captions (currently these return an error)
- Adding support for gated models (e.g. `google/gemma-2-2b`, Meta LLaMA) with smoother `HF_TOKEN` handling and license prompting
- Writing tests, documentation, or usage examples
Whether you're experienced with NLP pipelines or just getting started with local LLMs, your contributions are appreciated.
TubeTrim is released under the MIT License.

