TubeTrim - Trim the Mess. Keep the Best.

⚠️ Note: The quality, language accuracy, and coherence of the generated summaries depend heavily on the model you choose.

Why TubeTrim?

TubeTrim is a free, open-source YouTube video summarizer that runs locally on your machine. It automatically detects and uses your available hardware (CPU, NVIDIA GPU, or Apple Silicon MPS). Feed it a YouTube link and get a written summary generated locally; no API keys required.

                YouTube URL
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│              Gradio Web UI  (port 7860)         │
│  - Takes YouTube URL + language from the user   │
│  - Calls the API with stream=true               │
│  - Renders tokens progressively as they arrive  │
└───────────────────┬─────────────────────────────┘
                    │ 
                    ▼
┌─────────────────────────────────────────────────┐
│           FastAPI Backend  (port 8000)          │
│                                                 │
│  1. Fetch transcript via youtube-transcript-api │
│  2. Compress long transcripts (extractive NLP)  │
│  3. Split into chunks (≤4000 chars each)        │
│  4. Run each chunk through the LLM              │
│  5. Stream tokens back as NDJSON events         │
│  6. Parse hashtags generated by the LLM         │
└───────────────────┬─────────────────────────────┘
                    │ 
                    ▼
┌─────────────────────────────────────────────────┐
│           Hugging Face Transformers             │
│                                                 │
│  Model: Qwen/Qwen2.5-1.5B-Instruct (default)    │
│  Backend: auto-detected (CUDA › MPS › CPU)      │
│  Streaming: TextIteratorStreamer (token-by-     │
│             token, runs in a background thread) │
└─────────────────────────────────────────────────┘

Step-by-Step Flow

  1. Transcript fetch: youtube-transcript-api extracts the video's captions (auto-generated or manual). No audio download or speech-to-text involved.
  2. Extractive compression: If the transcript exceeds 3000 characters, the most informative sentences are selected using TF-IDF-style word frequency scoring, reducing the text roughly by half before it is passed to the LLM.
  3. Chunking: The compressed text is split into sentence-boundary-aware chunks of at most 4000 characters so that each chunk fits within the model's context window after tokenization.
  4. LLM summarization: Each chunk is passed to the language model, which generates a summary and 5 relevant hashtags in the requested language.
  5. Streaming: When stream=true, generation uses TextIteratorStreamer from the transformers library. The model runs in a background thread and tokens are yielded to the HTTP response as NDJSON events in real time. The UI appends each token to the display as it arrives.
  6. Hashtag parsing: Once generation is complete, #hashtags produced by the LLM are extracted from the accumulated output via regex and returned as a separate NDJSON event, rendered as colored badges in the UI.
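Steps 2 and 3 can be sketched roughly as follows. This is an illustrative reimplementation, not the project's actual code; the function names and thresholds simply mirror the description above.

```python
import re
from collections import Counter

def compress_transcript(text: str, max_chars: int = 3000) -> str:
    """Keep the most informative sentences, scored by word frequency,
    until roughly half the original length remains."""
    if len(text) <= max_chars:
        return text
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(s: str) -> float:
        # Average frequency of the sentence's words across the whole text.
        ws = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in ws) / max(len(ws), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    budget = len(text) // 2          # aim for ~50% of the original
    keep, used = set(), 0
    for i in ranked:
        if used >= budget:
            break
        keep.add(i)
        used += len(sentences[i])
    # Re-emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(keep))

def chunk_text(text: str, max_len: int = 4000) -> list[str]:
    """Split on sentence boundaries into chunks of at most max_len chars."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if current and len(current) + len(sentence) + 1 > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunks never break mid-sentence, each one reads as coherent input to the model, which matters more for summary quality than hitting the character limit exactly.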

Hardware Support

This project is designed to run efficiently on a variety of hardware configurations:

  • NVIDIA GPUs (CUDA): Automatically detected for accelerated inference using bfloat16.
  • Apple Silicon (MPS): Native acceleration for M-series chips via Metal Performance Shaders. Includes automatic fallback for operations not yet implemented in MPS.
  • CPU: Standard fallback for machines without dedicated GPU resources.

Design choices for performance:

  • Dynamic Precision: Automatically selects bfloat16 (CUDA), float16 (MPS), or float32 (CPU) based on your hardware.
  • Extractive pre-compression: Reduces the text the LLM must process by ~50%, cutting inference time significantly regardless of hardware.
  • Smart Chunking: Splits text by sentence boundaries to ensure context remains intact for the model.
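A minimal sketch of what this auto-detection might look like (illustrative only; the project's actual logic may differ). It degrades gracefully to CPU when torch is not installed:

```python
def detect_backend() -> tuple[str, str]:
    """Return a (device, dtype) pair picked from the available hardware.
    Falls back to CPU / float32 when torch is unavailable."""
    try:
        import torch
    except ImportError:
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "bfloat16"    # NVIDIA GPUs: bfloat16
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps", "float16"      # Apple Silicon via Metal
    return "cpu", "float32"          # universal fallback
```

The returned pair would then drive where the model is placed and which precision it is loaded in.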

Prerequisites

  1. Python 3.10+ (3.12 recommended)
  2. uv (Python package manager)

Installation

git clone https://github.com/GuglielmoCerri/TubeTrim.git
cd TubeTrim
uv sync

uv sync creates a .venv virtual environment and installs all dependencies.

The language model itself (~3.5 GB) is not downloaded during uv sync. It is downloaded automatically on the first request to the API.

Configuration

cp .env.example .env

Edit .env to customise the model or generation parameters. All values have sensible defaults.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| HF_MODEL | Qwen/Qwen2.5-1.5B-Instruct | Hugging Face model ID |
| MODEL_TEMPERATURE | 0.2 | Sampling temperature; low values produce focused, deterministic output |
| MODEL_MAX_TOKENS | 512 | Maximum tokens to generate per chunk |
| MODEL_TOP_P | 0.9 | Top-p (nucleus) sampling threshold |
| MODEL_REPETITION_PENALTY | 1.0 | Penalty for repeating tokens (1.0 = no penalty) |
| API_HOST | 0.0.0.0 | FastAPI bind address |
| API_PORT | 8000 | FastAPI port |
| GRADIO_SERVER_PORT | 7860 | Gradio UI port |
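For example, a .env that trades a little determinism for longer, less repetitive summaries might look like this (the values are illustrative, not recommendations from the project):

```shell
HF_MODEL=Qwen/Qwen2.5-1.5B-Instruct
MODEL_TEMPERATURE=0.3
MODEL_MAX_TOKENS=768
MODEL_TOP_P=0.9
MODEL_REPETITION_PENALTY=1.1
API_PORT=8000
GRADIO_SERVER_PORT=7860
```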

Running the Application

Two terminals are required — one for the API and one for the UI.

Terminal 1 — API server:

uv run yt-summarizer-api

On first launch the model is downloaded from Hugging Face Hub and cached locally (usually in ~/.cache/huggingface/hub). Subsequent starts load from the cache instantly.

The API is available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.

Terminal 2 — Web UI:

uv run yt-summarizer-ui

Open http://localhost:7860 in your browser.

Changing the Model

Edit HF_MODEL in .env and restart the API server. The new model is downloaded automatically.

HF_MODEL=Qwen/Qwen2.5-32B-Instruct

API Usage

Non-streaming

curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English"}'

Response:

{
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "language": "English",
  "summary": "...",
  "hashtags": ["#music", "#love", "#dance"],
  "backend": "cpu"
}

Streaming

curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"youtube_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "English", "stream": true}'

The response is NDJSON — one JSON object per line:

| Event type | Description |
|---|---|
| {"type": "status", "content": "..."} | Progress message (loading, compressing, chunking) |
| {"type": "token", "content": "..."} | A generated token; append to previous tokens to build the summary |
| {"type": "hashtags", "content": [...]} | Extracted hashtags, sent once at the end |
| {"type": "error", "content": "..."} | Error message |
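Consuming the stream from Python can be sketched with only the standard library. The endpoint and event shapes are exactly those documented above; the function names here are illustrative:

```python
import json
import urllib.request

def summarize_stream(url: str, youtube_url: str, language: str = "English"):
    """POST to /summarize with stream=true and yield parsed NDJSON events."""
    payload = json.dumps({"youtube_url": youtube_url,
                          "language": language,
                          "stream": True}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:          # one JSON object per line
            if line.strip():
                yield json.loads(line)

def collect(events) -> dict:
    """Fold a stream of events into the final summary and hashtags."""
    summary, hashtags = [], []
    for event in events:
        if event["type"] == "token":
            summary.append(event["content"])
        elif event["type"] == "hashtags":
            hashtags = event["content"]
        elif event["type"] == "error":
            raise RuntimeError(event["content"])
    return {"summary": "".join(summary), "hashtags": hashtags}
```

Calling collect(summarize_stream("http://localhost:8000/summarize", some_video_url)) yields the same summary and hashtags fields as the non-streaming response.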

Health Check

curl http://localhost:8000/health

Contributing

Building a reliable YouTube summarizer involves a surprising number of moving parts. Each supported model architecture, hardware backend, and transcript format requires its own handling to correctly load, process, and generate output.

We welcome contributions of all kinds, especially in areas like:

  • Adding support for new LLM architectures (e.g. encoder-decoder models like BART or T5)
  • Improving transcript compression and chunking strategies
  • Expanding hardware support and performance optimizations
  • Enhancing the Gradio UI or API interface
  • Adding fallback audio transcription for YouTube videos without captions (currently these return an error)
  • Adding support for gated models (e.g. google/gemma-2-2b, Meta LLaMA) with smoother HF_TOKEN handling and license prompting
  • Writing tests, documentation, or usage examples

Whether you're experienced with NLP pipelines or just getting started with local LLMs, your contributions are appreciated.

License

TubeTrim is released under the MIT License.

About

Summarize any YouTube video in 12 languages using open-source LLMs without API keys.
