Inspiration

We were frustrated with YouTube's auto-generated captions: they're notoriously inaccurate, often missing vital context, hallucinating words, or lacking basic punctuation. Millions of hard-of-hearing users, language learners, and everyday viewers are settling for a sub-par experience. We wanted to build a tool that doesn't just transcribe words verbatim, but actually understands context to deliver flawless, readable subtitles in real time.

What it does

Better YouTube Captions is a real-time, low-latency caption generator that completely upgrades the default YouTube viewing experience. Our companion Chrome extension seamlessly triggers a pipeline that extracts audio from the video and, within milliseconds, generates highly accurate, grammatically correct custom subtitles, injecting them directly back into your browser's video player without missing a beat. It also caches the generated SRT file on the server's filesystem so it can be reused later!

How we built it

Our project relies on a powerful, low-latency streaming pipeline. Here is our end-to-end architecture:

Main flow
[Chrome Extension]
        │
        │  Sends YouTube URL (WebSocket)
        ▼
[Node.js Backend (Vultr)]
        │
        │  Check cache (by video ID)
        ├───────────────┐
        │               │
        │ Cache hit     │ Cache miss
        ▼               ▼
[SRT Cache]         [yt-dlp]
        │               │
        │ Load SRT      │ Extracts continuous audio
        ▼               ▼
[Stream Captions]   [FFmpeg]
        │               │
        │               │ Transcodes + chunks audio
        │               ▼
        │           [Data Buffers]
        │               │
        │               │ Streams audio
        │               ▼
        │       [ElevenLabs API]
        │               │
        │               │ Transcript + timestamps
        │               ▼
        │       [Gemini Context Manager]
        │               │
        │               ▼
        │       [Google Gemini API]
        │               │
        │               ▼
        │       [Timestamp Sync Logic]
        │               │
        │               ▼
        │       [Final Captions]
        │               │
        │               │ Streams back (WebSocket)
        ▼               ▼
[Chrome Extension]  [Chrome Extension]
        │
        │ Inject captions
        ▼
[YouTube Video]

Side effect (not part of the main flow)

[Final Captions]
        │
        │ IF complete
        ▼
[Save SRT → Filesystem Cache]
  • UI: A custom Chrome extension sends the current video URL to the backend and handles injecting the new subtitles seamlessly into the YouTube player.
  • Backend Hosting: Our robust Node.js server is deployed on Vultr Cloud to ensure stable, high-speed networking and processing. It triggers the pipeline on a cache miss and otherwise returns the cached .srt file.
  • Audio Extraction & Processing: We use yt-dlp to fetch the audio stream directly from the YouTube URL. This stream is piped directly into FFmpeg, which instantly transcodes the raw audio into an optimized format and chunks it into continuous data buffers.
  • Speech-to-Text: We leverage ElevenLabs for its lightning-fast, highly accurate transcription capabilities. Specifically, we utilize their WebSocket-based real-time transcription API to continuously stream the FFmpeg audio chunks, receiving transcriptions back in milliseconds as the words are spoken.
  • Contextual Correction Engine: Finally, the transcriptions are passed through Google Gemini's fast models. Gemini acts as an intelligent, real-time "sanity pass" to correct grammatical errors, fix STT hallucinations, and maintain perfect context before the text is sent back to the client.

Challenges we ran into

  • Audio Extraction with yt-dlp: Using yt-dlp to fetch live audio wasn't trivial at all. We struggled with unreliable package wrappers and broken data pipes before finally relying on the precompiled yt-dlp binary executed natively on our backend to guarantee a stable, continuous stream.
  • Real-time Streaming Architecture: Implementing a true real-time streaming pipeline end-to-end was incredibly complex: we had to balance low-latency chunking via FFmpeg against feeding ElevenLabs and Gemini batches large enough to maintain accuracy, all while streaming the data back to the extension via WebSockets.
  • Preserving Sync: Passing transcriptions through an LLM inherently strips away the exact timing of individual words. We had to build careful logic to map Gemini's dynamically corrected text back to the original granular timestamps from ElevenLabs so the captions remain perfectly synced to the video.
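The sync-preserving step could be sketched as below: the LLM returns corrected text with no timing, so the original per-word timestamps from the STT output are re-attached. This simple index-based alignment is an assumption for illustration; the real mapping may be more sophisticated.

```javascript
// originalWords: [{ word, startMs, endMs }] from the STT engine.
// correctedText: the LLM's corrected transcript, timing-free.
function resyncWords(originalWords, correctedText) {
  const corrected = correctedText.split(/\s+/).filter(Boolean);
  const n = originalWords.length;
  return corrected.map((word, i) => {
    // Clamp the index so any extra corrected words reuse the
    // last known timestamp instead of falling off the array.
    const src = originalWords[Math.min(i, n - 1)];
    return { word, startMs: src.startMs, endMs: src.endMs };
  });
}
```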

Accomplishments that we're proud of

  • Getting yt-dlp to run. Lol!
  • How genuinely realtime everything is.
  • Deploying a true "click and forget" Chrome extension that works instantly on any YouTube video without requiring complex user configuration.

What we learned

  • Deep, practical knowledge of audio extraction tools like yt-dlp and FFmpeg, continuous WebSocket streaming, and managing data buffers.
  • How to aggressively optimize LLM prompts and API calls—specifically utilizing Gemini's streaming capabilities for low-latency, real-time data processing.

What's next for Better YouTube Captions

  • Multi-language support: Prompting Gemini to not just correct grammar, but translate the captions into dozens of languages on the fly.
  • Seeking to random points in the video: Let the user seek anywhere in the video and run the entire pipeline from that point (if it wasn't previously covered).

Built With
