Inspiration

Background

The global audio description services market size is projected to grow from USD 0.39 billion in 2024 to USD 0.64 billion by 2034, exhibiting a CAGR of 4.43% during the forecast period.

Problem

Audio description remains largely confined to movies. Despite efforts from many video platforms, such as YouTube and other streaming services, many users report limited availability of audio description support.

Opportunities

With the expansion of digital content, there is an opportunity to bridge this gap and make videos more inclusive and accessible for everyone.

What It Does

ADA (Audio Description Agent) is a Discord-based service that transforms any video into an accessible experience for visually impaired users through real-time AI-generated audio descriptions. Here’s how it works:

  1. Voice Command: Users speak their request in Discord (e.g., “Play ‘Avatar’ from 30:00 to 40:00”). No complex commands—just natural language.
  2. Automated Processing:
    • Searches for the video using YouTube Data API.
    • Captures key frames (2s/frame via FFmpeg).
    • Analyzes scenes with OpenAI GPT-4o and generates descriptive narrations.
  3. Instant Delivery:
    • Returns a video link alongside an MP3 audio track containing synchronized scene descriptions.
    • Playback combines the original video with AI-generated audio, mimicking professional human-described content.
  4. Key Features:
    • Multi-language Support: 20+ languages via Discord’s voice interface.
    • Low Latency: Processes audio descriptions within 1.5× the video’s runtime.
    • Scalable Storage: Metadata (timestamps, text segments) organized in Google Sheets.
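The voice-command step above can be sketched as a small parser that turns a spoken request into a title and a time range. The function names and the exact phrasing accepted are our assumptions for illustration, not the project's actual code:

```python
import re

# Matches utterances of the form: Play 'Avatar' from 30:00 to 40:00
# (hypothetical grammar; the real bot accepts freer natural language)
CMD = re.compile(
    r"play\s+[\"']?(?P<title>.+?)[\"']?\s+from\s+"
    r"(?P<start>\d{1,2}:\d{2})\s+to\s+(?P<end>\d{1,2}:\d{2})",
    re.IGNORECASE,
)

def to_seconds(ts: str) -> int:
    """Convert an MM:SS timestamp to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)

def parse_request(text: str):
    """Return (title, start_s, end_s), or None if the utterance doesn't match."""
    m = CMD.search(text)
    if not m:
        return None
    return m.group("title"), to_seconds(m.group("start")), to_seconds(m.group("end"))
```

For example, `parse_request("Play 'Avatar' from 30:00 to 40:00")` yields `("Avatar", 1800, 2400)`, which the pipeline can hand to the YouTube search and frame-capture steps.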

How We Built It

System Architecture

  • Frontend:
    • Discord Bot: Built with Python’s discord.py, handling voice commands and MP3 delivery.
    • Minimalist UI: Users speak directly in a channel—no "@" mentions or slash commands.
  • Middleware:
    • Make.com Automation: Orchestrates video search, frame extraction, and data routing.
  • Backend:
    • AI Modules:
      • GPT-4o for scene analysis (5 images/group, 1024-token context).
      • GPT-3.5-turbo for text alignment and continuity fixes.
    • TTS Synthesis: ElevenLabs Multilingual v2 (neutral narration style, 140–180 WPM).
    • Storage: Google Drive (media files) + Google Sheets (metadata tracking).
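The frame-capture step in the backend (one frame every 2 seconds) can be sketched as a thin FFmpeg wrapper. The function names are ours, and the exact flags are an assumption about how the pipeline invokes FFmpeg:

```python
import subprocess
from pathlib import Path

def ffmpeg_frame_args(video_path: str, out_pattern: str,
                      interval_s: float = 2.0) -> list[str]:
    # fps=1/interval_s emits one frame per `interval_s` seconds of video;
    # -q:v 2 keeps JPEG quality high for the downstream GPT-4o analysis.
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval_s}",
        "-q:v", "2", "-y", out_pattern,
    ]

def extract_frames(video_path: str, out_dir: str,
                   interval_s: float = 2.0) -> list[str]:
    """Run FFmpeg and return the sorted list of captured frame paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    subprocess.run(ffmpeg_frame_args(video_path, pattern, interval_s),
                   check=True, capture_output=True)
    return sorted(str(p) for p in Path(out_dir).glob("frame_*.jpg"))
```

The resulting JPEGs are what gets batched (5 images per group) into the GPT-4o scene-analysis calls.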

Technical Innovations

  • Parallel Processing: FFmpeg frame extraction and GPT-4o calls run concurrently.
  • Dynamic Frame Sampling: Adjusts capture intervals (1–4s) based on motion detection.
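The dynamic frame sampling above can be approximated with a crude motion score: compare consecutive grayscale frames, sample densely (1 s) when motion is high and relax toward 4 s when the scene is static. The threshold value and the linear relaxation are our assumptions:

```python
def next_interval(prev_frame, cur_frame, lo: float = 1.0, hi: float = 4.0,
                  threshold: float = 30.0) -> float:
    """Pick the next capture interval (lo-hi seconds) from a motion score.

    Frames are flat sequences of grayscale pixel values; the motion score
    is the mean absolute pixel difference between consecutive frames.
    High motion -> sample densely (lo); low motion -> stretch toward hi.
    """
    diff = sum(abs(a - b) for a, b in zip(prev_frame, cur_frame)) / len(cur_frame)
    if diff >= threshold:
        return lo
    # Linearly relax toward the widest interval as motion fades.
    return lo + (hi - lo) * (1.0 - diff / threshold)
```

A static shot (identical frames) maps to the 4 s interval, while a hard cut (large pixel difference) drops back to 1 s sampling.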

Challenges We Ran Into

  • Initial sync accuracy of 78% → improved to 86% via dynamic TTS speed adjustment.
  • Peak-load pressure on the AI APIs → mitigated with batch processing and a fallback to GPT-3.5-turbo.
  • Make.com’s non-realtime triggers caused delays → pre-cached popular videos.
  • Generic scene descriptions → iteratively tuned GPT-4o prompts to prioritize spatial and action descriptors (e.g., "slow pan to the left").
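The dynamic TTS speed adjustment behind the sync-accuracy gain can be sketched as: estimate the narration's natural duration from its word count and the 140–180 WPM narration target, then scale playback so it fits the available gap. The clamp bounds are our assumption to keep speech intelligible:

```python
def tts_speed(text: str, gap_s: float, base_wpm: float = 160.0,
              min_rate: float = 0.8, max_rate: float = 1.3) -> float:
    """Playback-rate multiplier so narration fits a `gap_s`-second gap.

    natural duration = words / (base_wpm / 60); rate = natural / gap,
    clamped so sped-up or slowed-down speech stays intelligible.
    """
    words = len(text.split())
    natural_s = words / (base_wpm / 60.0)
    rate = natural_s / gap_s
    return max(min_rate, min(max_rate, rate))
```

A 160-word description at 160 WPM naturally takes 60 s; given a 60 s gap the rate is 1.0, while a 30 s gap hits the 1.3× ceiling, signaling that the description text itself should be shortened.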

What We Learned

  • Balancing Speed & Quality: Hybrid AI pipelines (GPT-4o + GPT-3.5-turbo) reduce latency without sacrificing accuracy.
  • Design Simplicity Matters: Voice-only commands increased user adoption by 40% in early tests.
  • Scalability Trade-offs: Third-party APIs accelerated development but introduced dependency risks.
  • Ethical AI: Audited training data to eliminate biased descriptions (e.g., avoiding gender/racial assumptions).

Future Development

Short-Term (2025)

  • Run a lightweight GPT-4o Mini on NVIDIA Jetson Orin for offline, on-device processing.
  • Prototype smart glasses integration with Sony IMX678 camera and bone-conduction audio.

Long-Term (2026+)

  • Develop a collaborative annotation platform to crowdsource refinement of AI outputs, building a high-quality dataset for improved accuracy.
  • Implement vision transformers to identify emergency scenes (e.g., fire, falls) and trigger real-time alerts for user safety.

Vision: Expand beyond entertainment into navigation and social interaction, creating a universal accessibility layer for the visually impaired.

Team Members and Roles

  1. Annan Yang: Product Design/Developer & Testing

    • Contributed to the overall system development and the ElevenLabs-related workflow.
    • Led testing efforts to improve AI model performance and system accuracy.
  2. Brett (Yirong) Bian: Developer & Workflow Specialist

    • Developed the backend and built workflows using Make.com for video search, frame extraction, and data routing.
    • Worked on integrating ElevenLabs TTS synthesis and developed associated workflows.
  3. Chenghong Tang: Developer & Automation Specialist

    • Developed the Discord bot and integrated voice command functionality for smooth user interaction.
    • Contributed to building workflows and automation processes throughout the system.
  4. Emily (Yingyi) Lei: Product Design & Prompt Engineer

    • Conducted technical research and designed the overall product concept, key features, and user experience.
    • Responsible for prompt engineering of models to ensure accurate scene descriptions and functionality.

Built With

  • Python (discord.py)
  • FFmpeg
  • OpenAI GPT-4o & GPT-3.5-turbo
  • ElevenLabs Multilingual v2
  • Make.com
  • YouTube Data API
  • Google Drive & Google Sheets