"Try it out" Link: https://discord.gg/D4QHrrMTh9
"Demo" Link: https://vimeo.com/1059599841
- Speak in the "requests" channel. Just talk naturally; no "@" mention or "/" command is needed.
- Get the result in "result" channel.
"How to use Make_blueprint.json"
- Download JSON file.
- Open Make.com.
- Create a "new scenario".
- Import the downloaded JSON file.
Version: 1.0
Last Updated: February 23, 2025
- Project Overview
- Current Implementation (Phase 1)
- 2.1 System Architecture
- 2.2 Core Modules and Tech Stack
- 2.3 Workflow and Implementation Details
- 2.4 Test Results and Limitations
- Future Development Plan (Phase 2+)
- 3.1 Real-time Image Processing Optimization
- 3.2 Hardware Integration and Wearable Devices
- 3.3 Multimodal Input and Collaboration Features
- Vision and Scalability
Goal: Provide a real-time, high-accuracy audio description service for visually impaired viewers by using AI to replace traditional human movie describers. Target a ≥86% synchronization rate between audio and video while extending to real-time environment perception scenarios.
Core Value:
- Accessibility: Reduce manual costs through automation, increasing service availability.
- Low Latency: Keep end-to-end processing within 1.2× the total video playback time.
- Scalability: A flexible architecture designed for future hardware integration and multimodal inputs.
- Front End: Discord voice interface (supports multiple languages).

- Middleware: Make.com automation workflow engine.

- Back End: Google Drive (storage), Google Sheets (structured data), OpenAI API (image analysis and text generation), ElevenLabs (TTS).

| Module | Implementation | Key Parameters |
|---|---|---|
| Video Search & Capture | YouTube Data API + FFmpeg | Capture interval: 2s/frame, resolution: 720p |
| Image Analysis | OpenAI GPT-4o | 5 images/group, context window: 1024 tokens |
| Text Alignment & Polishing | OpenAI o3-mini | Max length: 400 characters/segment |
| Speech Synthesis | ElevenLabs Multilingual v2 | Voice style: neutral narration, speed: 140–180 WPM |
| Data Storage | Google Drive + Sheets | Structured fields: timestamps, text segments, audio URLs |
- The user issues a voice command in Discord (e.g., “Play ‘Avatar’ from minute 30 to minute 40”).
- The Make.com workflow calls the YouTube Data API; FFmpeg captures screenshots every 2 seconds and stores them in Google Drive (naming convention: <videoID>_<timestamp>.png).
- Every 5 screenshots (covering 10 seconds of video) are sent to GPT-4o to generate a movie-style descriptive text snippet, for example:
“The main character, dressed in blue armor, stands at the edge of floating mountains with a glowing forest in the background. The camera slowly pulls back, revealing a deep chasm below.”
- These snippets are appended to Google Sheets in chronological order, including fields like start_time, end_time, raw_text, and status.
- o3-mini aligns adjacent text snippets, fixing abrupt splits (e.g., merging “Character turns around” and “Door closes” into a continuous action).
- ElevenLabs produces audio clips, each carefully timed to 10 seconds (±0.5s), which are stored in Google Drive and returned to Discord for playback.
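The capture-and-grouping steps above can be sketched in Python. This is a minimal local sketch, not the production path (the real workflow runs inside Make.com against Google Drive); the helper names, output directory layout, and the assumption of a locally downloaded video file and an `ffmpeg` binary on PATH are all illustrative.

```python
import subprocess
from pathlib import Path

CAPTURE_INTERVAL = 2   # seconds per frame (Phase 1 default)
GROUP_SIZE = 5         # 5 frames -> one 10-second description window

def capture_frames(video_path: str, video_id: str, out_dir: str) -> list[Path]:
    """Grab one 720p frame every CAPTURE_INTERVAL seconds with ffmpeg,
    then rename to the <videoID>_<timestamp>.png convention."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vf", f"fps=1/{CAPTURE_INTERVAL},scale=-2:720",
         str(out / "frame_%06d.png")],
        check=True,
    )
    frames = []
    for i, f in enumerate(sorted(out.glob("frame_*.png"))):
        ts = i * CAPTURE_INTERVAL  # seconds from clip start
        frames.append(f.rename(out / f"{video_id}_{ts:06d}.png"))
    return frames

def group_frames(frames: list[Path], size: int = GROUP_SIZE) -> list[list[Path]]:
    """Batch frames into groups of `size` (10 s of video) for one GPT-4o call."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]
```

Each group returned by `group_frames` corresponds to one 10-second description request, which keeps each GPT-4o call within the 1024-token context window noted in the module table.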
| Metric | Current Performance | Optimization Focus |
|---|---|---|
| Audio-Video Sync | 86% | Dynamically adjust TTS speed algorithm |
| End-to-End Latency | 1.5× playback time | Parallel image processing pipeline |
| Scene Continuity | 92% accuracy | Introduce a video action prediction model |
| Main Limitation | Bound by third-party API rate limits | Deploy real-time models (e.g., GPT-4o / GPT-4o mini Realtime) |
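One way to realize the "dynamically adjust TTS speed" optimization is to compute the speaking rate needed to fill each 10-second slot and clamp it to the 140–180 WPM range from the module table. The function below is a sketch of that idea only; how the resulting WPM maps onto ElevenLabs' voice-speed setting is an assumption left out here.

```python
SLOT_SECONDS = 10.0       # target clip length (±0.5 s) from Phase 1
MIN_WPM, MAX_WPM = 140, 180

def fit_wpm(text: str, slot_seconds: float = SLOT_SECONDS) -> float:
    """Words-per-minute needed for `text` to fill the slot, clamped to range.

    Hitting the clamp signals that the text segment itself should be
    shortened or lengthened before synthesis, rather than sped up further.
    """
    words = len(text.split())
    needed = words / (slot_seconds / 60.0)  # raw WPM to fill the slot exactly
    return max(float(MIN_WPM), min(float(MAX_WPM), needed))
```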
- Dynamic Frame Rate Adjustment: Automatically modulate capture intervals (1–4s) based on scene complexity.
- AI Alignment Module:
- Train specialized contrastive learning models to detect semantic discrepancies between narration and video (target: Δ<0.5s).
- Hardware requirement: NVIDIA Jetson Orin (64GB memory with INT8 optimization).
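The dynamic frame-rate idea could be prototyped by mapping a scene-complexity score onto the planned 1–4 s interval range. How complexity is scored (e.g., mean inter-frame pixel difference normalized to 0–1) is an open design choice; the linear mapping below is one simple assumption.

```python
MIN_INTERVAL, MAX_INTERVAL = 1.0, 4.0  # seconds, from the Phase 2 plan

def capture_interval(complexity: float) -> float:
    """Map a 0-1 scene-complexity score to a capture interval.

    Busy scenes (score near 1) are sampled every 1 s; near-static
    scenes (score near 0) every 4 s. Scores are clamped to [0, 1].
    """
    c = max(0.0, min(1.0, complexity))
    return MAX_INTERVAL - c * (MAX_INTERVAL - MIN_INTERVAL)
```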
| Component | Specification | Function |
|---|---|---|
| Camera Module | Sony IMX678 (low-light enhanced) | Real-time environment capture (120° FOV) |
| Audio Output | Bone-conduction speakers (leakage-proof) | Privacy-friendly, suitable for outdoor use |
| Local AI Chip | Qualcomm Snapdragon 8 Gen3 AI Engine | Offline lightweight GPT-4 (4-bit quantization) |
- Multi-user Collaborative Annotation: Crowdsourced refinement of AI-generated descriptions, building a high-quality dataset.
- Emergency Detection: Vision agent identifies hazardous scenes (e.g., fire, falls) and triggers alerts.
Ultimate Goal: Develop a universal perception enhancement system for visually impaired individuals, covering the following scenarios:
- Entertainment: Real-time audio narration for movies, games, and theater.
- Navigation: AR-based route guidance (≥99% obstacle detection accuracy).
- Social Interaction: Face recognition and emotion analysis (“Your friend is 3 meters to the left, smiling”).
Appendix:
- Data Privacy and Compliance (GDPR and HIPAA)
- Third-party API Usage Cost Analysis
- Open-source Repository (Pending Release)

