Inspiration

We were inspired by the need to break down language barriers and make content accessible to a global audience. We saw how difficult and time-consuming it can be to manually dub videos into multiple languages and wanted to create a solution that simplifies this process for everyone.

What it does

DubTok is a web application that automatically dubs videos into any language. Users upload a video, choose the desired target language, and DubTok dubs it automatically. DubTok clones the original speaker's voice for the dub, so dubbed videos sound natural and close to the original.

How we built it

Our frontend web application is built using Next.js and styled with Tailwind CSS for a modern and responsive user interface. The backend is developed with FastAPI, and we use SQLite to store video metadata. For video storage, we rely on AWS S3.
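As a sketch of the metadata layer, an SQLAlchemy model backed by SQLite might look like the following (the model and field names are illustrative, not our exact schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Video(Base):
    """Metadata for one uploaded video; the media itself lives in S3."""
    __tablename__ = "videos"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    s3_key = Column(String, nullable=False)      # object key in the S3 bucket
    source_language = Column(String)             # e.g. "en"
    dub_language = Column(String)                # e.g. "es"
    status = Column(String, default="uploaded")  # uploaded -> processing -> dubbed

# In-memory SQLite for this sketch; the app itself uses a file-backed database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```

The video file itself never touches this table; only its S3 object key is stored, so the database stays small while S3 handles the heavy media payloads.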

Our audio dubbing pipeline is organized into several steps:

  1. Audio and Video Separation: Using FFmpeg, the audio track is extracted from the video.

  2. Vocal Separation: Using Demucs, the audio is split into vocal and non-vocal stems, which is crucial for creating a clean voice clone and for reassembling the dubbed audio.

  3. Transcription and Timestamping: OpenAI's API transcribes speech and provides word timestamps, essential for subsequent translation and voice cloning.

  4. Translation: The transcribed text is translated into the target language using OpenAI's API, preparing it for voice cloning.

  5. Voice Cloning: PlayHT's API clones the speaker's voice using the isolated vocal track, ensuring authenticity in the dubbed audio.

  6. Voice Dubbing: Using translated transcripts and cloned voice, PlayHT generates sentence-level voice dubs, maintaining narrative coherence.

  7. Timing Adjustment: Each dubbed sentence is time-stretched to match the original's timing without altering pitch, keeping it synchronized with the video.

  8. Assembly: Dubbed sentences are stitched together to form the complete dubbed audio, mixed with non-vocal audio to produce the final dubbed video soundtrack.

  9. Final Output: The dubbed audio track is remuxed with the video stream, producing the completed dubbed video ready for distribution.
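The timing adjustment in step 7 can be done with FFmpeg's `atempo` filter, which changes speed without changing pitch but only accepts rates between 0.5 and 2.0 per pass, so larger ratios must be chained. A minimal sketch of a helper that builds such a chain (the function name is ours, not from our codebase):

```python
def atempo_chain(dub_duration: float, target_duration: float) -> str:
    """Build an FFmpeg atempo filter chain that stretches or compresses a
    dubbed clip of dub_duration seconds to target_duration seconds without
    altering pitch. atempo only accepts rates in [0.5, 2.0] per stage, so
    the overall rate is split across as many stages as needed."""
    if dub_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    rate = dub_duration / target_duration  # >1 speeds up, <1 slows down
    stages = []
    while rate > 2.0:                      # peel off maximum speed-up stages
        stages.append(2.0)
        rate /= 2.0
    while rate < 0.5:                      # peel off maximum slow-down stages
        stages.append(0.5)
        rate /= 0.5
    stages.append(rate)                    # final stage covers the remainder
    return ",".join(f"atempo={s:.6f}" for s in stages)
```

For example, fitting a 6-second dub into a 2-second slot requires an overall rate of 3.0, which the helper splits into `atempo=2.000000,atempo=1.500000`; that string can then be passed to FFmpeg's `-filter:a` option for the clip.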

APIs used

  1. OpenAI Whisper: For transcribing audio
  2. OpenAI GPT-4o: For translating transcripts
  3. PlayHT: For cloning speaker's voice and creating dubbed speech
  4. AWS: For storing dubbed video files

Libraries Used

openai, python-dotenv, demucs, numpy<2.0.0, fastapi, boto3, sqlalchemy, pydantic, requests, ffmpeg, ffmpeg-python, pydub, pyht

Challenges we ran into

Our project involved navigating a diverse range of technologies, often encountering unfamiliar AI APIs and configuring AWS for backend operations. Achieving seamless integration between these components was a significant hurdle. One of our most demanding tasks was ensuring flawless synchronization between dubbed audio and video while preserving the natural cadence of speech. This required developing high-fidelity voice clones and meticulously aligning translated audio with original speech timestamps. Additionally, we prioritized adherence to best coding practices, anticipating seamless integration with platforms like TikTok.

Accomplishments that we're proud of

We are proud that we seamlessly integrated multiple technologies into a production-ready dubbing service that creators can take advantage of, despite having little prior experience with several of the technologies we used.

What we learned

Throughout this project, we dove deep into audio processing and speech synthesis. We encountered significant challenges in handling non-Latin-script speech and keeping dubbed audio in sync with the video. Our key takeaway was learning to enforce quality control over unpredictable ML and AI model outputs. Moreover, integrating a diverse array of APIs and components into one application gave us invaluable experience building robust multimedia and AI solutions.

What's next for DubTok

The next steps for DubTok include adding support for dubbing the voices of different speakers in a video. Each unique speaker will have their own distinct dubbed voice, making the dubbing more realistic and tailored.

In addition to dubbing, we plan to incorporate lip-syncing technology into DubTok. This will ensure that the dubbed audio matches the lip movements of the speakers in the video, making the content look more natural and immersive, allowing viewers to have a more engaging experience, as it will appear as though the person is truly speaking the dubbed words.

Furthermore, we aim to include the ability to automatically generate subtitles based on the audio. This feature will enhance accessibility and comprehension for viewers, and support creators in creating better videos.

Lastly, we aim to package our solution into a plugin that creators on social media platforms can use seamlessly to boost their productivity.
