ByteVid

Website and extension
Deep learning technologies
Speech recognition + NLP processing
Slide detection with computer vision

Say goodbye to long and boring videos! 👋

Powered by cutting-edge deep learning technologies from 2022, ByteVid transforms long, boring videos into fun, byte-sized content.

Be it a one-hour lecture or a 30-minute Zoom meeting, ByteVid can transcribe and summarise the content, extract keywords, detect and extract important slides from the video, and translate the results into other languages.

Inspiration

When we first encountered the topic of ‘AI and Smart Nation’, we were extremely excited, as there were tons of areas that we could explore: mobility, healthcare, media and entertainment, agriculture, social, sustainability, and more. We spent hours and hours looking for something that intrigued us, but we were unsuccessful. How could we balance our skills and aspirations with the problem we wanted to work on?

The idea struck us when we went back to our roots as ‘students’. With Singapore’s efforts in promoting Smart Nation, recorded online lectures and meetings are becoming increasingly prevalent. We struggled with them because it was often difficult to understand what the speaker was saying. Some spoke in non-native languages. Some spoke with unfamiliar accents. Some did not even provide slides for us students to follow the lecture with. We struggled even more given that the average audience attention span is about 7 minutes, and that the current Zoom transcription feature is not very accurate. We could not understand the video content properly or comfortably :(

Therefore, we were inspired and motivated to build a project that would help us and others extract information from videos efficiently!

What it does

Be it a one-hour lecture or a 30-minute Zoom meeting, ByteVid can perform:

Transcription
Key phrase extraction
Summarisation
Key slides detection and extraction
Translation

How we built it

Frontend

React.js
Tailwind CSS
Deployed on GitHub Pages

Backend

Flask server
Deployed on a GPU machine
Relayed to an Internet-facing VPS
Nginx reverse proxy
Cloudflare protection

Deep Learning

Whisper: SOTA speech recognition (Sep 2022)
YOLOv7: SOTA object detection (Jul 2022)
KBIR-inspec: key phrase extraction (Dec 2021)
BERT Extractive Summarizer: summarisation (Jun 2019)
BlingFire: sentence extraction
Baidu Translate API: translation

Tools

OpenCV
youtube-dl
ffmpeg
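
As one illustration of where ffmpeg fits in, the 1.6x speed-up applied before speech recognition (described under Challenges below) could be expressed with ffmpeg's `atempo` audio filter. The sketch below only builds the command line; the function name, file names and the choice to drop the video stream are assumptions for illustration, not the project's actual commands.

```python
import shlex
from typing import List


def build_speedup_command(src: str, dst: str, speed: float = 1.6) -> List[str]:
    """Build an ffmpeg command that writes the audio of `src`, sped up, to `dst`."""
    # Older ffmpeg builds limit a single atempo filter to the range 0.5-2.0,
    # so this sketch conservatively stays inside that range.
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return [
        "ffmpeg",
        "-y",                            # overwrite the output file if it exists
        "-i", src,                       # input video
        "-vn",                           # drop video; only audio is transcribed
        "-filter:a", f"atempo={speed}",  # change tempo while preserving pitch
        dst,
    ]


if __name__ == "__main__":
    cmd = build_speedup_command("lecture.mp4", "lecture_fast.wav")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

If the transcript carries timestamps, they would refer to the sped-up audio, so multiplying them by the speed factor maps them back to the original video.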

Highlights

To solve the lecture slides detection problem, we manually labelled a diverse dataset of 200 lecture videos, ranging from computer science lectures to business seminars to Zoom meetings. We then successfully trained our own lecture slides detection model on our GPU server.

Challenges we ran into

There was no existing solution for lecture slides detection - we manually labelled hundreds of videos and images and successfully trained our own lecture slides detection model
Our GPU machine has no Internet access - we set up a relay server with autossh port forwarding
The ffmpeg commands are complicated - when we finally succeeded in demystifying them, we felt a sense of achievement
The speech recognition model is relatively slow - we noticed that professors usually speak slowly, so we reduced processing time by speeding up lecture videos by 1.6x before passing them into the speech recognition model
We used up our Baidu translation API free quota during testing - we paid S$10 to buy extra quota
The Baidu translation API has a rate limit - we split each paragraph into chunks of sentences and send requests at a moderate rate
There is no simple method to split paragraphs into sentences (e.g. 3.14 will become two sentences when split by periods) - we used the BlingFire model to solve this problem
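
The chunking approach for the rate-limited translation API can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the sentences have already been split (for example by BlingFire), and the names `chunk_sentences`, `translate_all`, `max_chars` and `delay_s`, as well as the `translate` callable standing in for the real API call, are all hypothetical.

```python
import time
from typing import Callable, Iterable, List


def chunk_sentences(sentences: Iterable[str], max_chars: int = 1000) -> List[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_all(
    chunks: List[str],
    translate: Callable[[str], str],  # hypothetical wrapper around the API
    delay_s: float = 1.0,             # assumed pause; tune to the real limit
) -> List[str]:
    """Translate chunks one by one, pausing between requests."""
    results: List[str] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            time.sleep(delay_s)
        results.append(translate(chunk))
    return results


if __name__ == "__main__":
    sentences = ["Pi is about 3.14.", "It is irrational.", "It appears in many formulas."]
    for chunk in chunk_sentences(sentences, max_chars=40):
        print(chunk)
```

Chunking on sentence boundaries keeps each request under the size limit without cutting a sentence in half, which would likely hurt translation quality.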

Accomplishments that we’re proud of

Building and deploying a fully functional AI product in less than 2 days
Our product combines three exciting fields of AI: computer vision, natural language processing and speech processing
We built our own lecture slides dataset and CV model that is better than existing solutions

What we learned

Deploying deep learning models on a cloud server
Speech knowledge for speech recognition and video transcription
NLP knowledge for machine translation, summarisation and keyword extraction
CV knowledge for object detection and lecture slide extraction
Developing a Chrome extension

What’s next for ByteVid

Auto-navigation to certain timestamps in videos based on keywords
Add support for video URLs other than YouTube
Implement a Telegram bot
Implement a mobile application

References

[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” 2022.
[2] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
[3] M. Kulkarni, D. Mahata, R. Arora, and R. Bhowmik, “Learning rich representation of keyphrases from text,” arXiv preprint arXiv:2112.08547, 2021.
[4] D. Miller, “Leveraging BERT for extractive text summarization on lectures,” arXiv preprint arXiv:1906.04165, 2019.