Say goodbye to long and boring videos! 👋

Powered by cutting-edge deep learning technologies from 2022, ByteVid transforms long, boring videos into fun, byte-sized content.

Be it a one-hour lecture or a 30-minute Zoom meeting, ByteVid can transcribe and summarise the content, extract keywords, detect and extract important slides from the video, and translate the results into other languages.

Inspiration

When we first encountered the topic of ‘AI and Smart Nation’, we were extremely excited, as there were tons of areas we could explore: mobility, healthcare, media and entertainment, agriculture, social, sustainability, and more. We spent hours looking for something that intrigued us, but we were unsuccessful. How could we balance our skills and aspirations with the problem we wanted to work on?

The idea struck us when we went back to our roots as ‘students’. With Singapore’s push towards a Smart Nation, recorded online lectures and meetings are becoming increasingly prevalent. We struggled with them because it was often difficult to understand what the speaker was saying. Some speakers used a language that was not their first language. Some spoke with accents unfamiliar to us. Others did not even provide slides for us students to follow the lecture with. We struggled even more given that the average audience attention span is said to be about 7 minutes, and that the current Zoom transcription feature is not very accurate. We could not understand the video content properly or comfortably :(

Therefore, we were inspired and motivated to build a project that helps us and others extract information from videos efficiently!

What it does

Be it a one-hour lecture or a 30-minute Zoom meeting, ByteVid can perform:

  • Transcription
  • Key phrase extraction
  • Summarisation
  • Key slide detection and extraction
  • Translation

How we built it

Frontend

  • React.js
  • Tailwind CSS
  • Deployed on GitHub Pages

Backend

  • Flask server
  • Deployed on a GPU machine
  • Relayed through an Internet-facing VPS
  • Nginx reverse proxy
  • Cloudflare protection

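Deep Learning

  • Speech recognition [1]
  • Lecture slide detection with a YOLOv7 object detector [2]
  • Key phrase extraction [3]
  • BERT-based extractive summarisation [4]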

Tools

  • OpenCV
  • youtube-dl
  • ffmpeg
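
As noted under the challenges, lecture audio is sped up by 1.6x before it reaches the speech recognition model. The following is a minimal sketch of how that step might look with ffmpeg's `atempo` audio filter, not the project's actual command. The function names, file names, and the mono 16 kHz output settings are assumptions made for illustration. Because the transcript timestamps then refer to the sped-up audio, the sketch also shows how they could be mapped back to the original video timeline.

```python
from typing import List

SPEED = 1.6  # speed-up factor quoted in the write-up


def build_speedup_command(src: str, dst: str, speed: float = SPEED) -> List[str]:
    """Build an ffmpeg command that extracts the audio track and speeds it up.

    The mono / 16 kHz settings are an assumption (a common input format for
    speech models), not something stated in the write-up.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [
        "ffmpeg", "-y",                  # overwrite the output file if it exists
        "-i", src,                       # input video or audio file
        "-vn",                           # drop the video stream
        "-filter:a", f"atempo={speed}",  # change tempo without changing pitch
        "-ac", "1",                      # mono (assumed)
        "-ar", "16000",                  # 16 kHz sample rate (assumed)
        dst,
    ]


def to_original_time(t_fast: float, speed: float = SPEED) -> float:
    """Map a timestamp (seconds) in the sped-up audio back to the original video."""
    return t_fast * speed


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run(cmd, check=True) to execute.
    cmd = build_speedup_command("lecture.mp4", "lecture_fast.wav")
    print(" ".join(cmd))
```

With this approach, a phrase transcribed at 60 s in the sped-up audio corresponds to roughly 96 s in the original video.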

Highlights

To solve the lecture slide detection problem, we manually labelled a diverse dataset of 200 lecture videos, ranging from computer science lectures to business seminars to Zoom meetings. We then trained our own lecture slide detection model on our GPU server.

Challenges we ran into

  • We found no ready-made solution for lecture slide detection - we manually labelled hundreds of videos and images and trained our own lecture slide detection model
  • Our GPU machine has no Internet access - we set up a relay server with autossh port forwarding
  • The ffmpeg commands are complicated - when we finally succeeded in demystifying them, we felt a sense of achievement
  • The speech recognition model is relatively slow - we noticed that professors usually speak slowly, so we cut processing time by speeding up lecture videos by 1.6x before passing them into the speech recognition model
  • We used up our Baidu translation API free quota during testing - we paid S$10 to buy extra quota
  • The Baidu translation API has a rate limit - we split each paragraph into chunks of sentences and send requests at a moderate rate
  • There is no simple method to split paragraphs into sentences (e.g. 3.14 will become two sentences when split by periods) - we use the BlingFire library to solve this problem
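
The last two points can be illustrated with a small sketch: sentences are packed into chunks, and the chunks are sent to the translation service with a pause between requests. This is an illustration under stated assumptions, not our exact code. The names `chunk_sentences` and `translate_throttled`, the `max_chars` limit, and the one-second interval are hypothetical, and the `translate` callable stands in for whatever client wraps the translation API. The input is assumed to be already split into sentences, for instance with BlingFire's `text_to_sentences`.

```python
import time
from typing import Callable, Iterable, List


def chunk_sentences(sentences: Iterable[str], max_chars: int = 1000) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_throttled(
    chunks: Iterable[str],
    translate: Callable[[str], str],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Translate chunks one by one, pausing min_interval seconds between requests."""
    results: List[str] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            sleep(min_interval)      # simple fixed delay to stay under the rate limit
        results.append(translate(chunk))
    return results


if __name__ == "__main__":
    sentences = ["Pi is about 3.14.", "It is irrational.", "Euler liked it."]
    chunks = chunk_sentences(sentences, max_chars=36)
    # str.upper is a stand-in for a real translation call.
    print(translate_throttled(chunks, str.upper, min_interval=0.1))
```

Packing by whole sentences keeps each request within a size limit without breaking a sentence such as "Pi is about 3.14." in the middle, which would hurt translation quality.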

Accomplishments that we’re proud of

  • Building and deploying a fully functional AI product in less than 2 days
  • Combining three exciting fields of AI in one product: computer vision, natural language processing and speech processing
  • Building our own lecture slide dataset and CV model

What we learned

  • Deploying deep learning models on a cloud server
  • Speech knowledge for speech recognition and video transcription
  • NLP knowledge for machine translation, summarisation and keyword extraction
  • CV knowledge for object detection and lecture slide extraction
  • Developing a Chrome extension

What’s next for ByteVid

  • Auto-navigation to certain timestamps in videos based on keywords
  • Support video URLs from sites other than YouTube
  • Implement a Telegram bot
  • Implement a mobile application

References

[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022.
[2] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv preprint arXiv:2207.02696, 2022.
[3] M. Kulkarni, D. Mahata, R. Arora, and R. Bhowmik, “Learning rich representation of keyphrases from text,” arXiv preprint arXiv:2112.08547, 2021.
[4] D. Miller, “Leveraging BERT for extractive text summarization on lectures,” arXiv preprint arXiv:1906.04165, 2019.
