highlight

highlighted transcript of Churchill's famous speech
summary, and the clustering result for the same transcript

Inspiration

What it does

This is an extractive summarisation task. Using a neural network (BERT) and a clustering algorithm (k-means), it highlights the important sentences in the transcript of a given YouTube video (which could be a lecture). It can also cluster sentences by Euclidean distance. That is, it can show how sentences group together by similarity in an unsupervised fashion.

How I built it

extract a transcript from a YouTube video using youtube_data_v3_api
pre-process the transcript
feed the pre-processed transcript to the neural network (Google's BERT), which translates each sentence in the transcript into a numerical vector.
cluster the vectors (one per sentence) with k-means, using the Euclidean distance between them
pick the sentence that lies at the center of each cluster, and highlight it!
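
The last two steps can be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper name pick_highlights, the toy 2-D vectors (standing in for real BERT sentence embeddings) and the choice of two clusters are all assumptions made for the example. It uses scikit-learn's KMeans, which clusters by Euclidean distance, and then picks the sentence whose vector is nearest to each cluster centre.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def pick_highlights(sentences, embeddings, n_clusters):
    """Cluster sentence vectors and return (labels, highlighted sentences).

    Hypothetical helper: `embeddings` is assumed to hold one vector per
    sentence (e.g. produced by BERT), in the same order as `sentences`.
    """
    embeddings = np.asarray(embeddings, dtype=float)

    # k-means groups the vectors by Euclidean distance to the cluster centres.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    # For each cluster centre, find the index of the nearest sentence vector.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)

    # Return the highlights in transcript order, without duplicates.
    highlight_idx = sorted(set(int(i) for i in closest))
    return labels, [sentences[i] for i in highlight_idx]


# Toy data: six sentences with made-up 2-D vectors (real BERT vectors
# have hundreds of dimensions). Two well-separated groups of three.
sentences = [
    "Topic A, first sentence.",
    "Topic A, second sentence.",
    "Topic A, third sentence.",
    "Topic B, first sentence.",
    "Topic B, second sentence.",
    "Topic B, third sentence.",
]
toy_embeddings = [
    [0.0, 0.0], [0.2, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.4], [5.0, 5.2],
]

labels, highlights = pick_highlights(sentences, toy_embeddings, n_clusters=2)
print(labels)
print(highlights)
```

In this toy run the highlighted sentences are the ones sitting in the middle of each group. With real embeddings, the number of clusters controls how many sentences get highlighted, so it would need to be chosen per transcript.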

Challenges I ran into

I had never used scikit-learn or a neural network before, so the learning curve was very steep.
Pre-processing the data was also very difficult.

Accomplishments that I'm proud of

Built an app that can "understand" text like humans do, which is fascinating!

What I learned

Coding alone is extremely hard.

What's next for highlight

interactive transcript
tangible performance measure
fine-tuning BERT for a specific domain
reduce the overhead of serving a very long video (e.g. a 50-minute lecture)

Built With

django
google-bert
google-cloud-youtube-v3-api
k-means
neuralnetwork
python
scikit
tensorflow

Updates

Eu-Bin Kim started this project — Jan 26, 2020 05:14 AM EST
