Inspiration

The Wikipedia dataset is fundamentally about linguistic analysis, so we wanted to apply the methods and models used in NLP to the prediction task.

What it does

First, it analyzes the part-of-speech combination of every insertion phrase using nltk and stores those combinations in a list. Then, using jieba and gensim, it computes the similarity between a new input Wikipedia sentence and each base sentence, and selects the highest-scoring match. It looks up the base sentence corresponding to that similarity score and finds its insertion phrase. From that insertion phrase, we predict the most likely part-of-speech combination for the insertion into the new input sentence.

How I built it

We select the first 100,000 rows of the dataset and split its three columns into three lists: all base sentences, all insertion phrases, and all edited sentences. We then use nltk to tag every element of the insertion-phrase list and store each phrase's part-of-speech combination in a new list. Next we build a similarity model, using gensim for the similarity computation and jieba for word segmentation. We ask the user to input a new Wikipedia base sentence, then segment each sentence in the original base-sentence list, along with the input sentence, into words and build a bag-of-words dictionary. The bag of words records an index for each unique word and the frequency with which it appears across the whole file. We compute each word's tf-idf value to measure how specific the word is to a particular Wikipedia piece. Using doc2bow, we convert each bag of words into a vector, weight it by tf-idf, and compute the similarity between the input and every sentence in the base-sentence list. We take the highest similarity and find the corresponding base sentence in the original list. From that base sentence, we look up the associated insertion phrase and its part-of-speech combination, and we take that combination as the most likely part-of-speech combination for the insertion into our input sentence.

Challenges I ran into

The huge size of our dataset made the similarity computation very expensive, and it was not obvious how to connect sentence similarity to a part-of-speech prediction.

Accomplishments that I'm proud of

I think our project is a genuinely creative way to predict the potential part of speech of an insertion. Using a similarity matrix for prediction was, I think, a bold choice.

What I learned

How to use gensim and nltk, and the mathematical background behind similarity calculation.

What's next for InsertionPOSPrediction

Make the model more accurate with more data, and predict the possible location and content of the insertion.
