Inspiration
I was inspired by the current political climate and the polarizing effect of social media on political opinions. Over roughly the last ten years, Russia’s Internet Research Agency (IRA) wrote "troll" tweets and posted them on Twitter to influence the political agenda of the United States. I thought it would be interesting to see whether an ML model could classify these tweets.
What it does
My classifier takes as input a string of text meant to mimic a tweet on Twitter and attempts to classify it into one of six categories: right troll, left troll, newsfeed, hashtag gamer, fearmonger, and nontroll. It also returns the percent certainty of the classification.
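As a rough illustration of how a label and a percent certainty can come out of one prediction, here is a minimal sketch. It assumes the model produces one raw score per category and that a softmax turns those scores into percentages; the writeup does not say how the certainty is actually computed, and `top_category` and the order of `CATEGORIES` are hypothetical.

```python
import math

# The six categories named above; this ordering is an assumption.
CATEGORIES = [
    "right troll", "left troll", "newsfeed",
    "hashtag gamer", "fearmonger", "nontroll",
]


def top_category(scores):
    """Return (label, percent certainty) for one raw score per category."""
    if len(scores) != len(CATEGORIES):
        raise ValueError("expected one score per category")
    # Softmax, shifted by the largest score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most likely category.
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], 100.0 * probs[best]


label, certainty = top_category([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
print(f"{label}: {certainty:.1f}% certain")
```

With equal scores for every category the certainty falls to about 16.7%, which is what a pure guess among six categories looks like.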
How I built it
I combined two datasets from Kaggle: one of over 2 million Russian IRA troll tweets and one of about 90,000 regular tweets from various celebrities, politicians, and companies. I wrote my own cleaning function to preserve tweet-specific characters, such as hashtags (#) and user mentions (@), and created my own list of "stop words" to exclude. I trained a classifier with stochastic gradient descent (SGD) on the training data and achieved an accuracy of about 80%. I also wrote a function that runs the classifier on any input text and outputs the classification and the certainty.
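The cleaning step described above might look something like the sketch below. It is not the project's actual function: `clean_tweet`, the regular expressions, the choice to drop URLs, and the short `STOP_WORDS` set are all assumptions made for illustration. The point it shows is that a leading `#` or `@` stays attached to its word instead of being stripped as punctuation.

```python
import re

# Hypothetical stop-word list; the project's real list is not shown.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "rt"}

# Assumption: links carry little signal, so they are dropped.
URL_RE = re.compile(r"https?://\S+")
# A token is a word, optionally prefixed by '#' (hashtag) or '@' (mention).
TOKEN_RE = re.compile(r"[#@]?\w+")


def clean_tweet(text, stop_words=STOP_WORDS):
    """Lowercase a tweet, drop URLs and stop words, keep #tags and @mentions."""
    text = URL_RE.sub(" ", text.lower())
    tokens = TOKEN_RE.findall(text)
    return [token for token in tokens if token not in stop_words]


print(clean_tweet("RT @user: The #MAGA rally is HUGE! https://t.co/abc"))
```

A function shaped like this, one that takes a string and returns a list of tokens, can be passed as the `tokenizer` argument of scikit-learn's `CountVectorizer` or `TfidfVectorizer`, which is one way to connect custom cleaning to a vectorizer.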
Challenges I ran into
A major challenge I overcame was the cleaning function. Many machine learning libraries come with pre-built text cleaning and processing functions, but I did not want to use one because I was afraid it would strip out the special characters in tweets that could be important. Writing this function and fitting it together with a vectorizer was very difficult, but I managed to make it work. The only remaining issue is that it is relatively slow to run because of the large amount of data. I also tried a deep learning model instead, but it took many hours to train on this much data and was unable to achieve over 30% accuracy.
What I learned
I learned a lot about data manipulation and cleaning, especially about the different data types and formats that each step of the machine learning process requires and how to make sure the data flows smoothly from one step into the next. I also learned a lot about web apps. I had never made one before, and I am proud that I was able to make one work.
What's next for Troll Tweet Classifier
In the future I hope to get more nontroll data and see if I can achieve better results. Currently, the nontroll data is only about 4% of the total, so the classifier rarely chooses nontroll as the most likely category. I would also like to use a deep learning model such as a recurrent neural network to increase the accuracy and possibly even do some text generation.
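Collecting more nontroll data is the plan stated above. A different, commonly used way to soften this kind of imbalance is to weight rare classes more heavily during training. The sketch below is not part of the project; it only illustrates the idea. `balanced_class_weights` is a hypothetical helper, and the 96/4 split is made-up example data that mirrors the roughly 4% figure. It applies the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up labels: 4% nontroll, collapsed to two classes for brevity.
labels = ["troll"] * 96 + ["nontroll"] * 4
weights = balanced_class_weights(labels)
print(weights)
```

In this example each nontroll tweet counts 24 times as much as a troll tweet, so mistakes on the rare class cost the model more during training.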
Built With
- colab
- nltk
- scikit-learn