This project trains logistic regression models for hate speech detection on OLID (Offensive Language Identification Dataset), the dataset for SemEval 2019 Task 6. Task A is to determine whether a tweet is offensive. Task B is to determine whether an offensive tweet is targeted. Data and links to the task information and paper are available here.
Implements utility functions for loading the Task A and Task B data.
Example usage:
python util.py olid-training-v1.tsv
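As a rough illustration of what loading the two tasks might involve, the sketch below reads a tab-separated file with the standard csv module. It assumes the file has a header row with tweet, subtask_a, and subtask_b columns, and that rows with no Task B label are marked NULL. The function name load_task and the sample rows are hypothetical and are not taken from util.py.

```python
import csv
import io


def load_task(handle, task="a"):
    """Return (tweets, labels) for task 'a' or 'b' from an open TSV file.

    Assumed layout: a header row containing `tweet`, `subtask_a`, `subtask_b`.
    """
    column = "subtask_" + task
    tweets, labels = [], []
    # QUOTE_NONE keeps stray quote characters inside tweets from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        label = row[column]
        if label == "NULL":  # assumed marker for "no label for this task"
            continue
        tweets.append(row["tweet"])
        labels.append(label)
    return tweets, labels


# Made-up sample rows, for illustration only.
sample = (
    "id\ttweet\tsubtask_a\tsubtask_b\n"
    "1\t@USER have a nice day\tNOT\tNULL\n"
    "2\t@USER you are awful\tOFF\tTIN\n"
)
tweets_a, labels_a = load_task(io.StringIO(sample), task="a")
tweets_b, labels_b = load_task(io.StringIO(sample), task="b")
print(labels_a, labels_b)
```

For a file on disk, the same function could be called with a handle from open(path, encoding="utf-8", newline="").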
Trains a logistic regression model with tf-idf vectors for Tasks A and B. Returns the following results:
- classification report (using sklearn)
- misclassified examples
- confusion matrix
- explainable results using the shap package
Example usage:
python logreg.py --train_file olid-training-v1.tsv
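The general shape of a tf-idf plus logistic regression baseline can be sketched with scikit-learn as follows. This is a minimal sketch, not the code in logreg.py: the toy tweets and labels are made up, the vectorizer and classifier settings are defaults chosen for illustration, and the shap explanation step is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

# Made-up toy data, for illustration only (OFF = offensive, NOT = not offensive).
train_texts = [
    "you are a terrible idiot",
    "what an idiot move",
    "have a lovely day",
    "thanks for the lovely gift",
]
train_labels = ["OFF", "OFF", "NOT", "NOT"]
test_texts = ["such an idiot", "lovely weather today"]
test_labels = ["OFF", "NOT"]

# tf-idf features feed directly into the logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)

print(classification_report(test_labels, predicted, zero_division=0))
matrix = confusion_matrix(test_labels, predicted, labels=["NOT", "OFF"])
print(matrix)

# Collect (tweet, gold label, predicted label) for every wrong prediction.
misclassified = [
    (text, gold, pred)
    for text, gold, pred in zip(test_texts, test_labels, predicted)
    if gold != pred
]
print(misclassified)
```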
Creates a FeatureVectorizer class that adds sentiment, subjectivity, profanity, and user name features to the feature function, then trains a logistic regression model and evaluates it on the following feature combinations:
- base_tfidf + sentiment feature (vaderSentiment package)
- base_tfidf + subjectivity feature (textblob package)
- base_tfidf + profanity feature (profanity-check package)
- base_tfidf + @user feature (percentage of @USER in a tweet)
Example usage:
python feature_combination.py --train_file olid-training-v1.tsv
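Of the added features, the @user feature is the simplest to illustrate. The sketch below computes the share of whitespace-separated tokens in a tweet that are exactly @USER. The function name user_mention_ratio and the tokenization rule are assumptions for illustration; the FeatureVectorizer class may compute the percentage differently.

```python
def user_mention_ratio(tweet, placeholder="@USER"):
    """Fraction of whitespace-separated tokens equal to the placeholder.

    Hypothetical helper: tokens with attached punctuation (e.g. "@USER,")
    are not counted, which a real implementation might handle differently.
    """
    tokens = tweet.split()
    if not tokens:
        return 0.0  # avoid dividing by zero on empty tweets
    mentions = sum(token == placeholder for token in tokens)
    return mentions / len(tokens)


ratio = user_mention_ratio("@USER @USER hello there")
print(ratio)
```

A value like this could then be appended as one extra column next to the tf-idf vector for each tweet.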
See Final_Report.pdf