Project 2 of EPFL Machine Learning course: Twitter sentiment analysis
Authors: Rayan Daod Nathoo, Yann Meier, Kopiga Rasiah
Deadline: 19.12.2019
The goal of this sentiment analysis project is to classify a tweet as positive or negative by considering its text only. Language used: Python
To start, please clone this repository.
In order to run our project, you will need to install the following modules:
NumPy, TensorFlow, TextBlob, NLTK, Wordsegment, Autocorrect, Keras, scikit-learn
Please use the usual command pip install [module], replacing [module] with the corresponding module name.
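For instance, the whole set can be installed in one command (assuming the standard PyPI package names; note that scikit-learn is published on PyPI as scikit-learn, not sklearn):

```shell
# Assumed PyPI package names for the modules listed above
pip install numpy tensorflow textblob nltk wordsegment autocorrect keras scikit-learn
```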
Create a folder data at the root of the project.
Inside data, create the folders preprocessed and glove.twitter.27B.
Still inside data, place the training and test sets retrieved from AICrowd.
In the preprocessed folder, create the empty folders neg, pos, test.
Download the archive http://nlp.stanford.edu/data/glove.twitter.27B.zip and unzip it inside the folder glove.twitter.27B. In the end, you should obtain the following folder structure:
Tweet-classification/
├── data/
│   ├── glove.twitter.27B/
│   │   └── glove.twitter.27B.200d.txt  All the pre-trained embedding vectors we used for our algorithm.
│   ├── preprocessed/
│   │   ├── pos/
│   │   ├── neg/
│   │   └── test/
│   ├── train_pos.txt  The small training set of positive tweets.
│   ├── train_pos_full.txt  The full training set of positive tweets.
│   ├── train_neg.txt  The small training set of negative tweets.
│   ├── train_neg_full.txt  The full training set of negative tweets.
│   ├── test_data.txt  The test set of tweets on which to predict the labels with our algorithm.
│   └── sample_submission.csv  An example of a submission file for AICrowd.
├── src/
│   ├── embeddings/
│   │   ├── __init__.py
│   │   ├── cooc.py  Generates a co-occurrence matrix from the words of our vocabulary.
│   │   ├── glove_GD.py  Implements a gradient-descent version of GloVe.
│   │   ├── stanford_word_embedding.py  Creates a vocabulary based on the pre-trained Stanford GloVe Twitter vectors.
│   │   ├── tf_idf.py  Groups some functions we used for an alternative method.
│   │   ├── tweet_embeddings.py  Creates tweet embeddings from the word embeddings.
│   │   └── vocab.py  Creates a vocabulary from our corpus.
│   ├── prediction/
│   │   ├── __init__.py
│   │   ├── better_predict.py  Groups all the training implementations we tried.
│   │   └── predict.py  Stores the two training algorithms we used in the end.
│   ├── preprocessing/
│   │   ├── __init__.py
│   │   ├── dictionaries.py  Groups the dictionaries we used during preprocessing.
│   │   └── preprocess.py  Groups all the preprocessing algorithms we implemented.
│   ├── __init__.py
│   ├── params.py  Groups all the parameters that control this project.
│   ├── paths.py  Groups all the file paths required by our algorithm.
│   └── run.py  To be run after the above instructions to execute our pipeline.
└── .gitignore
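The setup steps above can be sketched as shell commands, run from the project root (the download step fetches roughly 1.4 GB):

```shell
# Create the data folder layout described above
mkdir -p data/glove.twitter.27B
mkdir -p data/preprocessed/pos data/preprocessed/neg data/preprocessed/test
# Copy the AICrowd files (train_pos*.txt, train_neg*.txt, test_data.txt) into data/,
# then download and unpack the Stanford GloVe vectors (~1.4 GB)
wget http://nlp.stanford.edu/data/glove.twitter.27B.zip -P data/glove.twitter.27B
unzip data/glove.twitter.27B/glove.twitter.27B.zip -d data/glove.twitter.27B
```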
- Preprocessing:
- Remove the tweets that appear in both the positive and negative tweet files
- Normalize the spacing in the training tweets
- Expand the English contractions contained in the training set
- Split hashtags into their component words
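A minimal sketch of these steps. The contraction dictionary here is a hypothetical stand-in for the full one in src/preprocessing/dictionaries.py, and the hashtag handling is simplified (the project itself uses the Wordsegment module to split concatenated hashtag words, e.g. "#goodday" into "good day"):

```python
import re

# Hypothetical mini contraction dictionary; the real one lives in
# src/preprocessing/dictionaries.py
CONTRACTIONS = {"can't": "can not", "i'm": "i am", "won't": "will not"}

def preprocess(tweet):
    # normalize spacing
    tweet = re.sub(r"\s+", " ", tweet.strip())
    # expand contractions
    for contraction, expansion in CONTRACTIONS.items():
        tweet = tweet.replace(contraction, expansion)
    # simplified hashtag handling: drop the '#' and keep the word as-is
    tweet = re.sub(r"#(\w+)", r"\1", tweet)
    return tweet

def deduplicate(pos_tweets, neg_tweets):
    # remove tweets present in both the positive and negative files
    common = set(pos_tweets) & set(neg_tweets)
    return ([t for t in pos_tweets if t not in common],
            [t for t in neg_tweets if t not in common])
```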
- Embeddings:
- The word embeddings were taken from https://nlp.stanford.edu/projects/glove/.
- Each tweet embedding is the sum of the word embeddings of the words the tweet contains.
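The summing scheme can be sketched as follows, assuming word_vectors is a dict mapping words to 200-dimensional GloVe vectors as loaded from glove.twitter.27B.200d.txt:

```python
import numpy as np

def tweet_embedding(tweet, word_vectors, dim=200):
    # sum the embeddings of the tweet's words; words outside
    # the vocabulary are skipped
    vec = np.zeros(dim)
    for word in tweet.split():
        if word in word_vectors:
            vec += word_vectors[word]
    return vec
```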
- Prediction:
- We tried several prediction models. The best one was a neural network with a single hidden layer of 256 nodes (cf. the report for more information).