Project 2 of EPFL Machine Learning course: Twitter sentiment analysis
Authors: Rayan Daod Nathoo, Yann Meier, Kopiga Rasiah
Deadline: 19.12.2019
The goal of this sentiment analysis project is to classify a tweet as positive or negative by considering its text only. Language used: Python
To start, please clone this repository.
In order to run our project, you will need to install the following modules:
NumPy, TensorFlow, TextBlob, NLTK, Wordsegment, Autocorrect, Keras, scikit-learn
Please use the usual command pip install [module], replacing [module] with the corresponding module name.
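For instance, the whole set can be installed in one command (assuming the standard PyPI package names; note that scikit-learn is published on PyPI as scikit-learn, not sklearn):

```shell
# Assumed PyPI package names for the modules listed above
pip install numpy tensorflow textblob nltk wordsegment autocorrect keras scikit-learn
```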
Create a folder data at the root of the project.
Inside data, create the folders preprocessed and glove.twitter.27B.
Still inside data, place the training and test sets retrieved from AICrowd.
In the preprocessed folder, create the empty folders neg, pos, test.
Download the archive http://nlp.stanford.edu/data/glove.twitter.27B.zip and unzip it inside the folder glove.twitter.27B. In the end, you should obtain the following folder structure:
Tweet-classification/
├── data/
│   ├── glove.twitter.27B/
│   │   └── glove.twitter.27B.200d.txt  All the pre-trained embedding vectors we used for our algorithm.
│   ├── preprocessed/
│   │   ├── pos/
│   │   ├── neg/
│   │   └── test/
│   ├── train_pos.txt  The small training set of positive tweets.
│   ├── train_pos_full.txt  The full training set of positive tweets.
│   ├── train_neg.txt  The small training set of negative tweets.
│   ├── train_neg_full.txt  The full training set of negative tweets.
│   ├── test_data.txt  The test set of tweets on which to predict the labels with our algorithm.
│   └── sample_submission.csv  An example of a submission file for AICrowd.
├── src/
│   ├── embeddings/
│   │   ├── __init__.py
│   │   ├── cooc.py  Generates a co-occurrence matrix from the words of our vocabulary.
│   │   ├── glove_GD.py  Implements a gradient-descent version of GloVe.
│   │   ├── stanford_word_embedding.py  Creates a vocabulary based on the pre-trained Stanford GloVe Twitter vectors.
│   │   ├── tf_idf.py  Groups some functions we used for an alternative method.
│   │   ├── tweet_embeddings.py  Creates tweet embeddings from the word embeddings.
│   │   └── vocab.py  Creates a vocabulary from our corpus.
│   ├── prediction/
│   │   ├── __init__.py
│   │   ├── better_predict.py  Groups all the training implementations we tried.
│   │   └── predict.py  Stores the two training algorithms we used in the end.
│   ├── preprocessing/
│   │   ├── __init__.py
│   │   ├── dictionaries.py  Groups the dictionaries we used during preprocessing.
│   │   └── preprocess.py  Groups all the preprocessing algorithms we implemented.
│   ├── __init__.py
│   ├── params.py  Groups all the parameters that control this project.
│   ├── paths.py  Groups all the file paths required by our algorithm.
│   └── run.py  To be run after the above instructions to execute our pipeline.
└── .gitignore
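The setup steps above can be sketched as shell commands, run from the project root (the download step fetches roughly 1.4 GB):

```shell
# Create the data folder layout described above
mkdir -p data/glove.twitter.27B
mkdir -p data/preprocessed/pos data/preprocessed/neg data/preprocessed/test
# Copy the AICrowd files (train_pos*.txt, train_neg*.txt, test_data.txt) into data/,
# then download and unpack the Stanford GloVe vectors (~1.4 GB)
wget http://nlp.stanford.edu/data/glove.twitter.27B.zip -P data/glove.twitter.27B
unzip data/glove.twitter.27B/glove.twitter.27B.zip -d data/glove.twitter.27B
```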
- Preprocessing:
- Remove the tweets that appear in both the positive and negative tweet files
- Normalize the spacing in the training tweets
- Expand the English contractions contained in the training set
- Split hashtags into their component words
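A minimal sketch of these steps. The contraction dictionary here is a hypothetical stand-in for the full one in src/preprocessing/dictionaries.py, and the hashtag handling is simplified (the project itself uses the Wordsegment module to split concatenated hashtag words, e.g. "#goodday" into "good day"):

```python
import re

# Hypothetical mini contraction dictionary; the real one lives in
# src/preprocessing/dictionaries.py
CONTRACTIONS = {"can't": "can not", "i'm": "i am", "won't": "will not"}

def preprocess(tweet):
    # normalize spacing
    tweet = re.sub(r"\s+", " ", tweet.strip())
    # expand contractions
    for contraction, expansion in CONTRACTIONS.items():
        tweet = tweet.replace(contraction, expansion)
    # simplified hashtag handling: drop the '#' and keep the word as-is
    tweet = re.sub(r"#(\w+)", r"\1", tweet)
    return tweet

def deduplicate(pos_tweets, neg_tweets):
    # remove tweets present in both the positive and negative files
    common = set(pos_tweets) & set(neg_tweets)
    return ([t for t in pos_tweets if t not in common],
            [t for t in neg_tweets if t not in common])
```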
- Embeddings:
- The word embeddings were taken from https://nlp.stanford.edu/projects/glove/.
- Each tweet embedding is the sum of the word embeddings of the words the tweet contains.
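The summing scheme can be sketched as follows, assuming word_vectors is a dict mapping words to 200-dimensional GloVe vectors as loaded from glove.twitter.27B.200d.txt:

```python
import numpy as np

def tweet_embedding(tweet, word_vectors, dim=200):
    # sum the embeddings of the tweet's words; words outside
    # the vocabulary are skipped
    vec = np.zeros(dim)
    for word in tweet.split():
        if word in word_vectors:
            vec += word_vectors[word]
    return vec
```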
- Prediction:
- We tried several prediction models. The best one was a neural network with a single hidden layer of 256 nodes (cf. the report for more information).