Twitter sentiment analysis

Project 2 of the EPFL Machine Learning course: Twitter sentiment analysis

Authors: Rayan Daod Nathoo, Yann Meier, Kopiga Rasiah

Deadline: 19.12.2019

The goal of this sentiment analysis project is to classify whether a tweet from a 2.5M-tweets dataset is positive or negative by considering its text only. Language used: Python

Getting started

To start, please clone this repository.

In order to run our project, you will need to install the following modules:

NumPy, TensorFlow, TextBlob, NLTK, Wordsegment, Autocorrect, Keras, scikit-learn

Install each module with pip install [module], replacing [module] with the corresponding package name.
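For convenience, all dependencies can be installed in one command (these are the PyPI package names; note that Sklearn is published as scikit-learn):

```shell
pip install numpy tensorflow textblob nltk wordsegment autocorrect keras scikit-learn
```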

Folder structure

Create a folder named data at the root of the project.

Inside data, create the folders preprocessed and glove.twitter.27B.

Again inside data, place the training and test sets retrieved from AICrowd.

In the preprocessed folder, create the empty folders neg, pos, test.

Download the file http://nlp.stanford.edu/data/glove.twitter.27B.zip, extract it, and place its contents inside the folder glove.twitter.27B. At the end, you should obtain the following folder structure:

├── Tweet-classification/
    ├── data/
        ├── glove.twitter.27B/
            ├── glove.twitter.27B.200d.txt      The file regrouping all the pre-trained embedding vectors used by our algorithm.
        ├── preprocessed/
            ├── pos/
            ├── neg/
            ├── test/
        ├── train_pos.txt                       The small training set of positive tweets.
        ├── train_pos_full.txt                  The full training set of positive tweets.
        ├── train_neg.txt                       The small training set of negative tweets.
        ├── train_neg_full.txt                  The full training set of negative tweets.
        ├── test_data.txt                       The test set of tweets on which to predict the labels with our algorithm.
        ├── sample_submission.csv               An example of a submission file to be submitted on AICrowd.
    ├── src/
        ├── embeddings/
            ├── __init__.py
            ├── cooc.py                         Generates a co-occurrence matrix from the words of our vocabulary.
            ├── glove_GD.py                     Implements a gradient-descent version of GloVe.
            ├── stanford_word_embedding.py      Creates a vocabulary based on Stanford's pre-trained Twitter GloVe vectors.
            ├── tf_idf.py                       Groups the functions we used for an alternative method.
            ├── tweet_embeddings.py             Creates tweet embeddings from the word embeddings.
            ├── vocab.py                        Creates a vocabulary from our corpus.
        ├── prediction/
            ├── __init__.py
            ├── better_predict.py               Groups all the implementations we tried for the training part.
            ├── predict.py                      Stores the two training algorithms we used in the end.
        ├── preprocessing/
            ├── __init__.py
            ├── dictionaries.py                 Groups the dictionaries we used during the preprocessing part.
            ├── preprocess.py                   Groups all the preprocessing algorithms we implemented.
        ├── __init__.py
        ├── params.py                           Groups all the parameters that control this project.
        ├── paths.py                            Groups all the file paths required by our algorithm.
        ├── run.py                              To be run after the above instructions to execute our pipeline.
    ├── .gitignore
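Assuming the commands are run from the repository root, the folder setup described above can be sketched as:

```shell
# Create the expected directory layout.
mkdir -p data/preprocessed/pos data/preprocessed/neg data/preprocessed/test
mkdir -p data/glove.twitter.27B

# Download and extract the Stanford GloVe Twitter vectors (the zip is large, ~1.4 GB).
wget http://nlp.stanford.edu/data/glove.twitter.27B.zip -P data/glove.twitter.27B
unzip data/glove.twitter.27B/glove.twitter.27B.zip -d data/glove.twitter.27B
```

The training and test set files from AICrowd still need to be placed in data/ by hand.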

Technical Overview

  1. Preprocessing:
  • Remove the tweets that appear in both the positive and negative tweet files
  • Normalize the spaces in the training set of tweets
  • Expand the English contractions contained in the training set
  • Separate words from hashtags
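Two of the steps above can be illustrated with a minimal sketch; the function and dictionary names here are hypothetical, and the project's full mappings live in src/preprocessing/dictionaries.py and preprocess.py:

```python
# Hypothetical mini-dictionary: the real contraction mapping is much larger.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "can not",
    "i'm": "i am",
}

def expand_contractions(tweet):
    """Replace known English contractions with their expanded forms."""
    return " ".join(CONTRACTIONS.get(w, w) for w in tweet.lower().split())

def drop_common_tweets(pos, neg):
    """Remove tweets that appear in both the positive and negative files."""
    common = set(pos) & set(neg)
    return ([t for t in pos if t not in common],
            [t for t in neg if t not in common])
```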
  2. Embeddings:
  3. Prediction:
  • We tried several models for prediction. The best one was a neural network with one hidden layer of 256 nodes (cf. the report for more information).
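A minimal Keras sketch of such a model, assuming tweet embeddings of a fixed dimension as input and a binary positive/negative output (the exact architecture and hyperparameters are in the report and in src/prediction/):

```python
from tensorflow import keras

def build_model(input_dim):
    """One hidden layer of 256 nodes, sigmoid output for binary sentiment."""
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```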

