A machine learning-powered sentiment analysis tool that classifies text as expressing positive or negative sentiment. This project won Second Place for accuracy in the Sigma Division at Ignition Hacks 2020.
The Sentiment Analyzer uses a logistic regression model with TF-IDF vectorization to predict sentiment polarity (positive = 1, negative = 0) in text data. The model achieved an F1 score of approximately 0.82.
Project Link: Devpost Submission
- Features
- Prerequisites
- Installation
- Project Structure
- Usage
- Data Cleaning
- Model Optimization
- Performance
- Authors
- License
- Acknowledgments
- ✅ Binary sentiment classification (positive/negative)
- ✅ Preprocessing pipeline with punctuation removal
- ✅ TF-IDF vectorization with bigrams and unigrams
- ✅ Optimized logistic regression model
- ✅ GridSearchCV for hyperparameter tuning
- ✅ F1 score of ~0.82
To run this project on your local machine, you'll need:
- Python 3.6+
- Required Python packages:
  - pandas - Data manipulation and analysis
  - numpy - Numerical computing
  - scikit-learn (sklearn) - Machine learning library
  - nltk - Natural Language Toolkit
  - joblib - Model serialization
- Clone the repository:

      git clone https://github.com/georgehtliu/sentiment-analyzer.git
      cd sentiment-analyzer

- Install required packages:

      pip install pandas numpy scikit-learn nltk joblib

- Download NLTK data (if needed):

      import nltk
      nltk.download('punkt')
      nltk.download('stopwords')

Alternatively, you may run this project on a cloud-hosted development environment such as Google Colab.
Sentiment-Analyzer/
├── README.md # This file
├── LICENSE.md # MIT License
├── training_data.csv # Training dataset
├── contestant_judgment.csv # Judgement/test dataset
├── predicted_labels.csv # Output predictions
├── submission_training.ipynb # Model training notebook
├── submission_createcsv.ipynb # Prediction generation notebook
└── submission_extras.ipynb # Additional experiments and analysis
Note: The trained model files (SentimentNewton_Log.pkl and Vectorizer.pkl) are generated when running the training notebook.
Run the training notebook to train and save the sentiment analysis model:

    jupyter notebook submission_training.ipynb

This notebook will:
- Load and preprocess the training data
- Train a logistic regression model with optimized hyperparameters
- Save the trained model as SentimentNewton_Log.pkl
- Save the TF-IDF vectorizer as Vectorizer.pkl
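
The training steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy sentences stand in for training_data.csv (whose column names are not documented here), and the `newton-cg` solver is only a guess based on the model file name. The files are written to a temporary directory so the sketch does not overwrite real artifacts.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for training_data.csv (hypothetical data, not the real dataset).
texts = [
    "i love this so much", "what a great day", "absolutely wonderful experience",
    "i hate this", "what a terrible day", "absolutely awful experience",
]
labels = [1, 1, 1, 0, 0, 0]  # positive = 1, negative = 0

# TF-IDF over unigrams and bigrams with English stopwords removed.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
features = vectorizer.fit_transform(texts)

# Assumption: solver guessed from the name "SentimentNewton_Log.pkl".
model = LogisticRegression(solver="newton-cg", max_iter=1000)
model.fit(features, labels)

# Persist both objects; the vectorizer must be saved alongside the model
# so that new text is transformed with the same vocabulary.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "SentimentNewton_Log.pkl")
vectorizer_path = os.path.join(out_dir, "Vectorizer.pkl")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)
```

The saved objects can later be restored with `joblib.load(model_path)` and `joblib.load(vectorizer_path)`.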
After training, use the prediction notebook to classify new text:

    jupyter notebook submission_createcsv.ipynb

This notebook will:
- Load the trained model and vectorizer
- Process the judgement dataset (contestant_judgment.csv)
- Generate predictions and save them to predicted_labels.csv
Note: Make sure contestant_judgment.csv is in the same directory, or update the file path in the notebook.
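
As a rough sketch of the prediction step, the helper below reads a CSV, vectorizes the text, and writes the predicted labels. The function name `predict_to_csv` and the column names `text` and `label` are hypothetical, since the real column layout of contestant_judgment.csv is not documented here. A tiny in-memory model stands in for the objects that would normally be loaded from SentimentNewton_Log.pkl and Vectorizer.pkl with `joblib.load`.

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def predict_to_csv(model, vectorizer, in_path, out_path, text_column="text"):
    """Classify every row of in_path and write the rows plus a label column."""
    frame = pd.read_csv(in_path)
    # transform (not fit_transform): reuse the vocabulary learned in training.
    features = vectorizer.transform(frame[text_column].astype(str))
    frame["label"] = model.predict(features)
    frame.to_csv(out_path, index=False)
    return frame


# Tiny stand-in for the trained model and vectorizer (hypothetical data).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
model = LogisticRegression().fit(
    vectorizer.fit_transform(["good great love", "bad awful hate"]), [1, 0]
)

# Work in a temporary directory so no real project files are touched.
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "contestant_judgment.csv")
out_path = os.path.join(work_dir, "predicted_labels.csv")
pd.DataFrame({"text": ["love it, great", "awful, hate it"]}).to_csv(in_path, index=False)

result = predict_to_csv(model, vectorizer, in_path, out_path)
```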
The preprocessing pipeline includes:
- Punctuation removal using regex patterns
- TF-IDF vectorization with the following features:
  - Removal of stopwords
  - Extraction of both unigrams and bigrams
- Tested lemmatization (results documented in submission_extras.ipynb)
The team experimented with several cleaning techniques and found that punctuation removal alone gave the best accuracy on this dataset.
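
For illustration, punctuation removal with a regex might look like the sketch below. The exact pattern used in the notebooks is not shown in this README, so the choice to strip every ASCII punctuation character, and the helper name `remove_punctuation`, are assumptions.

```python
import re
import string

# Assumed pattern: a character class matching any ASCII punctuation character.
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation, then collapse runs of whitespace."""
    no_punct = _PUNCT_RE.sub("", text)
    return " ".join(no_punct.split())


cleaned = remove_punctuation("Wow!!! This movie was great, wasn't it?")
print(cleaned)  # Wow This movie was great wasnt it
```

Note that deleting apostrophes merges contractions ("wasn't" becomes "wasnt"); replacing punctuation with a space instead is an alternative worth comparing.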
The model was developed through extensive experimentation:
- Tested multiple classifiers, including:
  - Neural Networks (MLPClassifier)
  - Support Vector Machines (SVC)
  - Logistic Regression (selected)
  - SGD Classifier
- Logistic Regression was chosen for optimal performance
- Used GridSearchCV to optimize logistic regression parameters
- Optimized TF-IDF vectorizer settings:
  - Tuned n-gram range (unigrams and bigrams)
  - Experimented with stopword removal
  - Tested different lemmatization approaches
Detailed exploration of alternative approaches and results can be found in submission_extras.ipynb.
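
A search of this kind can be sketched with a scikit-learn `Pipeline` wrapped in `GridSearchCV`. The parameter grid, the toy sentences, and the choice of `cv=2` are illustrative assumptions; the grid actually searched by the team is documented in the notebooks, not here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), four examples per class so 2-fold CV is possible.
texts = [
    "great movie loved it", "wonderful acting and great story",
    "loved the wonderful soundtrack", "great fun loved every minute",
    "terrible movie hated it", "awful acting and terrible story",
    "hated the awful soundtrack", "terrible bore hated every minute",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid: vectorizer and classifier settings are tuned together.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__stop_words": [None, "english"],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

Tuning the vectorizer inside the pipeline means each cross-validation fold refits TF-IDF on its own training split, which avoids leaking vocabulary statistics from the held-out fold.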
- F1 Score: ~0.82
- Task: Binary sentiment classification
- Model: Optimized Logistic Regression with TF-IDF features
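
For reference, the F1 score is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand; the function name `f1_score_binary` and the toy labels are invented for illustration and do not reproduce the ~0.82 result.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_score_binary(y_true, y_pred)  # tp=3, fp=1, fn=1
```

In practice, `sklearn.metrics.f1_score` computes the same quantity for binary labels.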
- David Chen
- George Liu
- David Wang
- Michael Yang
This project is licensed under the MIT License - see the LICENSE.md file for details.
- scikit-learn - Extensive use of machine learning tools and algorithms
- NLTK - Natural language processing capabilities
- Ignition Hacks 2020 - Competition organizers and judges