(NLP Pipeline with NLTK + Scikit-learn)
This project applies Natural Language Processing and Machine Learning to classify text statements into emotional categories like Anxiety, Depression, Relief, and more. It uses a combination of NLTK for preprocessing and scikit-learn for modeling and evaluation.
This project uses the dataset:
🔗 Sentiment Analysis for Mental Health – Kaggle
- Contains thousands of labeled statements collected from mental health-related sources.
- Each entry has a `statement` (text) and a `status` (emotion category).
- The dataset focuses on mental health sentiment and reflects a wide range of human emotions like:
  - Anxiety
  - Depression
  - Loneliness
  - Optimism
  - Gratitude
  - Relief
This makes the dataset ideal for building models that help understand emotional expression in real-world mental health contexts.
- Prepares and cleans text using NLTK tokenization and stopwords.
- Converts text into numerical features using CountVectorizer.
- Trains a Linear Support Vector Classifier (LinearSVC).
- Evaluates the model using accuracy, precision, recall, and F1-score.
- Uses RandomizedSearchCV to optimize the `C` hyperparameter.
- Outputs the best model and its performance metrics.
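
The steps listed above can be sketched end to end on a toy corpus. This is a minimal sketch, not the notebook's code: the example sentences, the `make_cleaner` helper, the offline fallback stopword set, the 75/25 split, and the candidate `C` values are all assumptions made for illustration. Only the use of NLTK tokenization and stopwords, `CountVectorizer`, `LinearSVC`, and `RandomizedSearchCV` comes from the description.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC


def make_cleaner():
    """Return a text-cleaning function (hypothetical helper, not from the notebook)."""
    try:
        from nltk.corpus import stopwords
        from nltk.tokenize import word_tokenize

        stop_words = set(stopwords.words("english"))  # needs the 'stopwords' data
        word_tokenize("probe")  # raises LookupError if the tokenizer data is missing
        tokenize = word_tokenize
    except (ImportError, LookupError):
        # Offline fallback so the sketch still runs; NOT equivalent to NLTK.
        stop_words = {"i", "am", "so", "and", "the", "a", "my", "to", "is", "of"}

        def tokenize(text):
            return re.findall(r"[a-z]+", text)

    def clean(text):
        # Lowercase, tokenize, then drop stopwords and non-alphabetic tokens.
        tokens = tokenize(text.lower())
        return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

    return clean


# Invented sentences standing in for the Kaggle statements.
anxious = [
    "I am so anxious and worried",
    "my heart is racing and I panic",
    "worried about everything all night",
    "I panic before every meeting",
    "so nervous and anxious today",
    "anxious thoughts keep me awake",
    "I am nervous and worried again",
    "panic and worry all day",
]
relieved = [
    "I feel calm and relieved",
    "such a relief the exam is over",
    "finally calm after the results",
    "relieved and happy it went well",
    "what a relief I can relax",
    "I can relax now so relieved",
    "calm and relaxed this evening",
    "happy and relieved at last",
]
texts = anxious + relieved
labels = ["Anxiety"] * len(anxious) + ["Relief"] * len(relieved)

clean = make_cleaner()
cleaned = [clean(t) for t in texts]

# Assumed split ratio; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Assumed candidate values for C; the original search space is not given.
c_candidates = [0.01, 0.1, 1, 10]
search = RandomizedSearchCV(
    LinearSVC(),
    param_distributions={"C": c_candidates},
    n_iter=len(c_candidates),
    cv=3,
    random_state=0,
)
search.fit(X_train_features, y_train)

y_pred = search.best_estimator_.predict(X_test_features)
print("best C:", search.best_params_["C"])
print("test accuracy:", accuracy_score(y_test, y_pred))
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary out of the model, which is why the test text goes through `transform` rather than `fit_transform`.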
- There are learnable patterns in the language used to describe emotional states.
- Even a simple model like `LinearSVC` achieves ~75% accuracy (74.5% on the test set), showing that:
  - Words and phrases strongly correlate with specific emotions.
  - Emotional text can be quantified and predicted with solid performance.
- The model generalizes well across multiple emotional labels, especially after tuning `C` to `0.1`.
In other words:
The language people use when expressing mental health concerns contains enough signal for a machine learning model to recognize and classify emotions with meaningful accuracy.
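
The main steps in the notebook are:

    import nltk
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
    from sklearn.model_selection import RandomizedSearchCV

    # Download the NLTK tokenizer models and stopword list
    nltk.download('punkt')
    nltk.download('stopwords')

    # Load the data; missing statements become empty strings
    data = pd.read_csv('Combined Data.csv')
    X = data['statement'].fillna("").astype(str)
    y = data['status']

    # Bag-of-words features (X_train and y_train come from a train/test split not shown here)
    vectorizer = CountVectorizer()
    X_train_features = vectorizer.fit_transform(X_train)

    # Train the classifier
    clf = LinearSVC()
    clf.fit(X_train_features, y_train)

    # Evaluate on the held-out split
    accuracy_score(y_test, y_pred)
    precision_score(...)
    confusion_matrix(...)

    # Tune the C hyperparameter
    RandomizedSearchCV(..., param_distributions={'C': [...]})

| Metric | Score |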
|---|---|
| Accuracy | 74.5% |
| Precision | 74.1% |
| Recall | 74.5% |
| F1 Score | 74.2% |
| Best C Value | 0.1 |
| Best CV Score | 75.2% |
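
The reported recall (74.5%) matches the reported accuracy (74.5%), which is what support-weighted averaging produces for a multi-class problem. The README does not state the averaging mode, so `average="weighted"` in the sketch below is an assumption, and the labels are invented stand-ins for the real test split and predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Invented labels standing in for the real test split and model predictions.
y_test = ["Anxiety", "Anxiety", "Depression", "Depression", "Relief", "Relief"]
y_pred = ["Anxiety", "Depression", "Depression", "Depression", "Relief", "Anxiety"]


def summarize(y_true, y_hat):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_hat, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = summarize(y_test, y_pred)
class_order = ["Anxiety", "Depression", "Relief"]
matrix = confusion_matrix(y_test, y_pred, labels=class_order)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(matrix)  # rows are true labels, columns are predicted labels
```

With weighted averaging, recall always equals accuracy, so precision and F1 are the columns that add information beyond the accuracy figure; the confusion matrix shows which labels get mistaken for which.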
To run the code:
- Download the dataset from Kaggle.
- Install required libraries: `pip install nltk scikit-learn pandas`
- Upload the CSV file and run the notebook.