YouTube Comment Analytics with Kafka, PySpark, and Hugging Face

This project provides a real-time analytics pipeline for YouTube comments, using:

- Kafka for streaming comments.
- PySpark for distributed data processing.
- Hugging Face models for topic modeling and toxicity detection.
- A simple engagement prediction function to estimate user interest.

We designed this project to analyze and categorize YouTube comments using Hugging Face models, classifying content and detecting toxicity in real time. The system also estimates engagement levels from features of each comment.

Project Structure

Youtube-Comment-Analytics/
├── docker-compose.yml       # Docker configuration
├── config.py                # Configuration file with API key, Kafka settings, and video ID
├── kafka/
│   └── producer.py           # Kafka producer script to fetch and send YouTube comments
├── spark/
│   ├── Dockerfile            # PySpark consumer image configuration
│   ├── requirements.txt      # Dependencies for PySpark consumer
│   └── pyspark_app/
│       └── consumer.py       # PySpark consumer script for topic classification and toxicity detection
└── README.md                 # Project documentation

Pipeline Overview

Kafka Producer (producer.py):
- Fetches YouTube comments for a given video ID using the YouTube Data API.
- Sends the comments to a Kafka topic (youtube-comments).
PySpark Consumer (consumer.py):
- Subscribes to the Kafka topic.
- Performs:
  - Topic Classification using Hugging Face (facebook/bart-large-mnli).
  - Toxicity Detection using Hugging Face (unitary/toxic-bert).
  - Engagement Prediction (rule-based, based on comment features).
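
The two model-backed steps above could be wired up roughly as follows. This is a minimal sketch, not the repository's consumer.py: the candidate topic labels, the 0.5 toxicity threshold, and the helper names (`build_classifiers`, `top_topic`, `is_toxic`, `analyze`) are assumptions made for illustration. It uses the `pipeline` function from `transformers`, imported lazily so the parsing helpers can be exercised without downloading either model.

```python
from typing import Any, Dict, List

# Hypothetical topic labels; the README does not list the ones actually used.
CANDIDATE_TOPICS = ["music", "gaming", "education", "news", "other"]


def build_classifiers():
    """Load both Hugging Face pipelines (models are downloaded on first use)."""
    from transformers import pipeline  # lazy import: heavy dependency

    topic_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert")
    return topic_clf, toxicity_clf


def top_topic(zero_shot_result: Dict[str, Any]) -> str:
    """Zero-shot results list labels by descending score; take the first."""
    return zero_shot_result["labels"][0]


def is_toxic(classification: List[Dict[str, Any]], threshold: float = 0.5) -> bool:
    """Flag a comment when the top label's score reaches the threshold.

    Assumes every label the model emits is a toxicity category and that
    0.5 is an acceptable cutoff; both are assumptions of this sketch.
    """
    return classification[0]["score"] >= threshold


def analyze(text: str, topic_clf, toxicity_clf) -> Dict[str, Any]:
    """Run both classifiers on one comment and return a flat record."""
    topic = top_topic(topic_clf(text, candidate_labels=CANDIDATE_TOPICS))
    toxic = is_toxic(toxicity_clf(text))
    return {"text": text, "topic": topic, "toxic": toxic}
```

Calling `analyze(comment, *build_classifiers())` would classify a single comment. Passing the classifiers in as arguments keeps model loading separate from per-comment work, so the models are loaded once rather than once per message.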

Analysis and AI Components

- Sentiment Analysis: Provides insights into the emotional tone of comments (e.g., positive, neutral, negative).
- Toxicity Detection: Identifies whether a comment contains offensive or harmful language.
- Keyword Frequency Analysis: Calculates the most frequent keywords or phrases used in the comments to understand trending topics.
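
Keyword frequency counting, as described above, can be sketched with the standard library alone. The tokenization rule, the short stop-word list, and the function name `top_keywords` are assumptions for illustration; the project may compute this differently.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Small illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "this", "to", "of", "it", "i", "so"}


def top_keywords(comments: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count non-stop-word tokens across all comments and return the n most common."""
    counts: Counter = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters (apostrophes allowed) as tokens.
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


comments = ["Great video, great editing", "Great tips in this video", "More please"]
print(top_keywords(comments, n=2))  # [('great', 3), ('video', 2)]
```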

This project combines real-time streaming with natural language processing (NLP). The PySpark consumer uses Hugging Face models to detect offensive content and assign topics, and applies a rule-based function to predict user engagement.
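
The engagement predictor is described only as rule-based, so the following is a hypothetical sketch of what such a function might look like. The features (word count, question marks, exclamation marks), the scoring, and the three output levels are all assumptions chosen for illustration, not the rules used in consumer.py.

```python
def predict_engagement(comment: str) -> str:
    """Return 'high', 'medium', or 'low' from simple surface features.

    Every rule here is an assumed heuristic for illustration only.
    """
    score = 0
    if len(comment.split()) >= 10:  # assumed: longer comments show more investment
        score += 1
    if "?" in comment:              # assumed: questions may invite replies
        score += 1
    if "!" in comment:              # assumed: exclamation marks show a strong reaction
        score += 1

    if score >= 2:
        return "high"
    return "medium" if score == 1 else "low"
```

A function of this shape is cheap enough to run on every streamed comment, which is one reason a rule-based approach is a reasonable starting point before training a model.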

Setup Instructions

Pre-requisites

- Docker installed and running.
- Python 3.8+ with venv or conda installed (for local development).
- A YouTube Data API key: Get this from Google Cloud Console.

Steps

Clone this repository:

git clone https://github.com/jwalith/Youtube-Comment-Analytics.git
cd Youtube-Comment-Analytics

Create and activate your Python virtual environment:

python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows
source .venv/bin/activate     # macOS/Linux

Install dependencies (for the YouTube comments producer):

pip install google-api-python-client kafka-python python-dotenv

Add your YouTube API key and video ID to config.py:

config = {
    "google_api_key": "YOUR_API_KEY",
    "playlistId": "YOUR_PLAYLIST_ID",
    "KAFKA_BROKER": "localhost:9092",
    "TOPIC_NAME": "youtube-comments",
    "video_id": "YOUR_VIDEO_ID"
}
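
Because the placeholder values above are easy to leave in place by accident, a small check before starting the producer can catch an unfilled entry early. This is a sketch under assumptions: `find_unset_keys` is a hypothetical helper, not part of the repository, and it treats any empty value or value starting with `YOUR_` as unset.

```python
from typing import Dict, List

# Same shape as the config.py dictionary shown above.
config = {
    "google_api_key": "YOUR_API_KEY",
    "playlistId": "YOUR_PLAYLIST_ID",
    "KAFKA_BROKER": "localhost:9092",
    "TOPIC_NAME": "youtube-comments",
    "video_id": "YOUR_VIDEO_ID",
}


def find_unset_keys(cfg: Dict[str, str]) -> List[str]:
    """Return keys whose value is empty or still a YOUR_... placeholder."""
    return [key for key, value in cfg.items() if not value or value.startswith("YOUR_")]


missing = find_unset_keys(config)
if missing:
    print("Fill in these config values before running the producer:", missing)
```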

Start the Kafka and PySpark environment using Docker:
```
docker-compose up -d
```
Run the YouTube comments producer:
```
python kafka/producer.py
```
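
For orientation, the producer's job of turning a YouTube Data API response into Kafka messages might look like the sketch below. It is not the repository's producer.py. The record fields, the JSON encoding, and the helper names (`extract_comments`, `serialize`, `publish`) are assumptions; the response layout follows the `commentThreads.list` resource, and `producer` is expected to behave like a kafka-python `KafkaProducer`.

```python
import json
from typing import Any, Dict, Iterator


def extract_comments(response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Flatten a commentThreads.list response into one record per top-level comment."""
    for item in response.get("items", []):
        snippet = item["snippet"]["topLevelComment"]["snippet"]
        yield {
            "comment_id": item["id"],
            "author": snippet.get("authorDisplayName"),
            "text": snippet.get("textDisplay"),
            "likes": snippet.get("likeCount", 0),
            "published_at": snippet.get("publishedAt"),
        }


def serialize(record: Dict[str, Any]) -> bytes:
    """Encode a record as UTF-8 JSON (the message format assumed in this sketch)."""
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def publish(response: Dict[str, Any], producer, topic: str = "youtube-comments") -> int:
    """Send every comment in the response to Kafka and return how many were sent."""
    sent = 0
    for record in extract_comments(response):
        producer.send(topic, value=serialize(record))
        sent += 1
    producer.flush()  # block until buffered messages are delivered
    return sent
```

Keeping extraction and serialization as separate pure functions means they can be checked without a running broker; only `publish` needs a producer object.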
Monitor PySpark consumer logs (for real-time processing):
```
docker logs -f spark_consumer
```

Future Enhancements

- Add sentiment analysis and language translation.
- Store processed results in a database or dashboard (e.g., Postgres, Elasticsearch).
- Improve engagement prediction using a trained ML model.
- Handle large datasets using Spark clusters with multiple worker nodes.

This project is licensed under the MIT License.

Feel free to submit pull requests or issues for improvements and feature requests.
