
Trusty Krusty Reviews - Review Quality Assessment System

Developed by Team JimJamSlam

Project Overview

Trusty Krusty Reviews is an ML-based system that evaluates the quality and relevancy of Google location reviews. It improves trust in reviews by automatically classifying them, filtering out problematic content, and visually highlighting suspicious entries, so that each location is represented fairly.

Problem Statement

Challenge: Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews.

Our Solution: We developed a comprehensive review classification system that:

  • Automatically categorizes reviews as Valid, Advertisement, Irrelevant, or Rant
  • Provides confidence scores and violation detection
  • Highlights suspicious content with visual indicators
  • Offers both rule-based and ML-based classification approaches
  • Enables bulk processing and export of cleaned datasets

This addresses the core challenge of maintaining review quality and relevancy at scale, helping users make informed decisions based on trustworthy location-based feedback.


Development Tools Used

  • IDE/Development Environment: VSCode, Jupyter Notebook
  • Version Control: Git/GitHub
  • Data Processing: Jupyter Notebook for model training (training.ipynb)
  • Web Framework: Streamlit for the interactive application
  • Model Development: Local Python environment with GPU support

APIs Used

Note: The system was designed to work locally without requiring external API keys for inference, making it suitable for hackathon environments with cost constraints.

Libraries and Frameworks Used

Core ML/AI Libraries

  • PyTorch (2.6.0): Deep learning framework for model training and inference
  • Transformers (4.52.0): Hugging Face transformers for pre-trained language models
  • Sentence-Transformers (3.4.1): For text embeddings and similarity calculations
  • scikit-learn (1.6.1): Traditional ML algorithms and evaluation metrics
  • Datasets (4.0.0): Hugging Face datasets for data handling
  • SafeTensors: Secure tensor serialization

Data Processing & Analysis

  • pandas (2.2.3): Data manipulation and analysis
  • numpy (2.0.2): Numerical computing
  • emoji (2.14.1): Text preprocessing for emoji handling
  • googletrans (4.0.0rc1): Language translation utilities

Web Application

  • Streamlit: Interactive web application framework
  • Streamlit-Folium: Map visualization components

Utilities

  • python-dotenv: Environment variable management
  • watchdog: File system monitoring
  • matplotlib: Data visualization
  • seaborn: Statistical data visualization

Assets and Datasets Used

Primary Dataset

  • Google Local Reviews Dataset: 3,800 Google Maps reviews collected via Apify scrapers
  • Manual Labeling: Semi-automated approach using ChatGPT-5 for initial labels, followed by manual validation

Dataset Structure

Located in dataset/ directory:

  • all_reviews.csv: Complete raw dataset
  • final_df.csv: Processed dataset with features
  • label_data.csv: Labeled training data
  • places_data.csv: Business location metadata

Sample Data

  • assets/sample_reviews.csv: Test dataset with 10 representative reviews for demonstration

Classification Categories

  1. Valid (High Quality): Relevant, meaningful feedback about the location
  2. Advertisement: Promotional content or external links unrelated to the business
  3. Irrelevant: Content unrelated to the location or business
  4. Rant: Emotional, unconstructive, or inappropriate content

Features

  • Automated Review Classification: Multi-class classification with confidence scores
  • Dual-Mode Operation: Business Mode (CSV upload) and Places Mode (live Google Places search)
  • Batch Processing: Handle large CSV datasets efficiently
  • Interactive Dashboard: Real-time metrics and before/after comparison
  • Export Functionality: Download complete datasets with ML predictions and classifications
  • Compact Score Display: Clean table format showing all classification confidence scores
  • Local Processing: No external API dependencies for inference

Setup Instructions

Prerequisites

  • Python 3.8 or higher
  • Git (optional, for cloning)

Installation

  1. Clone the repository

    git clone https://github.com/alvinnnnnnnnnn/TikTokJamFest.git
    cd TikTokJamFest
  2. Create and activate virtual environment

    # Create virtual environment
    python3 -m venv venv
    
    # Activate (macOS/Linux)
    source venv/bin/activate
    
    # Activate (Windows)
    venv\Scripts\activate
  3. Upgrade pip and install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt

Usage

Running the Application

streamlit run app.py

The app will open automatically at http://localhost:8501

Basic Workflow

Business Mode (CSV Analysis)

  1. Select Business Mode: Click "🏢 Business Mode - CSV Analysis" in the sidebar
  2. Upload CSV Data: Use the main area file uploader
  3. Review Results:
    • Feed Tab: Compare raw vs processed reviews with classification badges and scores
    • Metrics Tab: View classification statistics and breakdowns
    • Export Tab: Download complete dataset with ML predictions

Places Mode (Live Google Places Search)

  1. Select Places Mode: Click "🌍 Places Mode - Live Review Analysis" in the sidebar
  2. Enable Location: Grant location permission for nearby search
  3. Search Places: Enter business names or categories to find locations
  4. Browse Results: Navigate through places with pagination and view classified reviews

CSV Format Requirements

Your input CSV should contain these columns:

  • review_id or id: Unique identifier
  • review_text or text: The review content
  • rating or score: Star rating (1-5)
  • user or reviewer: Username/reviewer
  • timestamp or date: Review date/time
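Since each field accepts two alternative column names, uploaded CSVs need to be normalized before processing. The sketch below shows one way this alias resolution might work; the alias sets mirror the list above, but the exact normalization logic inside app.py may differ.

```python
# Accepted aliases for each canonical column (mirrors the list above;
# the exact normalization in app.py may differ).
COLUMN_ALIASES = {
    "review_id": ("review_id", "id"),
    "review_text": ("review_text", "text"),
    "rating": ("rating", "score"),
    "user": ("user", "reviewer"),
    "timestamp": ("timestamp", "date"),
}

def resolve_columns(headers):
    """Map CSV header names onto canonical names; report any missing."""
    mapping, missing = {}, []
    for canonical, aliases in COLUMN_ALIASES.items():
        found = next((a for a in aliases if a in headers), None)
        if found:
            mapping[found] = canonical
        else:
            missing.append(canonical)
    return mapping, missing

mapping, missing = resolve_columns(["id", "text", "score", "reviewer", "date"])
print(mapping)   # each alias mapped to its canonical name
print(missing)   # [] when all required columns are present
```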

How to Reproduce Results

1. Model Training

# Open the training notebook
jupyter notebook training.ipynb

# Follow the notebook cells to:
# - Load and preprocess the dataset
# - Engineer features (review length, relevance scores)
# - Train the multi-modal model
# - Evaluate performance metrics

2. Testing with Sample Data

# Run the app
streamlit run app.py

# Upload the sample dataset
# File: assets/sample_reviews.csv

# Expected results:
# - 4 Valid reviews (Sarah_M, John_D, CoupleGoals, FoodieExpert)
# - 2 Ad reviews (PromoBot_123, FakeAccount_99)
# - 2 Rant reviews (AngryCustomer, LazyReviewer)
# - 2 Irrelevant reviews (RandomReviewer, SecondHand_Info)

3. Performance Evaluation

The model achieves the following metrics on our test dataset:

Metric               Validation Set   Test Set
Weighted F1-Score    0.81             0.78
Weighted Precision   0.78             0.75
Weighted Recall      0.84             0.81
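"Weighted" here means each class's score is averaged in proportion to its support, which matters given the class imbalance noted below. The pure-Python sketch below reproduces the weighted-F1 computation (equivalent to scikit-learn's f1_score with average='weighted'); the toy labels are illustrative, not the project's data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support — the same quantity
    sklearn reports for f1_score(average='weighted')."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1   # weight each class by its support
    return score

y_true = ["Valid", "Valid", "Ad", "Rant", "Irrelevant", "Valid"]
y_pred = ["Valid", "Ad",    "Ad", "Rant", "Valid",      "Valid"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.611
```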

4. Large Dataset Testing

Generate a larger test dataset for performance evaluation:

import pandas as pd
import random

base_reviews = [
    'Great food and amazing service!',
    'Visit our website www.deals.com for offers!',
    'TERRIBLE PLACE!!!! WORST EVER!!!!',
    'Never been here but heard bad things.',
    'Nice atmosphere, good food.',
]

# Generate 1000 rows matching the CSV format requirements above
data = []
for i in range(1000):
    review = random.choice(base_reviews)
    data.append({
        'review_id': f'rev_{i}',
        'review_text': f'{review} #{i}',
        'rating': random.randint(1, 5),
        'user': f'User_{i}',
        'timestamp': f'2024-01-{(i%30)+1:02d}'
    })

df = pd.DataFrame(data)
df.to_csv('large_test.csv', index=False)

Implementation Details

1. Feature Engineering

From the raw dataset, additional features were engineered:

  1. Review Length: Word count analysis for quality assessment
  2. Relevance Scoring: Cross-encoder model comparing review text with business category keywords (0-1 similarity score)
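The two engineered features above can be sketched as follows. Review length is a plain word count; for relevance scoring, the real pipeline uses a sentence-transformers cross-encoder against business-category keywords, so the token-overlap similarity below is only a runnable stand-in, not the actual model.

```python
import re

def review_length(text):
    """Word count — a simple quality signal."""
    return len(re.findall(r"\w+", text))

def relevance_score(review, category_keywords):
    """Stand-in for the cross-encoder similarity used in training.ipynb:
    plain token overlap normalized to [0, 1]. The real pipeline scores
    the review text against business-category keywords with a
    sentence-transformers cross-encoder."""
    tokens = set(re.findall(r"\w+", review.lower()))
    keywords = set(k.lower() for k in category_keywords)
    return len(tokens & keywords) / len(keywords) if keywords else 0.0

review = "Great food and friendly service at this restaurant"
print(review_length(review))                                               # → 8
print(relevance_score(review, ["food", "service", "menu", "restaurant"]))  # → 0.75
```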

2. Multi-Modal Model Architecture

  • Text Encoder: DistilRoBERTa for review text processing
  • Feature Integration: Combines text embeddings with numerical features
  • Classification Head: Multi-class output with confidence scores
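The feature-integration step can be illustrated with a minimal NumPy forward pass: the encoder's pooled text embedding is concatenated with the numerical features, then a linear head with a softmax produces per-class confidences. The 768 dimension matches DistilRoBERTa's hidden size, but the weights here are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: DistilRoBERTa's pooled output is 768-dim;
# weights below are random placeholders, not trained parameters.
TEXT_DIM, NUM_FEATURES, NUM_CLASSES = 768, 2, 4

text_embedding = rng.standard_normal(TEXT_DIM)   # from the text encoder
numeric = np.array([42.0, 0.8])                  # review length, relevance score

# Feature integration: concatenate text and numeric features,
# then apply a linear classification head with softmax confidences.
fused = np.concatenate([text_embedding, numeric])
W = rng.standard_normal((NUM_CLASSES, TEXT_DIM + NUM_FEATURES)) * 0.01
b = np.zeros(NUM_CLASSES)
logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # confidence per class

print(probs.shape)  # (4,) — one confidence score per class
```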

3. Rule-Based Classification

Complementary rule-based system for high-precision detection:

  • Empty/missing text β†’ Low Quality
  • URL detection β†’ Advertisement check
  • Keyword matching β†’ Advertisement flagging
  • Pattern recognition β†’ Rant/spam detection
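A minimal sketch of this rule cascade is below. The keyword list, shouting threshold, and label strings are illustrative assumptions, not the exact values used in the app; anything the rules don't catch falls through to the ML model.

```python
import re

# Illustrative keyword list — the app's actual list may differ.
AD_KEYWORDS = {"promo", "discount", "visit our website", "follow us"}

def rule_based_label(text):
    """High-precision first-pass rules mirroring the checks above.
    Thresholds and keywords are illustrative assumptions."""
    if not text or not text.strip():
        return "Low Quality"            # empty/missing text
    lowered = text.lower()
    if re.search(r"https?://|www\.", lowered):
        return "Advertisement"          # URL detection
    if any(k in lowered for k in AD_KEYWORDS):
        return "Advertisement"          # keyword matching
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in text) / len(letters) > 0.7 and "!" in text:
        return "Rant"                   # shouting + exclamation pattern
    return "Needs ML"                   # fall through to the ML model

print(rule_based_label("Visit www.deals.com for offers!"))    # → Advertisement
print(rule_based_label("TERRIBLE PLACE!!!! WORST EVER!!!!"))  # → Rant
print(rule_based_label("Nice atmosphere, good food."))        # → Needs ML
```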

Troubleshooting

Common Issues

Command Not Found Error:

# Ensure virtual environment is activated
source venv/bin/activate
pip install -r requirements.txt
streamlit --version  # Verify installation

Port Already in Use:

streamlit run app.py --server.port 8502

Import Errors:

  • Ensure you're in the project root directory
  • Verify all dependencies are installed
  • Check Python path includes project directory

Deployment

This application can be deployed to:

  • Streamlit Cloud (recommended)
  • Hugging Face Spaces
  • Any platform supporting Streamlit apps

Reflections & Future Improvements

Current Limitations

  1. Dataset Imbalance: Majority of reviews categorized as High Quality, potentially causing model bias
  2. Limited User Context: Could benefit from reviewer history and profile metadata for stronger signals
  3. Language Support: Currently optimized for English reviews

Future Enhancements

  1. Active Learning: Incorporate user feedback to improve model accuracy
  2. Multilingual Support: Extend classification to multiple languages
  3. Real-time Processing: Stream processing for live review analysis
  4. Advanced Features: Sentiment analysis, topic modeling, and trend detection

Team Contributions

Alvin

  • Data Collection: Scraping data from Apify using Google Maps Reviews Scraper and Google Maps Scraper
  • Data Processing: Development of code for data preprocessing and feature engineering
  • Model Development: Development and training of the multi-modal machine learning model
  • Backend Logic: Implementation of classification algorithms and rule-based systems

Kerway

  • Model Development: Development and training of machine learning model architecture
  • Documentation: Devpost submission preparation and technical documentation
  • UI Development: User interface design and implementation for the Streamlit application
  • Project Management: Coordination of development workflows and deliverables

Yuechen

  • Web Application: Full-stack development of the Streamlit web application
  • Video Production: Creation of demonstration video showcasing the system capabilities
  • Team Coordination: Project management, timeline coordination, and team communication
  • Integration: System integration and deployment preparation

License

This project is open source and available under the MIT License.
