
Trusty Krusty Reviews - Review Quality Assessment System

Developed by Team JimJamSlam

Project Overview

Trusty Krusty Reviews is an ML-based system that evaluates the quality and relevancy of Google location reviews. It improves trust in reviews by automatically classifying them, filtering out problematic content, and visually highlighting suspicious entries, so that each location is represented fairly.

Problem Statement

Challenge: Design and implement an ML-based system to evaluate the quality and relevancy of Google location reviews.

Our Solution: We developed a comprehensive review classification system that:

  • Automatically categorizes reviews as Valid, Advertisement, Irrelevant, or Rant
  • Provides confidence scores and violation detection
  • Highlights suspicious content with visual indicators
  • Offers both rule-based and ML-based classification approaches
  • Enables bulk processing and export of cleaned datasets

This addresses the core challenge of maintaining review quality and relevancy at scale, helping users make informed decisions based on trustworthy location-based feedback.


Development Tools Used

  • IDE/Development Environment: VSCode, Jupyter Notebook
  • Version Control: Git/GitHub
  • Data Processing: Jupyter Notebook for model training (training.ipynb)
  • Web Framework: Streamlit for the interactive application
  • Model Development: Local Python environment with GPU support

APIs Used

Note: The system was designed to work locally without requiring external API keys for inference, making it suitable for hackathon environments with cost constraints.

Libraries and Frameworks Used

Core ML/AI Libraries

  • PyTorch (2.6.0): Deep learning framework for model training and inference
  • Transformers (4.52.0): Hugging Face transformers for pre-trained language models
  • Sentence-Transformers (3.4.1): For text embeddings and similarity calculations
  • scikit-learn (1.6.1): Traditional ML algorithms and evaluation metrics
  • Datasets (4.0.0): Hugging Face datasets for data handling
  • SafeTensors: Secure tensor serialization

Data Processing & Analysis

  • pandas (2.2.3): Data manipulation and analysis
  • numpy (2.0.2): Numerical computing
  • emoji (2.14.1): Text preprocessing for emoji handling
  • googletrans (4.0.0rc1): Language translation utilities

Web Application

  • Streamlit: Interactive web application framework
  • Streamlit-Folium: Map visualization components

Utilities

  • python-dotenv: Environment variable management
  • watchdog: File system monitoring
  • matplotlib: Data visualization
  • seaborn: Statistical data visualization

Assets and Datasets Used

Primary Dataset

  • Google Local Reviews Dataset: 3,800 Google Maps reviews collected via Apify scrapers
  • Manual Labeling: Semi-automated approach using ChatGPT-5 for initial labels, followed by manual validation

Dataset Structure

Located in dataset/ directory:

  • all_reviews.csv: Complete raw dataset
  • final_df.csv: Processed dataset with features
  • label_data.csv: Labeled training data
  • places_data.csv: Business location metadata

Sample Data

  • assets/sample_reviews.csv: Test dataset with 10 representative reviews for demonstration

Classification Categories

  1. Valid (High Quality): Relevant, meaningful feedback about the location
  2. Advertisement: Promotional content or external links unrelated to the business
  3. Irrelevant: Content unrelated to the location or business
  4. Rant: Emotional, unconstructive, or inappropriate content

Features

  • Automated Review Classification: Multi-class classification with confidence scores
  • Dual-Mode Operation: Business Mode (CSV upload) and Places Mode (live Google Places search)
  • Batch Processing: Handle large CSV datasets efficiently
  • Interactive Dashboard: Real-time metrics and before/after comparison
  • Export Functionality: Download complete datasets with ML predictions and classifications
  • Compact Score Display: Clean table format showing all classification confidence scores
  • Local Processing: No external API dependencies for inference

Setup Instructions

Prerequisites

  • Python 3.8 or higher
  • Git (optional, for cloning)

Installation

  1. Clone the repository

    git clone https://github.com/alvinnnnnnnnnn/TikTokJamFest.git
    cd TikTokJamFest
  2. Create and activate virtual environment

    # Create virtual environment
    python3 -m venv venv
    
    # Activate (macOS/Linux)
    source venv/bin/activate
    
    # Activate (Windows)
    venv\Scripts\activate
  3. Upgrade pip and install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt

Usage

Running the Application

streamlit run app.py

The app will open automatically at http://localhost:8501

Basic Workflow

Business Mode (CSV Analysis)

  1. Select Business Mode: Click "🏢 Business Mode - CSV Analysis" in the sidebar
  2. Upload CSV Data: Use the main area file uploader
  3. Review Results:
    • Feed Tab: Compare raw vs processed reviews with classification badges and scores
    • Metrics Tab: View classification statistics and breakdowns
    • Export Tab: Download complete dataset with ML predictions

Places Mode (Live Google Places Search)

  1. Select Places Mode: Click "🌍 Places Mode - Live Review Analysis" in the sidebar
  2. Enable Location: Grant location permission for nearby search
  3. Search Places: Enter business names or categories to find locations
  4. Browse Results: Navigate through places with pagination and view classified reviews

CSV Format Requirements

Your input CSV should contain these columns:

  • review_id or id: Unique identifier
  • review_text or text: The review content
  • rating or score: Star rating (1-5)
  • user or reviewer: Username/reviewer
  • timestamp or date: Review date/time
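Since each field accepts two alternative column names, uploaded CSVs need to be normalized before processing. The sketch below shows one way this alias resolution might work; the alias sets mirror the list above, but the exact normalization logic inside app.py may differ.

```python
# Accepted aliases for each canonical column (mirrors the list above;
# the exact normalization in app.py may differ).
COLUMN_ALIASES = {
    "review_id": ("review_id", "id"),
    "review_text": ("review_text", "text"),
    "rating": ("rating", "score"),
    "user": ("user", "reviewer"),
    "timestamp": ("timestamp", "date"),
}

def resolve_columns(headers):
    """Map CSV header names onto canonical names; report any missing."""
    mapping, missing = {}, []
    for canonical, aliases in COLUMN_ALIASES.items():
        found = next((a for a in aliases if a in headers), None)
        if found:
            mapping[found] = canonical
        else:
            missing.append(canonical)
    return mapping, missing

mapping, missing = resolve_columns(["id", "text", "score", "reviewer", "date"])
print(mapping)   # each alias mapped to its canonical name
print(missing)   # [] when all required columns are present
```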

How to Reproduce Results

1. Model Training

# Open the training notebook
jupyter notebook training.ipynb

# Follow the notebook cells to:
# - Load and preprocess the dataset
# - Engineer features (review length, relevance scores)
# - Train the multi-modal model
# - Evaluate performance metrics

2. Testing with Sample Data

# Run the app
streamlit run app.py

# Upload the sample dataset
# File: assets/sample_reviews.csv

# Expected results:
# - 4 Valid reviews (Sarah_M, John_D, CoupleGoals, FoodieExpert)
# - 2 Ad reviews (PromoBot_123, FakeAccount_99)
# - 2 Rant reviews (AngryCustomer, LazyReviewer)
# - 2 Irrelevant reviews (RandomReviewer, SecondHand_Info)

3. Performance Evaluation

The model achieves the following metrics on our test dataset:

Metric               Validation Set   Test Set
Weighted F1-Score    0.81             0.78
Weighted Precision   0.78             0.75
Weighted Recall      0.84             0.81
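"Weighted" here means each class's score is averaged in proportion to its support, which matters given the class imbalance noted below. The pure-Python sketch below reproduces the weighted-F1 computation (equivalent to scikit-learn's f1_score with average='weighted'); the toy labels are illustrative, not the project's data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support — the same quantity
    sklearn reports for f1_score(average='weighted')."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1   # weight each class by its support
    return score

y_true = ["Valid", "Valid", "Ad", "Rant", "Irrelevant", "Valid"]
y_pred = ["Valid", "Ad",    "Ad", "Rant", "Valid",      "Valid"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.611
```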

4. Large Dataset Testing

Generate a larger test dataset for performance evaluation:

import pandas as pd
import random

base_reviews = [
    'Great food and amazing service!',
    'Visit our website www.deals.com for offers!',
    'TERRIBLE PLACE!!!! WORST EVER!!!!',
    'Never been here but heard bad things.',
    'Nice atmosphere, good food.',
]

# Generate 1000 rows matching the CSV format requirements above
data = []
for i in range(1000):
    review = random.choice(base_reviews)
    data.append({
        'review_id': f'rev_{i}',
        'review_text': f'{review} #{i}',
        'rating': random.randint(1, 5),
        'user': f'User_{i}',
        'timestamp': f'2024-01-{(i%30)+1:02d}'
    })

df = pd.DataFrame(data)
df.to_csv('large_test.csv', index=False)

Implementation Details

1. Feature Engineering

From the raw dataset, additional features were engineered:

  1. Review Length: Word count analysis for quality assessment
  2. Relevance Scoring: Cross-encoder model comparing review text with business category keywords (0-1 similarity score)
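The two engineered features above can be sketched as follows. Review length is a plain word count; for relevance scoring, the real pipeline uses a sentence-transformers cross-encoder against business-category keywords, so the token-overlap similarity below is only a runnable stand-in, not the actual model.

```python
import re

def review_length(text):
    """Word count — a simple quality signal."""
    return len(re.findall(r"\w+", text))

def relevance_score(review, category_keywords):
    """Stand-in for the cross-encoder similarity used in training.ipynb:
    plain token overlap normalized to [0, 1]. The real pipeline scores
    the review text against business-category keywords with a
    sentence-transformers cross-encoder."""
    tokens = set(re.findall(r"\w+", review.lower()))
    keywords = set(k.lower() for k in category_keywords)
    return len(tokens & keywords) / len(keywords) if keywords else 0.0

review = "Great food and friendly service at this restaurant"
print(review_length(review))                                               # → 8
print(relevance_score(review, ["food", "service", "menu", "restaurant"]))  # → 0.75
```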

2. Multi-Modal Model Architecture

  • Text Encoder: DistilRoBERTa for review text processing
  • Feature Integration: Combines text embeddings with numerical features
  • Classification Head: Multi-class output with confidence scores
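The feature-integration step can be illustrated with a minimal NumPy forward pass: the encoder's pooled text embedding is concatenated with the numerical features, then a linear head with a softmax produces per-class confidences. The 768 dimension matches DistilRoBERTa's hidden size, but the weights here are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: DistilRoBERTa's pooled output is 768-dim;
# weights below are random placeholders, not trained parameters.
TEXT_DIM, NUM_FEATURES, NUM_CLASSES = 768, 2, 4

text_embedding = rng.standard_normal(TEXT_DIM)   # from the text encoder
numeric = np.array([42.0, 0.8])                  # review length, relevance score

# Feature integration: concatenate text and numeric features,
# then apply a linear classification head with softmax confidences.
fused = np.concatenate([text_embedding, numeric])
W = rng.standard_normal((NUM_CLASSES, TEXT_DIM + NUM_FEATURES)) * 0.01
b = np.zeros(NUM_CLASSES)
logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # confidence per class

print(probs.shape)  # (4,) — one confidence score per class
```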

3. Rule-Based Classification

Complementary rule-based system for high-precision detection:

  • Empty/missing text β†’ Low Quality
  • URL detection β†’ Advertisement check
  • Keyword matching β†’ Advertisement flagging
  • Pattern recognition β†’ Rant/spam detection
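A minimal sketch of this rule cascade is below. The keyword list, shouting threshold, and label strings are illustrative assumptions, not the exact values used in the app; anything the rules don't catch falls through to the ML model.

```python
import re

# Illustrative keyword list — the app's actual list may differ.
AD_KEYWORDS = {"promo", "discount", "visit our website", "follow us"}

def rule_based_label(text):
    """High-precision first-pass rules mirroring the checks above.
    Thresholds and keywords are illustrative assumptions."""
    if not text or not text.strip():
        return "Low Quality"            # empty/missing text
    lowered = text.lower()
    if re.search(r"https?://|www\.", lowered):
        return "Advertisement"          # URL detection
    if any(k in lowered for k in AD_KEYWORDS):
        return "Advertisement"          # keyword matching
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in text) / len(letters) > 0.7 and "!" in text:
        return "Rant"                   # shouting + exclamation pattern
    return "Needs ML"                   # fall through to the ML model

print(rule_based_label("Visit www.deals.com for offers!"))    # → Advertisement
print(rule_based_label("TERRIBLE PLACE!!!! WORST EVER!!!!"))  # → Rant
print(rule_based_label("Nice atmosphere, good food."))        # → Needs ML
```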

Troubleshooting

Common Issues

Command Not Found Error:

# Ensure virtual environment is activated
source venv/bin/activate
pip install -r requirements.txt
streamlit --version  # Verify installation

Port Already in Use:

streamlit run app.py --server.port 8502

Import Errors:

  • Ensure you're in the project root directory
  • Verify all dependencies are installed
  • Check Python path includes project directory

Deployment

This application can be deployed to:

  • Streamlit Cloud (recommended)
  • Hugging Face Spaces
  • Any platform supporting Streamlit apps

Reflections & Future Improvements

Current Limitations

  1. Dataset Imbalance: Majority of reviews categorized as High Quality, potentially causing model bias
  2. Limited User Context: Could benefit from reviewer history and profile metadata for stronger signals
  3. Language Support: Currently optimized for English reviews

Future Enhancements

  1. Active Learning: Incorporate user feedback to improve model accuracy
  2. Multilingual Support: Extend classification to multiple languages
  3. Real-time Processing: Stream processing for live review analysis
  4. Advanced Features: Sentiment analysis, topic modeling, and trend detection

Team Contributions

Alvin

  • Data Collection: Scraping data from Apify using Google Maps Reviews Scraper and Google Maps Scraper
  • Data Processing: Development of code for data preprocessing and feature engineering
  • Model Development: Development and training of the multi-modal machine learning model
  • Backend Logic: Implementation of classification algorithms and rule-based systems

Kerway

  • Model Development: Development and training of machine learning model architecture
  • Documentation: Devpost submission preparation and technical documentation
  • UI Development: User interface design and implementation for the Streamlit application
  • Project Management: Coordination of development workflows and deliverables

Yuechen

  • Web Application: Full-stack development of the Streamlit web application
  • Video Production: Creation of demonstration video showcasing the system capabilities
  • Team Coordination: Project management, timeline coordination, and team communication
  • Integration: System integration and deployment preparation

License

This project is open source and available under the MIT License.
