TechJam2025 Review Quality Project

Overview

This project builds a full pipeline for assessing the quality and genuineness of Google location reviews.
The system detects spam, ads, irrelevant reviews, and rants, while also scoring relevancy and visit likelihood.
We leverage rule-based heuristics, silver labeling with LLMs, and traditional ML training to produce reliable classifiers for policy enforcement.


Prerequisites

Setup Environment

conda create --name ratu python=3.12
conda activate ratu
pip install -r requirements.txt

Input Data

Due to time constraints, we use a smaller dataset limited to the Alaska region, available at https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal. At the root of the project directory, create a datasets/ folder containing both files:

datasets/<thedataset>  # json.gz
datasets/<meta>        # json.gz

We will use both the metadata for Alaska and the Alaska review subset.
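The files in this collection are gzipped JSON-lines (one JSON object per line). A minimal loading sketch, assuming that format; the file name below is a placeholder for whichever Alaska file you downloaded:

```python
import gzip
import json

def load_jsonl_gz(path):
    """Yield one record per line from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Placeholder name; substitute the actual Alaska file you downloaded.
reviews = load_jsonl_gz("datasets/review-Alaska.json.gz")
print(next(reviews))
```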

Environment variables

Create a .env file at the repository root with your credentials/tokens:

SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
SUPABASE_ANON_KEY=your_anon_key
HF_TOKEN=your_huggingface_token

These are loaded by the scripts at runtime. Ensure the .env file is not committed to version control.
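For reference, a minimal sketch of how such loading typically works with python-dotenv; whether the scripts use this exact package is an assumption:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

SUPABASE_URL = os.environ["SUPABASE_URL"]  # raises KeyError if missing
HF_TOKEN = os.environ["HF_TOKEN"]
```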

Hugging Face setup (token, login, Gemma access)

  1. Create an account and access token at https://huggingface.co (under Settings → Access Tokens).
  2. Log in via CLI (recommended):
conda activate ratu
pip install -U "huggingface_hub[cli]"
huggingface-cli login
  • Windows PowerShell non-interactive option:
$env:HF_TOKEN="paste-your-token-here"
huggingface-cli login --token $env:HF_TOKEN
  3. Add the token to .env (alternative)
  • You can also set HF_TOKEN in your .env (see the section above). On Windows PowerShell, for the current session only:
$env:HF_TOKEN="paste-your-token-here"
  4. Request access to Gemma
  • Visit the model page and click "Agree and access" / "Request access".
  • Access must be granted for your account before running the silver labeling pipeline.
  5. Verify setup:
huggingface-cli whoami
python -c "from huggingface_hub import HfApi; print(HfApi().model_info('google/gemma-3-1b-it').id)"

Pipelines

1. DataCollectionPipeline

  • Collects and preprocesses raw review data.
  • Adds metadata features (length, emojis, rating deviation, etc.).
  • Ensures review_id is generated consistently.
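A minimal sketch of what such features might look like. The record fields follow the public Google Local schema, and the ID scheme shown here is an illustrative assumption, not necessarily the pipeline's:

```python
import hashlib
import re

# Rough emoji match over common emoji code-point ranges.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def basic_features(review: dict, place_avg_rating: float) -> dict:
    """Derive simple metadata features from one review record."""
    text = review.get("text") or ""
    return {
        # Deterministic ID from fields that should not change between runs
        # (the real pipeline's ID scheme may differ).
        "review_id": hashlib.md5(
            f"{review['user_id']}|{review['gmap_id']}|{review['time']}".encode()
        ).hexdigest(),
        "char_len": len(text),
        "word_count": len(text.split()),
        "emoji_count": len(EMOJI_RE.findall(text)),
        "rating_deviation": review["rating"] - place_avg_rating,
    }
```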

2. FeatureEngineeringPipeline

  • Applies handcrafted rules (regex, heuristics) to flag spam, ads, irrelevant, and rant reviews.
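To illustrate the kind of regex heuristics involved; the patterns below are assumptions for demonstration, not the project's actual rule set:

```python
import re

# Illustrative patterns only; the real rules are likely more extensive.
RULES = {
    "ads_promo": re.compile(r"(promo\s*code|discount|visit\s+our\s+website|https?://)", re.I),
    "spam_low_quality": re.compile(r"(.)\1{5,}|^.{0,3}$"),  # long character runs or near-empty text
    "rant_no_visit": re.compile(r"never\s+been|didn'?t\s+(go|visit)", re.I),
}

def rule_flags(text: str) -> dict[str, bool]:
    """Return one boolean flag per rule for a single review text."""
    return {name: bool(pattern.search(text)) for name, pattern in RULES.items()}

print(rule_flags("Use promo code SAVE20 at https://example.com"))
# {'ads_promo': True, 'spam_low_quality': False, 'rant_no_visit': False}
```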

3. RuleBasedPipeline

  • Produces rule scores and binary strong indicators.
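One plausible way to collapse boolean rule hits into a score plus a strong-indicator bit; the weights and cutoff here are illustrative, not the pipeline's actual values:

```python
# Illustrative weights and cutoff; the real scoring may differ.
WEIGHTS = {"ads_promo": 1.0, "spam_low_quality": 0.8, "rant_no_visit": 0.6}

def rule_score(flags: dict[str, bool]) -> dict:
    """Combine per-rule hits into a weighted score and a binary strong indicator."""
    score = sum(WEIGHTS[name] for name, hit in flags.items() if hit)
    return {"rule_score": score, "strong_indicator": int(score >= 1.0)}

print(rule_score({"ads_promo": True, "spam_low_quality": False, "rant_no_visit": False}))
# {'rule_score': 1.0, 'strong_indicator': 1}
```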

4. SilverLabelingPipeline

  • Uses an LLM (Gemma-3-12B-it or similar) to produce "silver" probabilistic labels for each review.
  • Extracts structured JSON scores for downstream training.
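A rough sketch of the idea using the transformers text-generation pipeline and the 1B Gemma checkpoint from the verification step above; the prompt wording and the brittle JSON extraction are illustrative assumptions, not the pipeline's actual prompt:

```python
import json
import re
from transformers import pipeline  # requires granted access to the Gemma weights

generator = pipeline("text-generation", model="google/gemma-3-1b-it")

PROMPT = (
    "Score this Google review and reply with a single JSON object whose keys are "
    '"ads_promo", "spam_low_quality", "irrelevant", "rant_no_visit", '
    '"relevancy_score", and "visit_likelihood", each a number in [0, 1].\n'
    "Review: "
)

def silver_label(review_text: str) -> dict | None:
    """Generate and parse one silver label; returns None if parsing fails."""
    out = generator(PROMPT + review_text, max_new_tokens=128)[0]["generated_text"]
    match = re.search(r"\{.*\}", out, re.DOTALL)  # crude grab of the first JSON span
    try:
        return json.loads(match.group(0)) if match else None
    except json.JSONDecodeError:
        return None
```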

5. GoldLabelingPipeline

  • (Optional, when human annotations are available)
  • Stratified sampling of reviews for annotation.
  • Merges human-provided gold labels back into the dataset.
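A sketch of stratified sampling with pandas; the file path and the stratification column are placeholders for the pipeline's actual outputs:

```python
import pandas as pd

# Placeholder path and stratum column; substitute the pipeline's real files.
df = pd.read_json("datasets/reviews_with_rules.jsonl", lines=True)
batch = (
    df.groupby("strong_indicator", group_keys=False)
      .apply(lambda g: g.sample(n=min(50, len(g)), random_state=42))
)
batch.to_csv("annotation_batch.csv", index=False)  # hand this file to annotators
```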

6. DatasetSplitPipeline

  • Cleans dataset, applies hygiene filters.
  • Stratified split into train / validation / test with class balance guarantees.
  • Outputs per-split JSONL and fold CSV for CV.
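A sketch of how such a split and fold assignment can be produced with scikit-learn; the 60/20/20 ratio, the binarization of the silver score, and all paths are assumptions:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

df = pd.read_json("datasets/labeled_reviews.jsonl", lines=True)  # placeholder path
y = (df["spam_low_quality"] > 0.5).astype(int)  # assumed binarization for stratification

# 60/20/20 train/val/test, stratified on the binary target.
train, test = train_test_split(df, test_size=0.2, stratify=y, random_state=42)
train, val = train_test_split(train, test_size=0.25, stratify=y.loc[train.index], random_state=42)

# Per-split JSONL plus fold assignments for cross-validation.
val.to_json("val.jsonl", orient="records", lines=True)
test.to_json("test.jsonl", orient="records", lines=True)
train = train.reset_index(drop=True)
train["fold"] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_train = (train["spam_low_quality"] > 0.5).astype(int)
for fold, (_, idx) in enumerate(skf.split(train, y_train)):
    train.loc[idx, "fold"] = fold
train.to_csv("folds.csv", index=False)
```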

7. ModelTrainingPipeline

  • Trains LightGBM models for classification (ads_promo, spam_low_quality, irrelevant, rant_no_visit)
    and regression (relevancy_score, visit_likelihood).
  • Saves models, tuned thresholds, and validation/test predictions.
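A self-contained sketch on synthetic data, including the kind of threshold tuning the pipeline saves; the hyperparameters and tuning grid are illustrative:

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import f1_score

# Synthetic stand-ins: in the pipeline, X holds engineered features and
# y one binary target (e.g. ads_promo).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X[:800], y[:800])

# Tune the decision threshold on held-out data instead of assuming 0.5.
probs = clf.predict_proba(X[800:])[:, 1]
best = max(np.linspace(0.1, 0.9, 17), key=lambda t: f1_score(y[800:], probs >= t))
print("tuned threshold:", best)
```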

8. EvaluationPipeline

  • Evaluates predictions on val/test splits.
  • Computes metrics (precision, recall, F1, ROC-AUC, PR-AUC, calibration, etc.).
  • Generates plots and Markdown summary reports.
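For reference, computing the named metrics with scikit-learn on placeholder arrays; using the Brier score for calibration is an assumption about what the pipeline reports:

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score, brier_score_loss,
    precision_recall_fscore_support, roc_auc_score,
)

# Placeholder labels and model scores for one split.
y_true = np.array([0, 1, 1, 0, 1, 0])
probs = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4])
preds = probs >= 0.5

p, r, f1, _ = precision_recall_fscore_support(y_true, preds, average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
print("ROC-AUC:", roc_auc_score(y_true, probs))
print("PR-AUC:", average_precision_score(y_true, probs))
print("Brier (calibration):", brier_score_loss(y_true, probs))
```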

9. PolicyEnforcementPipeline

  • Combines model outputs into final policy flags.
  • Flags reviews as ads, spam, irrelevant, rant, or genuine.
  • Produces final dataset for decision-making.
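One plausible combination scheme: thresholded class scores with a fixed precedence order, defaulting to genuine. The flag names match the training targets, but the precedence and thresholds are illustrative; the real pipeline presumably loads the tuned thresholds saved by ModelTrainingPipeline:

```python
def enforce_policy(scores: dict, thresholds: dict) -> str:
    """Return the first class flag whose score clears its threshold, else 'genuine'."""
    for flag in ("ads_promo", "spam_low_quality", "irrelevant", "rant_no_visit"):
        if scores[flag] >= thresholds[flag]:
            return flag
    return "genuine"

print(enforce_policy(
    {"ads_promo": 0.1, "spam_low_quality": 0.7, "irrelevant": 0.2, "rant_no_visit": 0.0},
    {flag: 0.5 for flag in ("ads_promo", "spam_low_quality", "irrelevant", "rant_no_visit")},
))  # -> spam_low_quality
```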

How to Run

| Pipeline | Script Command Example |
| --- | --- |
| Data Collection | python -m scripts.run_data_collection --config configs/data_collection.yaml |
| Feature Engineering | python -m scripts.run_feature_engineering --config configs/feature_engineering.yaml |
| Rule-based Scoring | python -m scripts.run_rules_baseline --config configs/rules_baseline.yaml |
| Silver Labeling | python -m scripts.run_silver_labeling --config configs/silver_labeling.yaml |
| Gold Labeling (optional) | python -m scripts.run_gold_labeling --config configs/gold_labeling.yaml |
| Dataset Split | python -m scripts.run_dataset_split --config configs/dataset_split.yaml |
| Model Training | python -m scripts.run_model_training --config configs/model_training.yaml |
| Evaluation | python -m scripts.run_evaluation --config configs/evaluation.yaml |
| Policy Enforcement | python -m scripts.run_policy_enforcement --config configs/policy_enforcement.yaml |

Demo Script

Run:

python -m scripts.demo --config configs/demo.yaml

The demo will:

  1. Load a small sample of reviews.
  2. Run rule-based checks + silver labeling.
  3. Train lightweight models quickly.
  4. Show evaluation summary and final policy flags.

This demonstrates the end-to-end flow without requiring the full dataset.

For a simple UI demonstration of our policy enforcement, run:

streamlit run streamlit_demo.py
