About Me

I'm a bachelor’s student at Michigan State University, majoring in Data Science.
I have experience with Python, R, SQL, and various data analysis and visualization tools. I'm currently seeking internships and entry-level positions in data science to further develop my skills and contribute to impactful projects.

Projects

Amazon Review Analyzer

A machine-learning project that classifies Amazon product reviews as AI-generated (fake) or human-written using text preprocessing, engineered features, and three model families (TF–IDF + logistic regression, XGBoost, and BERT fine-tuned with a LoRA adapter). Includes a Streamlit web app for live inference and model comparison.

Python, pandas, scikit-learn, xgboost, PyTorch, transformers, peft, streamlit, joblib

Problem

Fake or AI-generated product reviews harm customers and distort product ratings. This project detects likely AI-generated or fake Amazon reviews from review text (and optional rating context) to improve trust and support moderation workflows.

Approach

The pipeline ingests raw CSV review data, cleans and normalizes the text, and engineers features for an XGBoost model. A TF–IDF + logistic regression baseline and a BERT model fine-tuned with LoRA adapters are trained alongside it. Model artifacts are saved to `model/`, and an interactive Streamlit app (`webapp/streamlit_app.py`) supports live classification and feature inspection.
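
The cleaning and feature-engineering step might look roughly like the sketch below. It is an illustration only, not the project's actual code: the function names `clean_text` and `engineer_features` and the specific features (character count, word count, average word length, type-token ratio, exclamation count) are assumptions chosen for the example.

```python
import re


def clean_text(raw: str) -> str:
    """Strip HTML-like tags, lowercase, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # replace tags such as <br> with a space
    text = text.lower()
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
    return text.strip()


def engineer_features(raw: str) -> dict:
    """Return a few hypothetical numeric features for a tree-based model."""
    text = clean_text(raw)
    words = re.findall(r"[a-z0-9']+", text)
    n_words = len(words)
    return {
        "char_count": len(text),
        "word_count": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Share of distinct words; repetitive text scores lower.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "exclamation_count": text.count("!"),
    }


# Made-up review text, used only to exercise the functions.
example = engineer_features("Great  product!<br>Great value, would buy again!")
```

Features like these are cheap to compute and easy to inspect, which suits a feature-inspection view in an app; the real feature set may differ.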

Results & Impact

The repository includes trained artifacts (joblib and PEFT/LoRA adapter files) and utilities for preprocessing, training, and evaluation. The Streamlit app lets users quickly compare model predictions and confidence scores to support moderation or further analysis.

NYC 311 Service Request Analysis

Analyzed NYC 311 complaint data with Athena SQL and a regression modeling workflow to explore service patterns and estimate how long requests take to close. The project combines data preparation, exploratory analysis, and a baseline AWS SageMaker comparison.

SQL, AWS Athena, Python, pandas, scikit-learn, Amazon SageMaker, Jupyter Notebooks

Problem

City service teams need to understand which complaint types and agencies create the most workload and how long requests typically stay open. The goal was to analyze 311 service requests and predict days to close so stakeholders can prioritize responses more effectively.

Approach

Built Athena queries to clean and model a 200k-row sample of NYC 311 complaints, joined complaints to the agencies lookup, and created a modeling dataset with agency, borough, complaint type, ZIP code, time of day, and same-day volume features. Trained a baseline linear regression model and compared it with an AWS SageMaker Linear Learner run using RMSE, MAE, and R^2.
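
For reference, the three comparison metrics can be computed as in the sketch below. It uses plain Python rather than the project's scikit-learn or SageMaker code, the helper name `regression_metrics` is hypothetical, and the sample values are made up for illustration, not taken from the 311 data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when every target value is identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Made-up days-to-close values: actual vs. predicted.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0])
```

RMSE penalizes large misses more heavily than MAE, so an RMSE well above the MAE points to a tail of requests that stay open far longer than predicted.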

Results & Impact

The baseline model reached RMSE 4.092, MAE 1.874, and R^2 0.364 on the modeling dataset. The SageMaker Linear Learner performed similarly with RMSE 4.05, MAE 1.89, and R^2 0.3829, suggesting the simpler scikit-learn approach is adequate for this dataset. The analysis also surfaces complaint patterns by borough and agency for stakeholder review.

CMSE202 Final Project Honors: Disease Agent-Based Model

Built an agent-based model to explore how a COVID-19 outbreak could progress in a Michigan community on May 12, 2020. The project uses object-oriented programming, CDC-based assumptions, and an animation to show how the disease spreads through a simulated population.

Python, NumPy, Matplotlib, Object-Oriented Programming, Jupyter Notebooks, IPython.display, random

Problem

The project asked what an agent-based model of COVID-19 progression would look like for Michigan on May 12, 2020. The goal was to capture how infection and recovery dynamics might evolve under realistic but simplified assumptions about transmission and mortality.

Approach

Researched infection and death rates from CDC sources, then translated those rates into a simplified agent-based model built with object-oriented Python classes. The model used SIR-style health states, scaled down the population for visualization, and produced an animated simulation in Jupyter Notebook.
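
A stripped-down version of such a model could look like the sketch below. It is not the notebook's code: the class names `Person` and `Community`, the random-mixing contact rule, and every parameter value are placeholders chosen for illustration rather than CDC-derived rates. Deaths are not tracked separately here, and the animation is left out.

```python
import random


class Person:
    """One agent with an SIR-style health state: "S", "I", or "R"."""

    def __init__(self, state="S"):
        self.state = state


class Community:
    """A small, randomly mixing population (hypothetical contact rule)."""

    def __init__(self, size, initial_infected, p_infect, p_recover, seed=0):
        self.rng = random.Random(seed)  # seeded so runs are reproducible
        self.people = [
            Person("I" if i < initial_infected else "S") for i in range(size)
        ]
        self.p_infect = p_infect
        self.p_recover = p_recover

    def counts(self):
        tally = {"S": 0, "I": 0, "R": 0}
        for person in self.people:
            tally[person.state] += 1
        return tally

    def step(self, contacts_per_day=3):
        """Advance one day; contacts_per_day must not exceed the population size."""
        infectious = [p for p in self.people if p.state == "I"]
        newly_infected = []
        for _ in infectious:
            # Each infectious agent meets a few randomly chosen agents.
            for other in self.rng.sample(self.people, contacts_per_day):
                if other.state == "S" and self.rng.random() < self.p_infect:
                    newly_infected.append(other)
        for person in infectious:
            if self.rng.random() < self.p_recover:
                person.state = "R"
        # Apply infections last so new cases cannot spread on the same day.
        for person in newly_infected:
            person.state = "I"


community = Community(size=200, initial_infected=5, p_infect=0.1, p_recover=0.1, seed=42)
history = [community.counts()]
for _ in range(30):
    community.step()
    history.append(community.counts())
```

Each entry in `history` holds the S, I, and R counts for one simulated day, which is the kind of series an animation or line plot can be drawn from.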

Results & Impact

The final notebook demonstrated a working animated disease spread simulation and highlighted the effect of assumptions on model output. A key lesson was that naive rate choices produced unrealistically low spread, so the model was revised using weekly positive-test percentages and a smaller display scale to make the dynamics easier to interpret.

Contact