An end-to-end machine learning project that explores historical U.S. baby-name trends and turns the strongest findings into an interactive prediction website.
- Public demo: babynamesprediction.streamlit.app
- Current data coverage: 1880-2024
- Current deployment target: Streamlit Community Cloud on Python 3.10
This project asks a simple but engaging question:
Can historical naming patterns help predict whether a baby name is likely to become highly popular?
To answer that, I combined historical U.S. baby-name data with feature engineering, classification modeling, trend analysis, and a Streamlit app that lets users explore names and test prediction scenarios.
- Historical trend exploration using U.S. baby-name records
- Feature engineering based on name structure and popularity patterns
- Logistic regression modeling for Top 100 popularity prediction
- Additional clustering and time-series experimentation in notebooks
- A portfolio-ready Streamlit interface for trend exploration and prediction
This is more than a notebook-only capstone. It shows an end-to-end workflow:
- Frame a question that non-technical audiences can understand
- Clean and reshape a large historical dataset
- Engineer features for modeling
- Build a predictive workflow
- Turn the analysis into a user-facing app
That combination makes it a strong data + product storytelling project.
The Streamlit app is organized into four sections:
- Overview: project framing, dataset size, and recent top-name examples
- Trend Explorer: compare names across time using count and ratio-based metrics
- Prediction Studio: generate a Top 100 prediction using the trained model
- Project Insights: summarize why the project is interesting and how it can improve
The app lives here:
The main dataset used by the app includes:
- Name
- Year
- Gender
- Count
- Name_Ratio
- Gender_Name_Ratio
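The two ratio columns are not defined in this README, so the following is only a sketch of one plausible reading: Name_Ratio as a name's share of all counted births in a year, and Gender_Name_Ratio as its share within the same year and gender. The function name add_ratios and the sample rows are hypothetical. The denominators here are summed from the records themselves; because the update script also takes an SSA birth totals page, the project may divide by official birth totals instead.

```python
from collections import defaultdict


def add_ratios(records):
    """Return copies of the records with two ratio columns added.

    Assumed definitions (not confirmed by the project):
      Name_Ratio        = Count / total Count for that Year
      Gender_Name_Ratio = Count / total Count for that Year and Gender
    """
    year_totals = defaultdict(int)
    year_gender_totals = defaultdict(int)
    for r in records:
        year_totals[r["Year"]] += r["Count"]
        year_gender_totals[(r["Year"], r["Gender"])] += r["Count"]

    enriched = []
    for r in records:
        row = dict(r)  # copy so the input rows are left unchanged
        row["Name_Ratio"] = r["Count"] / year_totals[r["Year"]]
        row["Gender_Name_Ratio"] = (
            r["Count"] / year_gender_totals[(r["Year"], r["Gender"])]
        )
        enriched.append(row)
    return enriched


# Synthetic rows for illustration only, not real SSA counts.
sample = [
    {"Name": "Ava", "Year": 2000, "Gender": "F", "Count": 30},
    {"Name": "Mia", "Year": 2000, "Gender": "F", "Count": 10},
    {"Name": "Leo", "Year": 2000, "Gender": "M", "Count": 60},
]
rows = add_ratios(sample)
```

Ratios make names comparable across eras with very different birth volumes, which is presumably why the Trend Explorer offers them alongside raw counts.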
The current public app and packaged deployment dataset are updated through 2024.
The prediction workflow also uses engineered features derived in the modeling notebooks, including:
- Is_Famous
- Gender_Binary
- Rolling_Average_Gender_Ratio_5_Years
- Vowel_Count
- Ends_With_Specified_Letters
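As a rough illustration of how features like these could be derived, here is a minimal sketch in plain Python. The exact definitions live in the modeling notebooks and are not reproduced in this README, so everything below is an assumption: the ending letters in ASSUMED_ENDINGS, the F = 1 encoding for Gender_Binary, and the trailing-window behavior of rolling_mean are all hypothetical choices.

```python
VOWELS = set("aeiou")

# Hypothetical ending list; the project's "specified letters" are not given here.
ASSUMED_ENDINGS = ("a", "n", "y")


def name_features(name, gender):
    """Build structure-based features for one name (assumed definitions)."""
    lowered = name.lower()
    return {
        "Vowel_Count": sum(ch in VOWELS for ch in lowered),
        "Ends_With_Specified_Letters": int(lowered.endswith(ASSUMED_ENDINGS)),
        "Gender_Binary": 1 if gender == "F" else 0,  # assumed encoding
    }


def rolling_mean(values, window=5):
    """Trailing mean over up to `window` values ending at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


features = name_features("Olivia", "F")
smoothed = rolling_mean([1, 2, 3, 4, 5, 6], window=5)
```

Early positions average over fewer than five values rather than being dropped; the notebooks may handle the start of each series differently.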
The current interactive prediction experience is centered on a logistic regression model trained to estimate whether a name is likely to land in the Top 100.
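To show what a logistic regression prediction amounts to, here is a minimal sketch of the scoring step. The coefficients, intercept, and the 0.5 decision threshold are hypothetical placeholders, not the trained model's values, and the sketch skips the preprocessing step that the real app applies before scoring.

```python
import math


def top100_probability(features, weights, intercept):
    """Logistic regression score: sigmoid of intercept + weighted feature sum."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration only.
weights = {"Vowel_Count": 0.2, "Is_Famous": 1.5}
features = {"Vowel_Count": 3, "Is_Famous": 1}

p = top100_probability(features, weights, intercept=-1.0)
is_top100 = p >= 0.5  # assumed decision threshold
```

With these made-up numbers the linear score is 1.1, so the estimated probability lands above 0.5 and the name would be labeled as a likely Top 100 name.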
Current retrained model snapshot:
- Training window: 1880-2024
- Training rows: 145,000 top-1000 yearly records
- Test accuracy: 0.9785
- Test ROC AUC: 0.9953
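For readers unfamiliar with the two test metrics above, here is a small self-contained sketch of what each one measures. The labels and scores are synthetic, and the pairwise AUC loop is written for clarity rather than speed; the project's retraining script presumably uses a library implementation instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels and scores for illustration only.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while ROC AUC only depends on how the scores rank positives against negatives. Because Top 100 names are a small share of top-1000 records, the AUC is the more informative of the two.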
Supporting experiments in the repository include:
- clustering analysis
- feature exploration
- time-series forecasting notebooks
For the portfolio version of the project, the app emphasizes the clearest value proposition:
trend exploration + popularity prediction
- data_source/: raw and supporting data files
- models/: trained model artifacts used by the app
- notebooks/: analysis, feature engineering, and modeling notebooks
- references/: research and background reading
- reports_slides/: course reports and presentations
- src/: supporting project notes
- streamlit/: interactive app code
- conda.yml: project environment specification
- streamlit/app1.py
- reports_slides/An analysis of historical data and trends.pdf
- reports_slides/Machine Learning Models Analysis.pdf
- reports_slides/S3_Capstone_BabyName.pdf
This project includes both a Conda environment file (conda.yml) and a pip-friendly requirements file (requirements.txt).
Typical local workflow:
conda env create -f conda.yml
conda activate capstone
pip install -r requirements.txt
streamlit run streamlit/app1.py
The easiest way to share this project publicly is through Streamlit Community Cloud.
Current production URL:
Deployment settings:
- Repository root contains requirements.txt
- Entrypoint file: streamlit/app1.py
- Recommended Python version on Streamlit Cloud: 3.10
To make deployment practical on GitHub, the app reads a lightweight packaged dataset from:
data_source/app_data.pkl.gz
If that file is not present, the app falls back to the larger local development dataset:
notebooks/data.csv
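The fallback described above can be sketched as a small path-resolution helper. This is an illustration under assumptions, not the app's actual loading code: the function name resolve_data_path and the error behavior when neither file exists are hypothetical.

```python
from pathlib import Path

# Paths as named in this README, relative to the repository root.
PACKAGED = Path("data_source/app_data.pkl.gz")
FALLBACK = Path("notebooks/data.csv")


def resolve_data_path(root):
    """Return the packaged dataset path if present, else the local CSV."""
    root = Path(root)
    for candidate in (PACKAGED, FALLBACK):
        path = root / candidate
        if path.is_file():
            return path
    raise FileNotFoundError(f"No dataset found under {root}")
```

The returned path could then be handed to pandas, for example read_pickle for the .pkl.gz file (which infers gzip compression from the extension by default) or read_csv for the CSV.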
If you want to refresh the app with the newest official SSA release:
- Download the official SSA national data file (names.zip) or the yearly file (for example, yob2024.txt).
- Save the SSA birth totals page (numberUSbirths.html) locally from: https://www.ssa.gov/OACT/babynames/numberUSbirths.html
- Run:
python3 scripts/update_babyname_data.py \
--source /path/to/names.zip \
--birth-totals-html /path/to/numberUSbirths.html \
--year 2024
This updates both:
- notebooks/data.csv
- data_source/app_data.pkl.gz
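For orientation, the SSA yearly files are plain text with one name,sex,count record per line and no header row. The sketch below shows one way such text could be parsed into the columns used by the app. It is not the logic of scripts/update_babyname_data.py, which is not shown in this README; the function name parse_yob_text and the sample lines are hypothetical.

```python
import csv
import io


def parse_yob_text(text, year):
    """Parse SSA-style lines of the form name,sex,count (no header row)."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, sex, count = row
        rows.append({"Name": name, "Year": year, "Gender": sex, "Count": int(count)})
    return rows


# Synthetic lines for illustration only, not real SSA counts.
sample_text = "Ava,F,30\nLeo,M,60\n"
records = parse_yob_text(sample_text, 2024)
```

When working from names.zip, the text for one year could be read with the standard zipfile module (for example, opening the archive and decoding the yob2024.txt member) before passing it to a parser like this.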
If you also want the prediction model to reflect the new year, retrain it after the data refresh:
python3 scripts/retrain_logistic_model.py
That rebuilds:
- notebooks/Featured_Data.csv
- models/preprocessor.pkl
- models/logistic_model.pkl
- Add screenshots or GIFs of the app for portfolio readers
- Make the famous-name feature source more transparent
- Link the app from a future portfolio website
Some larger datasets and model-related resources are also referenced through this Google Drive folder:
Ying Zhou