A personal health analytics project to analyze continuous glucose monitor (CGM) data using Snowflake, Snowflake Cortex AI, and Snowpark for Python.
As a person managing Type 2 Diabetes, I built this project to move beyond simple averages like HbA1c and analyze the real-world impact of glycemic variability.
The core hypothesis is that large, rapid glucose excursions are disproportionately damaging, and that a 3-month average (HbA1c) can be dangerously misleading. This project uses data analysis and machine learning to quantify, classify, and detect these high-variability events from over 96,000 personal CGM readings.
- Data Warehouse: Snowflake
- Data Pipeline: Snowflake SQL (ELT Pattern)
- Core Table: `RAW_READINGS` (contains all Type 0, 1, and 6 records)
- NLP/AI: Snowflake Cortex AI (`COMPLETE` function)
- ML Modeling: Snowpark for Python (with `scikit-learn`)
- Source Control: Git / GitHub
The source data from the LibreView export is not in a clean, queryable format. A multi-step ELT (Extract-Load-Transform) pipeline was built to clean and model the data for analysis.
The raw .csv export contains several data quality issues:
- Junk Header: A non-data title row is present on row 1.
- Misaligned Headers: The true column headers are on row 2.
- Conditional Data: Glucose readings are stored in two different columns (`Historic Glucose mg/dL` vs. `Scan Glucose mg/dL`) depending on the `Record Type` (0 or 1).
- Sparse Data: Non-reading events (`Record Type` 6), which contain all lifestyle notes, are mixed in with valid glucose readings.
- Fragmented Data: Lifestyle notes are spread across 10+ separate columns.
A robust ELT pattern was used to ensure data integrity and auditability.
- Extract/Load: The raw CSV is first uploaded to a Snowflake stage (`@RAW_FILE_STAGE`). It is then loaded as-is into a staging table (`LANDING_LIBRE_RAW`) where all 19 columns are defined as `VARCHAR`. This prevents type-mismatch errors on load and preserves the original source data for lineage. (A sketch of this step follows the list.)
- Transform: A `CREATE TABLE AS SELECT ...` (CTAS) statement transforms the raw data into a clean, analytics-ready table (`RAW_READINGS`).
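A minimal sketch of the load step. The stage and staging-table names are the real ones from the pipeline, but the local file path and the file-format options are illustrative assumptions:

```sql
-- Upload the raw LibreView export to the internal stage
-- (run from SnowSQL or another local client; the path is illustrative)
PUT file:///path/to/libre_export.csv @RAW_FILE_STAGE AUTO_COMPRESS = TRUE;

-- Load every row as-is into the all-VARCHAR staging table.
-- SKIP_HEADER = 0 keeps the junk title row and the true header row,
-- preserving the source file for lineage; both are filtered out
-- later, in the CTAS transform.
COPY INTO LANDING_LIBRE_RAW
FROM @RAW_FILE_STAGE
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 0)
ON_ERROR = 'CONTINUE';  -- tolerate the malformed junk title row
```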
The transformation query (`03_ELT.sql`) performs the following business logic (a condensed sketch follows the list):

- Filtering: Removes the junk header row (`WHERE "Record Type" != 'Record Type'`).
- Preservation: Keeps all critical record types (`"Record Type" IN ('0', '1', '6')`).
- Pivoting (CASE): A `CASE` statement unifies the two conditional glucose columns into a single `GLUCOSE_VALUE` column.
- Merging (COALESCE): A `COALESCE` function merges the 10+ disparate note columns into a single, clean `NOTES` field.
- Type Casting: All `VARCHAR` fields are safely cast to their proper data types (`TIMESTAMP_NTZ`, `NUMBER`).
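A condensed sketch of that CTAS. The `Device Timestamp` column, its timestamp format, and the two extra note columns in the `COALESCE` are assumptions; the real query in `03_ELT.sql` merges all 10+ note columns:

```sql
CREATE OR REPLACE TABLE RAW_READINGS AS
SELECT
    -- Safe casts: TRY_* returns NULL instead of failing on bad values
    TRY_TO_TIMESTAMP_NTZ("Device Timestamp", 'MM-DD-YYYY HH24:MI') AS READING_TS,
    TRY_TO_NUMBER("Record Type")                                   AS RECORD_TYPE,
    -- Pivot: which glucose column is populated depends on Record Type
    TRY_TO_NUMBER(CASE
        WHEN "Record Type" = '0' THEN "Historic Glucose mg/dL"
        WHEN "Record Type" = '1' THEN "Scan Glucose mg/dL"
    END)                                                           AS GLUCOSE_VALUE,
    -- Merge: the first non-null note-like column wins
    COALESCE("Notes", "Carbohydrates (grams)", "Insulin (units)")  AS NOTES
FROM LANDING_LIBRE_RAW
WHERE "Record Type" != 'Record Type'       -- drop the loaded header row
  AND "Record Type" IN ('0', '1', '6');    -- keep readings and note events
```

Deferring all casts to the CTAS means a bad value surfaces as a `NULL` in `RAW_READINGS` while the offending source row stays visible in `LANDING_LIBRE_RAW`, rather than failing the load.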
With the clean `RAW_READINGS` table in place, the analysis pipeline can be executed.
Time-series analysis is performed using SQL window functions (`LAG`, `AVG() OVER ...`) to find spikes, drops, and rolling averages. A custom-engineered feature, `GLYCATION_INDEX`, is created using `POW(GLUCOSE_VALUE - 140, 3)` to non-linearly penalize high-excursion events, moving beyond simple averages to quantify glycemic variability.
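A sketch of that window-function pass, reusing the `READING_TS` / `RECORD_TYPE` names from the CTAS above. The 12-reading rolling window is illustrative; the 140 mg/dL threshold and cubic penalty are as described:

```sql
SELECT
    READING_TS,
    GLUCOSE_VALUE,
    -- Spike/drop detection: change since the previous reading
    GLUCOSE_VALUE
      - LAG(GLUCOSE_VALUE) OVER (ORDER BY READING_TS)              AS DELTA_MG_DL,
    -- Rolling average over the trailing 12 readings
    AVG(GLUCOSE_VALUE) OVER (
        ORDER BY READING_TS
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    )                                                              AS ROLLING_AVG,
    -- Cubic penalty: excursions far above 140 mg/dL dominate the index
    POW(GLUCOSE_VALUE - 140, 3)                                    AS GLYCATION_INDEX
FROM RAW_READINGS
WHERE RECORD_TYPE IN (0, 1)   -- actual glucose readings only
ORDER BY READING_TS;
```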
This step is a key example of solving a "human-in-the-loop" data problem.
- The Problem: The source application provides rigid, structured fields for logging (e.g., `Carbohydrates (grams)`). These fields are cumbersome, forcing the user to become a data-entry clerk for their own life just to record something as simple as "ate salad".
- The Human Workaround: Like any user, I bypassed the rigid fields and logged rich, unstructured notes in the single, free-text `Notes` field.
- The `COALESCE` Solution: The ELT pipeline first merges all 10+ note-like columns into a single `NOTES` field.
- The AI Solution: The `SNOWFLAKE.CORTEX.COMPLETE` function performs NLP classification on this single `NOTES` field, transforming the unstructured "human" text ("ate salad") into the structured categorical data (`EATING`) that the original application failed to capture. This turns a messy workaround into a powerful, clean feature. (See the sketch after this list.)
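A sketch of that classification call. The `mistral-large` model choice and the label set beyond `EATING` are assumptions; the real prompt lives in the project SQL:

```sql
SELECT
    READING_TS,
    NOTES,
    TRIM(SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'Classify this diabetes log note into exactly one category: '
        || 'EATING, EXERCISE, MEDICATION, or OTHER. '
        || 'Reply with the category name only. Note: ' || NOTES
    )) AS EVENT_CATEGORY
FROM RAW_READINGS
WHERE RECORD_TYPE = 6        -- lifestyle-note records
  AND NOTES IS NOT NULL;
```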
A Snowpark Python stored procedure will be used to train a `scikit-learn` Isolation Forest model. This model will use features like `GLUCOSE_VALUE`, rate-of-change, and hour-of-day to find "unknown unknowns": statistically anomalous glucose events that are not simple spikes and require further investigation.
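Since this step is planned rather than built, the following is only a sketch of what such a procedure could look like. The feature table (`READINGS_FEATURES`), its columns, the output table, and the `contamination` rate are all assumptions:

```sql
CREATE OR REPLACE PROCEDURE TRAIN_ANOMALY_MODEL()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python', 'scikit-learn', 'pandas')
HANDLER = 'main'
AS
$$
from sklearn.ensemble import IsolationForest

def main(session):
    # Pull the engineered features into pandas (hypothetical table/columns)
    df = session.table('READINGS_FEATURES').to_pandas()
    features = df[['GLUCOSE_VALUE', 'RATE_OF_CHANGE', 'HOUR_OF_DAY']]

    # fit_predict returns -1 for anomalies, 1 for normal points
    model = IsolationForest(contamination=0.01, random_state=42)
    df['ANOMALY_FLAG'] = model.fit_predict(features)

    # Persist the scored rows back to Snowflake for investigation
    session.write_pandas(
        df, 'ANOMALY_SCORES', auto_create_table=True, overwrite=True
    )
    return f"Flagged {int((df['ANOMALY_FLAG'] == -1).sum())} anomalous readings"
$$;

CALL TRAIN_ANOMALY_MODEL();
```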
The raw CGM data (.csv) for this project is personal, private, and sensitive health information. It is not, and will not be, included in this repository.
The `.gitignore` file is configured to explicitly block this file from ever being committed. This repository contains only the SQL, Python, and configuration code used for the analysis.
This project is licensed under the PolyForm Noncommercial License 1.0.0. See the LICENSE file for details.