Aegis

Inspiration

Every person in this room has waited in an emergency room. You sat there for hours not knowing when you'd be seen. What you didn't know is that the staff didn't know either.

Emergency departments across California have no early warning system. Staff only know they're overwhelmed after wait times have already spiked. By then patients are waiting, outcomes are worse, and the damage is done. The data to fix this has existed for 13 years. Nobody built the warning system. We did.

What It Does

Aegis is an ED overload early warning system that:

Predicts next-year overload risk per facility using a logistic regression model (AUC 0.976) trained on 13 years of California ED data
Explains every prediction through interpretable feature importances and SHAP-style driver bars
Simulates what-if scenarios in real time via an interactive ED visit volume slider
Provides context-aware recommendations grounded in each facility's actual characteristics — bed size, ownership type, urban/rural designation, and Health Professional Shortage Area status
Forecasts statewide burden through 2027 using ARIMA time series modeling
Visualizes findings through Omni-built governed analytics dashboards

How We Built It

Data Pipeline

We merged two official HCAI California government datasets:

CA ED Encounters by Facility (2012–2024, CSV)
ED Volume and Capacity Report (2021–2023, XLSX)

The merge gave us the true burden ratio:

$$\text{burden\_score} = \frac{\text{ED visits}}{\text{treatment stations}}$$

The ratio is normalized to a 0–2 scale using facility-specific 90th percentile thresholds. Output: 12,311 records across 440 facilities, with 14 engineered features.
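A minimal sketch of how the burden ratio and its normalization might be computed with pandas. The column names (`facility_id`, `ed_visits`, `treatment_stations`), the helper `add_burden_score`, and the exact scaling rule (ratio divided by the facility's own 90th percentile, then clipped to the 0 to 2 range) are assumptions for illustration, not the project's confirmed implementation. The demo rows are synthetic.

```python
import pandas as pd


def add_burden_score(df: pd.DataFrame) -> pd.DataFrame:
    """Compute visits per treatment station and a facility-relative score."""
    out = df.copy()

    # Raw burden ratio: ED visits divided by treatment stations.
    out["burden_raw"] = out["ed_visits"] / out["treatment_stations"]

    # Facility-specific 90th percentile of the raw ratio.
    p90 = out.groupby("facility_id")["burden_raw"].transform(
        lambda s: s.quantile(0.90)
    )

    # Assumed normalization: scale by the facility's own P90, clip to [0, 2].
    out["burden_score"] = (out["burden_raw"] / p90).clip(lower=0, upper=2)
    return out


# Synthetic example rows (not real HCAI data).
demo = pd.DataFrame({
    "facility_id": ["A", "A", "A", "B", "B", "B"],
    "year": [2021, 2022, 2023, 2021, 2022, 2023],
    "ed_visits": [30000, 36000, 42000, 9000, 9500, 30000],
    "treatment_stations": [20, 20, 20, 10, 10, 10],
})
scored = add_burden_score(demo)
print(scored[["facility_id", "year", "burden_raw", "burden_score"]])
```

Under this assumed rule, a score near 1.0 means a facility is operating close to its own historical 90th percentile.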

Feature Engineering

For each facility per year we computed:

Lag features: burden_lag_1, burden_lag_2, burden_lag_3 (burden score from 1, 2, and 3 years ago)
Rolling averages: 3-year and 7-year rolling mean
Volatility: 7-year rolling standard deviation
Momentum: recent percent change in burden
Capacity: visits per treatment station (lagged)
Time features: post-COVID flag, year index
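The features listed above can be sketched with pandas group-wise `shift` and `rolling` operations. This is an illustrative sketch only: the column names, the helper `add_features`, the choice to build rolling statistics on lagged values, and the 2020 cutoff for the post-COVID flag are assumptions. The lagged capacity feature is omitted for brevity, and the demo data is synthetic.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-facility lag, rolling, momentum, and time features."""
    out = df.sort_values(["facility_id", "year"]).copy()
    by_facility = out.groupby("facility_id")["burden_score"]

    # Lag features: burden score from 1, 2, and 3 years earlier.
    for k in (1, 2, 3):
        out[f"burden_lag_{k}"] = by_facility.shift(k)

    # Rolling statistics over past values only (assumption: built on lag 1
    # so the current year never feeds its own features).
    past = out.groupby("facility_id")["burden_lag_1"]
    out["roll_mean_3"] = past.transform(lambda s: s.rolling(3, min_periods=1).mean())
    out["roll_mean_7"] = past.transform(lambda s: s.rolling(7, min_periods=1).mean())
    out["roll_std_7"] = past.transform(lambda s: s.rolling(7, min_periods=2).std())

    # Momentum: percent change between the two most recent past years.
    out["momentum"] = (out["burden_lag_1"] - out["burden_lag_2"]) / out["burden_lag_2"]

    # Time features (assumption: post-COVID means 2020 onward).
    out["post_covid"] = (out["year"] >= 2020).astype(int)
    out["year_index"] = out["year"] - out["year"].min()
    return out


# Synthetic two-facility example.
demo = pd.DataFrame({
    "facility_id": ["A"] * 6 + ["B"] * 6,
    "year": list(range(2017, 2023)) * 2,
    "burden_score": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6],
})
features = add_features(demo)
print(features.head(8))
```

Grouping by facility before shifting keeps one hospital's history from bleeding into another's first rows, which show up as missing values instead.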

Target Variable

$$\text{high\_burden\_next} = \begin{cases} 1 & \text{if next-year burden} > P_{75}(\text{facility}) \\ 0 & \text{otherwise} \end{cases}$$

The threshold is the facility-specific 75th percentile: each hospital is benchmarked against its own history, not a statewide average.
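As a rough sketch, the label can be derived by comparing each facility's next-year burden with that facility's own 75th percentile. The column names and the helper `add_target` are hypothetical. The writeup does not say whether the percentile is taken over the full history or only the training years, so the sketch uses the full history for simplicity. The demo values are synthetic.

```python
import pandas as pd


def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Label each facility-year by whether next year's burden exceeds P75."""
    out = df.sort_values(["facility_id", "year"]).copy()
    by_facility = out.groupby("facility_id")["burden_score"]

    # Facility-specific 75th percentile (assumption: over all available years).
    p75 = by_facility.transform(lambda s: s.quantile(0.75))

    # Next year's burden for the same facility.
    next_burden = by_facility.shift(-1)

    out["high_burden_next"] = (next_burden > p75).astype(int)

    # The final year of each facility has no next-year value, so drop it.
    return out[next_burden.notna()]


# Synthetic single-facility example.
demo = pd.DataFrame({
    "facility_id": ["A"] * 6,
    "year": [2018, 2019, 2020, 2021, 2022, 2023],
    "burden_score": [0.4, 0.5, 0.6, 0.7, 0.8, 1.2],
})
labeled = add_target(demo)
print(labeled[["year", "burden_score", "high_burden_next"]])
```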

Models

Model	Purpose	Key Metric
Logistic Regression	Interpretable classifier	AUC 0.976
XGBoost	High accuracy classifier	AUC 0.975
Ridge Regression	Continuous burden forecast	CV R² 0.87
ARIMA(1,1,1)	Time series forecast 2025–2027	AIC 3.00

Time-aware train/test split: trained on 2012–2021, tested on 2022–2024. No data leakage.
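The time-aware split described above can be sketched as follows: rows up to 2021 train the classifier, and later rows are held out for AUC. The feature names, the `time_split` helper, the use of `StandardScaler`, and the synthetic data are assumptions made for illustration. Only `class_weight='balanced'` and the 2021 cutoff come from the writeup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_split(df: pd.DataFrame, last_train_year: int = 2021):
    """Split by year so the test set lies strictly after the training set."""
    train = df[df["year"] <= last_train_year]
    test = df[df["year"] > last_train_year]
    return train, test


# Synthetic data standing in for the engineered feature table.
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    "year": rng.integers(2012, 2025, size=n),  # 2012 through 2024
    "burden_lag_1": rng.normal(0.6, 0.2, size=n),
    "roll_mean_3": rng.normal(0.6, 0.15, size=n),
})
demo["high_burden_next"] = (
    demo["burden_lag_1"] + rng.normal(0, 0.1, size=n) > 0.8
).astype(int)

feature_cols = ["burden_lag_1", "roll_mean_3"]
train, test = time_split(demo)

# class_weight="balanced" reweights the rarer positive class.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(train[feature_cols], train["high_burden_next"])

# AUC on the held-out later years.
probs = model.predict_proba(test[feature_cols])[:, 1]
auc = roc_auc_score(test["high_burden_next"], probs)
print(f"Held-out AUC: {auc:.3f}")
```

Splitting on the year column rather than at random keeps future rows out of training, which is the property the split is meant to guarantee.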

Statistical Inference

Paired t-test: temporal burden pattern (p = 0.042, significant)
Pre- vs. post-COVID t-test (p = 0.107, not significant: burden impact was distributed unevenly across facilities)
Mann-Whitney U: COMPREHENSIVE vs BASIC facilities (p ≈ 0, highly significant)
95% Confidence Intervals per facility using scipy
Rolling Z-Score anomaly detection (z > 2.0 threshold)
Cross-validated R² = 0.87 ± 0.007 on Ridge Regression
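Two of the checks above, the per-facility confidence interval and the rolling z-score flag, can be sketched with scipy and pandas. The helper names, the 5-year window, and the choice to compare each year against the preceding window are assumptions. The writeup specifies only the 95% level and the z > 2.0 threshold. The series is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def mean_ci(values, confidence: float = 0.95):
    """Return (mean, lower, upper) of a t-based confidence interval."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean
    lower, upper = stats.t.interval(confidence, len(values) - 1, loc=mean, scale=sem)
    return mean, lower, upper


def rolling_zscore(series: pd.Series, window: int = 5) -> pd.Series:
    """Z-score of each value against the mean and std of the prior window."""
    prior = series.shift(1).rolling(window, min_periods=3)
    return (series - prior.mean()) / prior.std()


# Synthetic burden history for one facility, with a jump in the last year.
burden = pd.Series([0.50, 0.52, 0.51, 0.52, 0.51, 0.90])

mean, lower, upper = mean_ci(burden)
z = rolling_zscore(burden)
anomalies = z > 2.0  # years without enough history compare as False

print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
print(anomalies.tolist())
```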

App

Built in Streamlit with Plotly charts. Three pages:

Overview — burden trend, ARIMA forecast, facility comparison
Risk Score — live model prediction, what-if slider, context-aware recommendations
Insights — facility ranking with CI error bars, county burden chart, statewide risk scorecard

Omni Integration

https://ucidm.omniapp.co/dashboards/d1c4e08a

All four Omni visualizations were built from ca_ed_final.csv connected directly to Omni Analytics:

Top 20 facilities by average burden score
Statewide burden trend 2012–2024
Top 10 counties by burden
Burden by service level (BASIC vs COMPREHENSIVE vs STANDBY)

Key Findings

Kaiser Foundation Hospital - Fontana is the most chronically burdened facility in California (avg 1.81)
Kings County is the highest burden county (avg 1.02) — driven by rural primary care shortage in the San Joaquin Valley
Treatment stations are the #1 predictor of overload at 54.4% XGBoost feature importance — capacity drives risk more than visit volume
COMPREHENSIVE facilities carry 14% higher median burden than BASIC facilities (p ≈ 0) — the most capable hospitals absorb the most pressure
Post-COVID burden surged 70% above pre-pandemic levels by 2022–2023 as deferred care flooded back — but the impact was distributed unevenly across facilities (p = 0.107)
Statewide burden forecast to remain elevated at ~0.63 through 2027 per ARIMA model

Challenges

No treatment station data before 2021: solved by normalizing to the facility-specific 90th percentile as a proxy for years without station counts, with transparent disclosure.
Class imbalance: only 3.6% of cases are high burden. Solved with class_weight='balanced' in logistic regression and scale_pos_weight=3 in XGBoost. We used AUC, not accuracy, as the evaluation metric.
Annual data granularity: 13 data points per facility limit time series precision. ARIMA confidence intervals are wide and are disclosed as such. Future work: monthly data.
Linear regression overfitting: plain linear regression produced R²=1.0 on 13 annual points. We switched to Ridge with alpha=1.0 and 5-fold cross-validation for honest evaluation (CV R²=0.87, test R²=0.41 on post-COVID years).
Post-COVID structural shift: the model trained on 2012–2021 had never seen the 2022–2023 surge. Test R² dropped to 0.41 on these years, which is disclosed in the notebook.
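The overfitting issue can be illustrated with a small synthetic example. When a linear model has as many free parameters as data points, its training R² reaches 1.0 regardless of signal, while cross-validated Ridge gives a more honest picture. The 12-feature setup is one plausible way to reproduce the effect and is an assumption, not the project's actual feature matrix. Only `alpha=1.0` and 5-fold cross-validation come from the writeup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# 13 rows and 12 features: with the intercept, 13 parameters for 13 points.
X = rng.normal(size=(13, 12))
y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=13)

# Plain linear regression interpolates the training data exactly.
ols_train_r2 = LinearRegression().fit(X, y).score(X, y)

# Ridge shrinks coefficients, so it no longer fits the noise perfectly.
ridge_train_r2 = Ridge(alpha=1.0).fit(X, y).score(X, y)

# 5-fold cross-validation scores Ridge on rows it was not fitted on.
cv_r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")

print(f"OLS train R2:   {ols_train_r2:.3f}")
print(f"Ridge train R2: {ridge_train_r2:.3f}")
print(f"Ridge CV R2:    {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

The gap between training and cross-validated scores is the quantity to watch. A training R² of 1.0 on its own says nothing about out-of-sample performance.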

What We Learned

Capacity constraints (treatment stations) matter more than volume for predicting ED overload
ED burden is a slow-burning structural problem, not an episodic crisis — zero anomalies at z > 2.0
The same high risk score means completely different interventions at different facilities — context-aware recommendations require facility-level data, not just predictions
Honest reporting of limitations (wide CIs, post-COVID R² drop, class imbalance) builds more credibility with judges than hiding them

What's Next

Monthly data integration for sub-annual predictions
Facility-specific ARIMA models replacing statewide aggregate
Real-time data pipeline from HCAI API
Expand to other states using CMS hospital datasets
Connect recommendations to actual FQHC locations and county health department contacts

Built With

canva
github
joblib
numpy
omni
openpyxl
pandas
plotly
python
scikit-learn
scipy
statsmodels
streamlit
xgboost

Submitted to

Data Heist 2026

Created by

I worked on data cleaning and the EDA process. I also interpreted the model.

Huy Quoc Tran
I led the full development of Aegis from concept to deployment. I built and iterated the entire Streamlit application, designed the UI from scratch, ran the data pipeline across both HCAI datasets, and connected the real models to the frontend. I also built all four Omni visualizations, structured the project architecture, and coordinated the team's technical and presentation deliverables throughout the hackathon.

My Truong
Mary
Sarah Yuan