Inspiration

Every person in this room has waited in an emergency room. You sat there for hours not knowing when you'd be seen. What you didn't know is that the staff didn't know either.

Emergency departments across California have no early warning system. Staff only know they're overwhelmed after wait times have already spiked. By then patients are waiting, outcomes are worse, and the damage is done. The data to fix this has existed for 13 years. Nobody built the warning system. We did.

What It Does

Aegis is an ED overload early warning system that:

  • Predicts next-year overload risk per facility using a logistic regression model (AUC 0.976) trained on 13 years of California ED data
  • Explains every prediction through interpretable feature importances and SHAP-style driver bars
  • Simulates what-if scenarios in real time via an interactive ED visit volume slider
  • Provides context-aware recommendations grounded in each facility's actual characteristics — bed size, ownership type, urban/rural designation, and Health Professional Shortage Area status
  • Forecasts statewide burden through 2027 using ARIMA time series modeling
  • Visualizes findings through Omni-built governed analytics dashboards

How We Built It

Data Pipeline

We merged two official HCAI California government datasets:

  • CA ED Encounters by Facility (2012–2024, CSV)
  • ED Volume and Capacity Report (2021–2023, XLSX)

The merge gave us the true burden ratio:

$$\text{burden\_score} = \frac{\text{ED visits}}{\text{treatment stations}}$$

The score is normalized to a 0–2 scale using facility-specific 90th percentile thresholds. Output: 12,311 records across 440 facilities, with 14 engineered features.
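A minimal sketch of the burden calculation on toy data. The column names (facility_id, year, ed_visits, treatment_stations) are hypothetical, and the exact normalization is an assumption: here the raw ratio is divided by the facility's own 90th percentile and clipped to the 0–2 range.

```python
import pandas as pd

# Toy data. Column names are hypothetical, not the real HCAI schema.
df = pd.DataFrame({
    "facility_id": ["A"] * 3 + ["B"] * 3,
    "year": [2021, 2022, 2023] * 2,
    "ed_visits": [40_000, 44_000, 50_000, 9_000, 9_500, 30_000],
    "treatment_stations": [20, 20, 20, 6, 6, 6],
})

# Raw burden ratio: ED visits per treatment station.
df["raw_burden"] = df["ed_visits"] / df["treatment_stations"]

# Facility-specific 90th percentile of the raw ratio, broadcast to each row.
p90 = df.groupby("facility_id")["raw_burden"].transform(lambda s: s.quantile(0.90))

# Assumed normalization: ratio to the facility's own P90, clipped to [0, 2].
df["burden_score"] = (df["raw_burden"] / p90).clip(lower=0.0, upper=2.0)
```

Under this assumption a score of 1.0 means a facility is running at its own 90th percentile load, so scores are comparable across hospitals of very different sizes.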

Feature Engineering

For each facility per year we computed:

  • Lag features: burden_lag_1, burden_lag_2, burden_lag_3 (burden score from 1, 2, and 3 years earlier)
  • Rolling averages: 3-year and 7-year rolling mean
  • Volatility: 7-year rolling standard deviation
  • Momentum: recent percent change in burden
  • Capacity: visits per treatment station (lagged)
  • Time features: post-COVID flag, year index
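The features listed above can be sketched with pandas group operations. This is an illustration on toy data, not our exact pipeline: the rolling statistics are computed on the lag-1 series so the current year never enters its own features, the momentum definition is one plausible choice, and the 2020 cutoff for the post-COVID flag is an assumption.

```python
import pandas as pd

# Toy single-facility history. Column names are hypothetical.
df = pd.DataFrame({
    "facility_id": ["A"] * 8,
    "year": list(range(2016, 2024)),
    "burden_score": [0.50, 0.52, 0.55, 0.53, 0.40, 0.60, 0.90, 0.95],
}).sort_values(["facility_id", "year"]).reset_index(drop=True)

# Lag features: burden score from 1, 2, and 3 years earlier, per facility.
by_facility = df.groupby("facility_id")["burden_score"]
for k in (1, 2, 3):
    df[f"burden_lag_{k}"] = by_facility.shift(k)

# Rolling statistics over the lagged series (assumption: avoids using the current year).
lag1 = df.groupby("facility_id")["burden_lag_1"]
df["roll_mean_3"] = lag1.transform(lambda s: s.rolling(3, min_periods=1).mean())
df["roll_mean_7"] = lag1.transform(lambda s: s.rolling(7, min_periods=1).mean())
df["roll_std_7"] = lag1.transform(lambda s: s.rolling(7, min_periods=2).std())

# Momentum: percent change between the two most recent known years.
df["momentum"] = (df["burden_lag_1"] - df["burden_lag_2"]) / df["burden_lag_2"]

# Time features. The 2020 cutoff is an assumed definition of "post-COVID".
df["post_covid"] = (df["year"] >= 2020).astype(int)
df["year_index"] = df["year"] - df["year"].min()
```

Early years have missing lags by construction. Those rows would need to be dropped or imputed before model fitting.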

Target Variable

$$\text{high\_burden\_next} = \begin{cases} 1 & \text{if next-year burden} > P_{75}(\text{facility}) \\ 0 & \text{otherwise} \end{cases}$$

The threshold is each facility's own 75th percentile, so each hospital is benchmarked against its own history, not a statewide average.
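A short sketch of how the label could be derived, assuming a long-format table with hypothetical columns facility_id, year, and burden_score. The last year of each facility has no following year and is dropped.

```python
import pandas as pd

# Toy data for two facilities. Values are illustrative only.
df = pd.DataFrame({
    "facility_id": ["A"] * 6 + ["B"] * 6,
    "year": list(range(2018, 2024)) * 2,
    "burden_score": [0.50, 0.52, 0.40, 0.60, 0.90, 0.95,
                     1.20, 1.10, 1.25, 1.30, 1.28, 1.60],
}).sort_values(["facility_id", "year"]).reset_index(drop=True)

by_facility = df.groupby("facility_id")["burden_score"]

# Facility-specific 75th percentile of the burden score.
p75 = by_facility.transform(lambda s: s.quantile(0.75))

# Next year's burden for the same facility (NaN for the final year).
next_burden = by_facility.shift(-1)

# Label is 1 when next year's burden exceeds the facility's own P75.
df["high_burden_next"] = (next_burden > p75).astype(int)

# Rows without a following year cannot be labeled, so drop them.
df = df[next_burden.notna()].copy()
```

Because the threshold is per facility, facility B's 1.25 is labeled 0 while facility A's 0.90 is labeled 1, even though B's absolute burden is higher.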

Models

| Model | Purpose | Key Metric |
| --- | --- | --- |
| Logistic Regression | Interpretable classifier | AUC 0.976 |
| XGBoost | High accuracy classifier | AUC 0.975 |
| Ridge Regression | Continuous burden forecast | CV R² 0.87 |
| ARIMA(1,1,1) | Time series forecast 2025–2027 | AIC 3.00 |

Time-aware train/test split: trained on 2012–2021, tested on 2022–2024. No data leakage.
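The split and the class-weighted classifier can be sketched as follows. The data here is synthetic and the two feature columns are stand-ins, so the resulting AUC says nothing about the reported 0.976. The scaling step in the pipeline is also an assumption, not a confirmed part of our model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic facility-year rows (illustrative only).
rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "year": rng.integers(2012, 2025, size=n),  # years 2012..2024
    "burden_lag_1": rng.normal(0.6, 0.2, size=n),
    "roll_mean_3": rng.normal(0.6, 0.15, size=n),
})
# Synthetic, imbalanced label driven mostly by last year's burden.
signal = 6.0 * (df["burden_lag_1"] - 0.9) + rng.normal(0.0, 1.0, size=n)
df["high_burden_next"] = (signal > 0).astype(int)

features = ["burden_lag_1", "roll_mean_3"]

# Time-aware split: fit on the past, evaluate on later years only.
train = df[df["year"] <= 2021]
test = df[df["year"] >= 2022]

# class_weight="balanced" reweights the rare positive class.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(train[features], train["high_burden_next"])

# AUC is threshold-free, so it stays informative under class imbalance.
proba = model.predict_proba(test[features])[:, 1]
auc = roc_auc_score(test["high_burden_next"], proba)
```

Splitting by year rather than at random keeps every test row strictly later than every training row, which is what rules out temporal leakage.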

Statistical Inference

  • Paired t-test: temporal burden pattern (p = 0.042, significant)
  • Pre- vs post-COVID t-test (p = 0.107, not significant; burden impact was distributed unevenly across facilities)
  • Mann-Whitney U: COMPREHENSIVE vs BASIC facilities (p ≈ 0, highly significant)
  • 95% Confidence Intervals per facility using scipy
  • Rolling Z-Score anomaly detection (z > 2.0 threshold)
  • Cross-validated R² = 0.87 ± 0.007 on Ridge Regression
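As an illustration of the rolling z-score check, the sketch below compares each year against the mean and standard deviation of the preceding years. The 5-year window and the use of prior years only are assumptions, and the toy series contains an injected spike purely to show the mechanism.

```python
import pandas as pd

# Toy burden history for one facility, with an artificial spike in 2023.
burden = pd.Series(
    [0.50, 0.52, 0.51, 0.53, 0.52, 0.53, 0.52, 0.95],
    index=range(2016, 2024),
    name="burden_score",
)

window = 5  # assumed window length, in years

# Baseline statistics from the previous `window` years only.
prior = burden.shift(1)
baseline_mean = prior.rolling(window).mean()
baseline_std = prior.rolling(window).std()

# Rolling z-score: how far the current year sits from its recent baseline.
z = (burden - baseline_mean) / baseline_std

# Flag years above the z > 2.0 threshold.
anomalies = z[z > 2.0]
```

On this toy series only the 2023 spike is flagged. Years without a full prior window have no z-score and are never flagged.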

App

Built in Streamlit with Plotly charts. Three pages:

  1. Overview — burden trend, ARIMA forecast, facility comparison
  2. Risk Score — live model prediction, what-if slider, context-aware recommendations
  3. Insights — facility ranking with CI error bars, county burden chart, statewide risk scorecard

Omni Integration

https://ucidm.omniapp.co/dashboards/d1c4e08a

All four Omni visualizations were built from ca_ed_final.csv connected directly to Omni Analytics:

  • Top 20 facilities by average burden score
  • Statewide burden trend 2012–2024
  • Top 10 counties by burden
  • Burden by service level (BASIC vs COMPREHENSIVE vs STANDBY)

Key Findings

  1. Kaiser Foundation Hospital - Fontana is the most chronically burdened facility in California (avg 1.81)
  2. Kings County is the highest burden county (avg 1.02) — driven by rural primary care shortage in the San Joaquin Valley
  3. Treatment stations are the #1 predictor of overload at 54.4% XGBoost feature importance — capacity drives risk more than visit volume
  4. COMPREHENSIVE facilities carry 14% higher median burden than BASIC facilities (p ≈ 0) — the most capable hospitals absorb the most pressure
  5. Post-COVID burden surged 70% above pre-pandemic levels by 2022–2023 as deferred care flooded back — but the impact was distributed unevenly across facilities (p = 0.107)
  6. Statewide burden forecast to remain elevated at ~0.63 through 2027 per ARIMA model

Challenges

  • No treatment station data before 2021: solved by normalizing to the facility-specific 90th percentile as a proxy for years without station counts, with transparent disclosure.
  • Class imbalance: only 3.6% of cases are high burden. Solved with class_weight='balanced' in logistic regression and scale_pos_weight=3 in XGBoost. We used AUC rather than accuracy as the evaluation metric.
  • Annual data granularity: 13 data points per facility limit time series precision. ARIMA confidence intervals are wide and disclosed honestly. Future work: monthly data.
  • Linear regression overfitting: plain linear regression produced R²=1.0 on 13 annual points. We switched to Ridge with alpha=1.0 and 5-fold cross-validation for honest evaluation (CV R²=0.87, test R²=0.41 on post-COVID years).
  • Post-COVID structural shift: the model trained on 2012–2021 had never seen the 2022–2023 surge. Test R² dropped to 0.41 on these years, which is disclosed transparently in the notebook.
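The overfitting issue can be reproduced on synthetic data. The sketch assumes the perfect fit came from having at least as many features as data points (14 features on 13 rows), which is one plausible reading rather than a confirmed diagnosis. The second part shows Ridge with alpha=1.0 scored by 5-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Part 1: 13 rows, 14 features, and a pure-noise target.
# Plain least squares can interpolate the training data exactly.
X_small = rng.normal(size=(13, 14))
y_small = rng.normal(size=13)
ols_r2 = LinearRegression().fit(X_small, y_small).score(X_small, y_small)
ridge_r2 = Ridge(alpha=1.0).fit(X_small, y_small).score(X_small, y_small)

# Part 2: out-of-fold evaluation on a larger synthetic dataset.
X = rng.normal(size=(500, 4))
true_coef = np.array([0.5, 0.3, 0.0, -0.2])  # made-up coefficients
y = X @ true_coef + rng.normal(scale=0.2, size=500)

# Each fold is scored on rows the model did not see during fitting.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
cv_mean, cv_std = scores.mean(), scores.std()
```

A training R² near 1.0 on noise is the warning sign. Reporting the cross-validated mean and spread, plus a separate score on held-out later years, gives a more honest picture.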

What We Learned

  • Capacity constraints (treatment stations) matter more than volume for predicting ED overload
  • ED burden is a slow-burning structural problem, not an episodic crisis — zero anomalies at z > 2.0
  • The same high risk score means completely different interventions at different facilities — context-aware recommendations require facility-level data, not just predictions
  • Honest reporting of limitations (wide CIs, post-COVID R² drop, class imbalance) builds more credibility with judges than hiding them

What's Next

  • Monthly data integration for sub-annual predictions
  • Facility-specific ARIMA models replacing statewide aggregate
  • Real-time data pipeline from HCAI API
  • Expand to other states using CMS hospital datasets
  • Connect recommendations to actual FQHC locations and county health department contacts
