Inspiration

California faces one of the most severe homelessness crises in the U.S., straining health systems, social support structures, and community well-being. The relevant public data, however, is fragmented across multiple sources, so we set out to bring them together to identify patterns, uncover service gaps, and build predictive models that could inform future policy. Our goal was to make data-driven insights more accessible to public stakeholders, researchers, and policymakers alike.

What it does

Data Doomsday_Random is a data analytics pipeline that integrates homelessness, healthcare, and system performance data from across California to:

Understand demographic trends in homelessness by county, age, gender, and race

Analyze healthcare service usage among people experiencing homelessness

Visualize performance of housing systems over time

Engineer features to reflect access burden and capacity

Cluster counties based on shared characteristics

Build an interpretable model to explain variation in homelessness

Forecast 2024 trends and recommend counties for targeted intervention
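
The first step above, summarizing demographic trends by county, comes down to a pandas groupby-and-pivot. A minimal sketch with made-up numbers and assumed column names (county, year, age_group, count — the real schema may differ):

```python
import pandas as pd

# Hypothetical point-in-time-count-style records; values are illustrative only.
pit = pd.DataFrame({
    "county":    ["Alameda", "Alameda", "Fresno", "Fresno"],
    "year":      [2022, 2023, 2022, 2023],
    "age_group": ["18-24", "18-24", "18-24", "18-24"],
    "count":     [410, 455, 230, 260],
})

# Aggregate to county-year totals, then pivot into a quick trend table
# with one row per county and one column per year.
trend = (
    pit.groupby(["county", "year"], as_index=False)["count"].sum()
       .pivot(index="county", columns="year", values="count")
)
print(trend)
```

The same pattern extends to age, gender, and race by adding those columns to the groupby keys.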

How we built it

We used Python and Google Colab for all analysis and modeling. Key libraries and techniques include:

Pandas for data cleaning and transformation

Matplotlib & Seaborn for visualizations

Scikit-learn for clustering (KMeans) and linear regression modeling

Feature engineering techniques such as shelter capacity ratios and year-over-year change

Trend analysis from 2020 to 2023, with extrapolated forecasts for 2024
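
The feature engineering and trend extrapolation described above can be sketched roughly as follows. Column names (homeless_count, shelter_beds) and all figures are assumptions for illustration; the forecast is a naive per-county linear fit extrapolated one year ahead, not the project's exact method:

```python
import numpy as np
import pandas as pd

# Illustrative county-year panel with made-up numbers.
df = pd.DataFrame({
    "county":         ["Kern"] * 4,
    "year":           [2020, 2021, 2022, 2023],
    "homeless_count": [1500, 1620, 1710, 1850],
    "shelter_beds":   [900, 920, 950, 1000],
})

# Engineered features: shelter capacity ratio and year-over-year change.
df["capacity_ratio"] = df["shelter_beds"] / df["homeless_count"]
df["yoy_change"] = df.groupby("county")["homeless_count"].pct_change()

def forecast_2024(group):
    # Fit a straight line to 2020-2023 counts and read off its 2024 value.
    slope, intercept = np.polyfit(group["year"], group["homeless_count"], 1)
    return slope * 2024 + intercept

forecasts = df.groupby("county")[["year", "homeless_count"]].apply(forecast_2024)
print(forecasts)
```

With only four annual data points per county, a linear extrapolation is about as much model complexity as the data supports.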

All datasets were publicly available through California's Open Data portals and cleaned for consistency across time and geography.
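
The KMeans county clustering mentioned above might look like this in scikit-learn. Feature names and values are invented for the sketch; standardizing first matters because the features are on very different scales:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative county-level features (names are assumptions, not the real schema).
features = pd.DataFrame({
    "county":          ["Alameda", "Fresno", "Kern",
                        "Los Angeles", "Sacramento", "San Diego"],
    "capacity_ratio":  [0.55, 0.40, 0.52, 0.30, 0.48, 0.45],
    "yoy_change":      [0.08, 0.12, 0.09, 0.15, 0.07, 0.10],
    "er_visits_per_k": [120, 210, 150, 260, 140, 170],
}).set_index("county")

# Standardize so no single feature dominates the Euclidean distance.
X = StandardScaler().fit_transform(features)

# Group counties into k=2 clusters of similar service profiles.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
features["cluster"] = km.fit_predict(X)
print(features["cluster"])
```

In practice the number of clusters would be chosen with an elbow plot or silhouette scores rather than fixed at 2.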

Challenges we ran into

Merging multiple datasets with inconsistent geographic labels and missing years

Handling missing values while preserving meaningful trends

Balancing interpretability and accuracy in our model

Working around limited documentation and metadata for some datasets, which slowed initial exploration
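
The label-mismatch problem above is typical: one source writes "Los Angeles County" while another writes "LOS ANGELES". A minimal sketch of the normalize-then-merge pattern, with made-up sources and a merge indicator to surface rows that fail to match:

```python
import pandas as pd

# Two illustrative sources with inconsistent county labels.
hud = pd.DataFrame({"county": ["Los Angeles County", "Fresno County"],
                    "homeless_count": [75000, 4200]})
health = pd.DataFrame({"County Name": ["LOS ANGELES", "FRESNO"],
                       "er_visits": [26000, 1900]})

def normalize(name):
    # Strip the "County" suffix, collapse whitespace, and title-case.
    return " ".join(name.replace("County", "").split()).title()

hud["county"] = hud["county"].map(normalize)
health["county"] = health["County Name"].map(normalize)

# Outer merge with an indicator column so unmatched rows are easy to audit.
merged = hud.merge(health[["county", "er_visits"]], on="county",
                   how="outer", indicator=True)
print(merged)
```

Rows where the `_merge` column is not "both" point straight at the geographic labels that still need cleaning.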

Accomplishments that we're proud of

Successfully created a unified, analysis-ready dataset from three different sources

Visualized key service gaps and demographic trends clearly

Built a baseline regression model and identified high-residual counties that might need special attention

Implemented clustering to find groups of counties with similar service challenges
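
The residual analysis above works by fitting a baseline regression and flagging counties the model explains poorly. A sketch with invented features and figures (the real predictors and targets differ); since OLS with an intercept forces residuals to sum to zero, large individual residuals stand out clearly:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative county features and outcomes; names and values are made up.
df = pd.DataFrame({
    "county":           ["Alameda", "Fresno", "Kern", "Los Angeles", "Sacramento"],
    "capacity_ratio":   [0.55, 0.40, 0.52, 0.30, 0.48],
    "unsheltered_rate": [0.60, 0.70, 0.62, 0.75, 0.64],
    "homeless_per_10k": [95, 60, 55, 180, 70],
}).set_index("county")

X = df[["capacity_ratio", "unsheltered_rate"]]
y = df["homeless_per_10k"]

model = LinearRegression().fit(X, y)

# Residual = actual minus predicted; counties with the largest residuals
# deviate most from what their features alone would predict.
df["residual"] = y - model.predict(X)
flagged = df["residual"].abs().nlargest(2).index.tolist()
print(flagged)
```

The flagged counties are candidates for closer qualitative investigation, not automatic conclusions.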

What we learned

How to work with real-world public data that is messy, inconsistent, and incomplete

The importance of feature engineering in making sense of complex social systems

How to use clustering and residual analysis to guide deeper investigation

That even simple models can offer actionable insights when paired with thoughtful data preparation

What's next for Data Doomsday_Random

Add geospatial visualizations to better map service gaps across regions

Integrate weather, eviction, and unemployment data for deeper predictive power

Use more advanced models such as XGBoost or time-series forecasting (ARIMA/LSTM) for policy simulation

Build an interactive dashboard using Streamlit or Tableau to make insights accessible to non-technical audiences

Built With

Python, Google Colab, Pandas, Matplotlib, Seaborn, Scikit-learn
