Inspiration
California faces one of the most severe homelessness crises in the U.S., impacting health systems, social support structures, and community well-being. With multiple fragmented data sources available, we wanted to bring them together to identify patterns, uncover service gaps, and build predictive models that could inform future policy decisions. Our goal was to make data-driven insights more accessible to public stakeholders, researchers, and policymakers alike.
What it does
Data Doomsday_Random is a data analytics pipeline that integrates homelessness, healthcare, and system performance data from across California to:
Understand demographic trends in homelessness by county, age, gender, and race
Analyze healthcare service usage by homeless individuals
Visualize performance of housing systems over time
Engineer features to reflect access burden and capacity
Cluster counties based on shared characteristics
Build an interpretable model to explain variation in homelessness
Forecast 2024 trends and recommend counties for targeted intervention
How we built it
We used Python and Google Colab for all analysis and modeling. Key libraries and techniques include:
Pandas for data cleaning and transformation
Matplotlib & Seaborn for visualizations
Scikit-learn for clustering (KMeans) and linear regression modeling
Feature engineering techniques such as shelter capacity ratios and year-over-year change
Trend analysis from 2020 to 2023, with extrapolated forecasts for 2024
All datasets were publicly available through California's Open Data portals and cleaned for consistency across time and geography.
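The feature-engineering step can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names and values below are hypothetical placeholders for the cleaned county-year data.

```python
import pandas as pd

# Hypothetical county-year records; names and numbers are illustrative only
df = pd.DataFrame({
    "county": ["Alameda", "Alameda", "Fresno", "Fresno"],
    "year": [2022, 2023, 2022, 2023],
    "homeless_count": [9700, 9450, 4200, 4600],
    "shelter_beds": [3100, 3300, 900, 950],
})

# Shelter capacity ratio: beds available per homeless individual
df["capacity_ratio"] = df["shelter_beds"] / df["homeless_count"]

# Year-over-year change in homeless count, computed within each county
df = df.sort_values(["county", "year"])
df["yoy_change"] = df.groupby("county")["homeless_count"].pct_change()
```

Ratios like these normalize raw counts across counties of very different sizes, which makes them more comparable inputs for clustering and regression.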
Challenges we ran into
Merging multiple datasets with inconsistent geographic labels and missing years
Handling missing values while preserving meaningful trends
Balancing interpretability and accuracy in our model
Limited documentation and metadata for some datasets slowed initial exploration
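The geographic-label problem above typically comes down to normalizing county names before joining. A sketch of that approach, with made-up source tables and column names standing in for the real datasets:

```python
import pandas as pd

# Two hypothetical sources that label the same counties differently
pit = pd.DataFrame({"County": ["Los Angeles County", "San Diego County"],
                    "pit_count": [70000, 8000]})
health = pd.DataFrame({"county_name": ["LOS ANGELES", "SAN DIEGO"],
                       "ed_visits": [120000, 34000]})

def normalize(name: str) -> str:
    # Drop the "County" suffix, trim whitespace, and title-case
    return name.replace("County", "").strip().title()

pit["county"] = pit["County"].map(normalize)
health["county"] = health["county_name"].map(normalize)

# An outer merge with indicator=True flags rows missing from either source
merged = pit.merge(health, on="county", how="outer", indicator=True)
```

The `_merge` indicator column is a quick way to audit which counties or years failed to match, before deciding how to impute or drop them.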
Accomplishments that we're proud of
Successfully created a unified, analysis-ready dataset from three different sources
Visualized key service gaps and demographic trends clearly
Built a baseline regression model and identified high-residual counties that might need special attention
Implemented clustering to find groups of counties with similar service challenges
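The residual-analysis step can be illustrated with scikit-learn: fit a simple regression, then rank counties by how far their actual counts fall from the model's predictions. The feature matrix and counts below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-county features (capacity ratio, YoY change) and counts
counties = np.array(["Alameda", "Fresno", "Kern", "Sacramento", "San Diego"])
X = np.array([[0.32, 0.12], [0.21, -0.03], [0.18, 0.05],
              [0.25, 0.08], [0.30, 0.02]])
y = np.array([9450, 4600, 5100, 9300, 8400])

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Counties the model explains worst may warrant closer investigation
worst = counties[np.argsort(-np.abs(residuals))][:2]
```

High-residual counties are not necessarily model failures: they may have local conditions the features do not capture, which is exactly what makes them candidates for targeted follow-up.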
What we learned
How to work with real-world public data that is messy, inconsistent, and incomplete
The importance of feature engineering in making sense of complex social systems
How to use clustering and residual analysis to guide deeper investigation
That even simple models can offer actionable insights when paired with thoughtful data preparation
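The clustering workflow mentioned above follows a standard scikit-learn pattern: standardize the engineered features, then group counties with KMeans. Feature values here are placeholders, not project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative per-county features: capacity ratio, visits per capita, YoY change
X = np.array([
    [0.32, 1.2, 0.12],
    [0.21, 0.8, -0.03],
    [0.18, 0.9, 0.05],
    [0.25, 1.1, 0.08],
    [0.30, 1.3, 0.02],
    [0.15, 0.7, 0.10],
])

# Standardize so no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

# Group counties into k=2 clusters of similar service profiles
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
```

Standardizing first matters because KMeans uses raw Euclidean distance; otherwise a feature on a larger scale (like visits per capita here) would dominate the clustering.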
What's next for Data Doomsday_Random
Add geospatial visualizations to better map service gaps across regions
Integrate weather, eviction, and unemployment data for deeper predictive power
Use more advanced models such as XGBoost or time-series forecasting (ARIMA/LSTM) for policy simulation
Build an interactive dashboard using Streamlit or Tableau to make insights accessible to non-technical audiences
Built With
- colab
- python