A comprehensive web application for analyzing infectious disease trends, generating forecasts, and visualizing epidemic patterns across countries and regions. Features advanced healthcare access clustering to understand global health infrastructure patterns.
This dashboard helps users understand and predict epidemic patterns by:
• Analyzing real outbreak data from COVID-19, SARS, and Monkeypox epidemics
• Generating 6-month forecasts of disease progression with confidence intervals
• Creating "what-if" scenarios showing how past outbreaks would look if they happened in 2025
• Mapping disease distribution worldwide with interactive visualizations
• Clustering countries by healthcare access using machine learning to identify access patterns
• Providing AI-generated insights in plain English about epidemic trends and healthcare capacity
We use authentic epidemic and health expenditure datasets, not simulated data:
• COVID-19 Data: 44,785 records from 2020-2022 covering 212 countries
- Source: COVID-19 Dataset on Kaggle
• SARS Data: 2,538 records from March-July 2003 covering 37 countries
• Monkeypox Data: 15,792 records from 2022 covering 109 countries
- Source: Monkeypox Dataset Daily Updated
• Total Epidemic Dataset: 63,115 records across 222 unique countries
• Health Expenditure Data: World Bank health spending indicators (2015-2022) covering 175 countries
- Health expenditure per capita (USD)
- Health expenditure as percentage of GDP
- GDP data for economic context
- Source: World Bank Open Data Platform
All data has been cleaned, standardized, and unified into a consistent format for analysis.
All datasets are publicly available and used under their respective licenses:
• COVID-19 data compiled by Bolkonsky
• SARS data maintained by imdevskp
• Monkeypox data curated by deepcontractor
• Health expenditure data from World Bank Open Data
We acknowledge the original data collectors and maintainers who made this analysis possible.
• Disease Selection: Choose between COVID-19, SARS, or Monkeypox
• Country Filtering: Select specific countries or view top affected regions
• Metric Options: Track total cases, daily new cases, deaths, or death rates
• Date Range Filtering: Focus on specific time periods
• Real-time Charts: Interactive time series and comparison visualizations
• Scientific Method: Uses exponential smoothing with epidemic curve modeling
• Confidence Intervals: Shows uncertainty ranges (95% confidence bands)
• Smart Dampening: Prevents unrealistic exponential growth projections
• Multiple Countries: Compare forecasts across different regions
• Performance Optimized: Handles up to 5 countries simultaneously for responsive analysis
• Reliability: Best for trend analysis and resource planning (not precise predictions)
• Machine Learning Clustering: K-means algorithm identifies 4 distinct healthcare access patterns
• Multi-Dimensional Analysis: Uses health spending per capita, GDP percentage, and economic capacity
• Real-World Categories:
- High Access - Advanced Economy (countries like USA, Germany)
- Medium-High Access - Developing (countries like Brazil, Thailand)
- High Priority - Limited Resources (countries with high health priority but resource constraints)
- Low Access - Resource Constrained (countries needing healthcare investment)
• Interactive Visualizations: Enhanced scatter plots with adjustable point sizes and jitter options
• Economic Pattern Analysis: GDP vs health spending relationships with trend lines
• Statistical Insights: Correlation analysis and efficiency metrics for each cluster
• Health Spending Tab: Bar charts comparing per capita spending and GDP percentage by cluster
• Global Distribution Tab:
- Interactive scatter plots with point sizes based on GDP
- Adjustable display options (point size, jitter, log scale)
- Cluster center visualization
- Comprehensive hover tooltips
• Economic Patterns Tab:
- Enhanced scatter plot showing GDP vs health spending relationships
- Optional trend lines for each cluster
- Correlation analysis and efficiency ratios
- Top performers identification
• Smart Color Coding: Consistent colors across all visualizations for easy pattern recognition
• Plain English Summaries: "Brazil: 15% increase trend observed over recent weeks (medium confidence)"
• Trend Analysis: Identifies increasing, decreasing, or stable patterns
• Confidence Scoring: High/Medium/Low reliability ratings based on data quality
• Visual Indicators: Color-coded confidence levels with trend arrows (📈📉➡️)
• Healthcare Access Insights: Automatic cluster characterization and country comparisons
• Educational Focus: Designed for learning and analysis, not clinical decisions
• What-If Analysis: Shows how historical outbreaks would unfold if they started in 2025
• Timeline Shifting: Maintains original outbreak patterns but with current dates
• Emergency Planning: Useful for preparedness exercises and resource planning
• Clear Warnings: Multiple reliability indicators (5-6/10) and appropriate use guidelines
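The timeline shifting described above can be sketched as a pure date offset. This is a minimal illustration, not the dashboard's actual code: the function name `shift_to_year`, the 1 January 2025 start date, and the sample case counts are all assumptions made for the example.

```python
from datetime import date

def shift_to_year(records, target_start=date(2025, 1, 1)):
    """Shift (day, cases) pairs so the earliest record lands on target_start.

    The spacing between records and the case counts are left untouched,
    which is what "maintains original outbreak patterns" amounts to.
    """
    if not records:
        return []
    first_day = min(day for day, _ in records)
    offset = target_start - first_day  # a timedelta applied to every record
    return [(day + offset, cases) for day, cases in records]

# Made-up sample values, for illustration only
historical = [
    (date(2003, 3, 17), 167),
    (date(2003, 3, 18), 219),
    (date(2003, 3, 20), 306),
]
shifted = shift_to_year(historical)
```

Because only the dates move, every assumption about 2003 conditions carries over unchanged, which is why the scenario mode is flagged as lower reliability.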
• World Visualization: Choropleth maps showing case distribution by country
• Bubble Maps: Alternative view with geographic coordinates
• Regional Statistics: Breakdown of cases by continent
• Top Affected Lists: Sortable tables of most impacted countries
• Seamless Switching: Move between epidemic trends, disease map, and healthcare access clustering
• Consistent Design: Uniform interface across all components
• Easy Access: One-click navigation between different analysis modes
Users can adjust clustering parameters, forecasting options, and visualization settings to explore the data from different angles.
The dashboard now offers two complementary forecasting methods:
• Exponential Smoothing: Traditional statistical approach with high transparency
• PyTorch Neural Network: Machine learning approach for complex pattern recognition
• Educational Comparison: Users can compare both methods to understand their strengths and limitations
The dashboard incorporates research from Marcos et al. on epidemic forecasting and healthcare utilization:
• Pattern Recognition: Advanced algorithms for identifying epidemic curve patterns
• Healthcare Utilization Analysis: Methods for understanding healthcare system responses
• Visualization Techniques: Research-backed approaches to data visualization
• Educational Framework: Structured learning about epidemic forecasting principles
Time series forecasting is like predicting the weather - we look at patterns from the past to estimate what will happen in the future. In our case, instead of temperature and rainfall, we're looking at disease case numbers over time.
Simple Example:
• If COVID-19 cases were 100 today, 120 tomorrow, and 150 the next day
• We can see an upward trend and predict it might reach 180 cases the following day
• Our system does this mathematically across months of data
Think of exponential smoothing like a "weighted memory" system:
How It Works:
• Recent data matters more - Yesterday's case count is more important than last month's
• Older data still helps - But historical patterns provide context
• Smooths out noise - Ignores random daily fluctuations to see the real trend
Real-World Analogy:
Imagine you're trying to predict how busy a coffee shop will be tomorrow:
• Today's customer count (most important)
• This week's pattern (very important)
• Last month's data (somewhat important)
• Last year's data (background context)
Exponential smoothing weights epidemic data in the same way: each older observation counts for less than the one that follows it.
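The "weighted memory" idea can be written in a few lines. The sketch below is simple exponential smoothing in its textbook form; the smoothing factor `alpha=0.3` and the function names are assumptions for illustration, and the dashboard's own forecaster may use different settings.

```python
def exponential_smoothing(values, alpha=0.3):
    """Return the smoothed series: each point blends the new value with the previous level."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        return []
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

def observation_weights(n, alpha=0.3):
    """Weight given to each of the last n observations, newest first."""
    return [alpha * (1 - alpha) ** age for age in range(n)]

daily_cases = [100, 120, 150]
levels = exponential_smoothing(daily_cases, alpha=0.5)
weights = observation_weights(4, alpha=0.5)
```

A larger `alpha` reacts faster to recent changes but smooths less noise; a smaller one does the opposite.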
Epidemics follow predictable patterns called "epidemic curves":
Typical Epidemic Phases:
- Slow Start - Few cases, gradual increase
- Exponential Growth - Cases double rapidly (most dangerous phase)
- Peak - Maximum daily cases reached
- Decline - Cases start falling as the population builds immunity or interventions take effect
- Tail - Low-level cases continue for extended period
Our Smart Adjustments:
• Epidemic Dampening - Prevents unrealistic endless growth predictions
• Trend Detection - Automatically identifies which phase an outbreak is in
• Context Awareness - Understands that epidemics eventually decline
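One common way to implement this kind of dampening is a damped trend, where each further step ahead adds a smaller share of the current trend. The source does not state the exact formula it uses, so the sketch below is an assumption: the function name, the damping factor `phi=0.9`, and the floor at zero are all illustrative.

```python
def damped_projection(level, trend, horizon, phi=0.9):
    """Project `horizon` steps ahead with a trend that fades by a factor phi per step.

    With phi < 1 the forecast approaches level + trend * phi / (1 - phi)
    instead of growing without bound. phi = 1 gives a plain linear trend.
    """
    forecasts = []
    damp_sum = 0.0
    for step in range(1, horizon + 1):
        damp_sum += phi ** step
        # Case counts cannot be negative, so floor the projection at zero
        forecasts.append(max(0.0, level + damp_sum * trend))
    return forecasts

projection = damped_projection(level=100, trend=10, horizon=3, phi=0.5)
```

In this example the projection flattens toward 110 cases rather than adding 10 cases every day indefinitely.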
Confidence intervals are like "error bars" that show uncertainty:
95% Confidence Interval Means:
• We're 95% confident the real number will fall within this range
• Upper bound - Pessimistic end of the range (more cases than predicted)
• Lower bound - Optimistic end of the range (fewer cases than predicted)
• Center line - Most likely outcome
Visual Example:
Predicted cases: 1,000
Confidence interval: 800 - 1,200
Meaning: We're 95% sure actual cases will be between 800-1,200
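The numbers in this example can be reproduced with a normal approximation: the interval is the prediction plus or minus about 1.96 standard errors. This is a sketch under that assumption; the function name and the standard error of 102.04 (chosen so the result matches the 800 - 1,200 range above) are illustrative, not values from the dashboard.

```python
from statistics import NormalDist

def prediction_interval(point_forecast, std_error, confidence=0.95):
    """Return (lower, upper) bounds assuming normally distributed forecast errors."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * std_error
    # Case counts cannot be negative, so the lower bound is floored at zero
    return max(0.0, point_forecast - margin), point_forecast + margin

lower, upper = prediction_interval(1000, std_error=102.04)
```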
Step-by-Step Process:
1. Data Preparation
• Clean historical case numbers
• Identify trend patterns
• Remove data outliers
2. Pattern Recognition
• Calculate recent trend (growing/declining/stable)
• Measure volatility (how much numbers jump around)
• Assess data quality (more data = more confidence)
3. Future Projection
• Apply exponential smoothing to recent trends
• Add epidemic-specific dampening to prevent unrealistic growth
• Generate 180 daily predictions (6 months)
4. Uncertainty Calculation
• Measure how much past predictions varied from reality
• Apply this uncertainty to future predictions
• Create upper/lower confidence bounds
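The uncertainty step can be sketched as follows: replay the smoother over the history, record how far each one-step prediction was from the real value, and use the size of those errors to set the width of the bands. Widening the band with the square root of the horizon is an assumption borrowed from random-walk error growth, and the function names and sample history are hypothetical.

```python
import math

def one_step_errors(values, alpha=0.3):
    """Errors made when each smoothed level is used to predict the next value."""
    level = values[0]
    errors = []
    for actual in values[1:]:
        errors.append(actual - level)
        level = alpha * actual + (1 - alpha) * level
    return errors

def forecast_bounds(point_forecasts, errors, z=1.96):
    """Return (lower, upper) pairs that widen with the forecast horizon."""
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    bounds = []
    for horizon, forecast in enumerate(point_forecasts, start=1):
        margin = z * rmse * math.sqrt(horizon)
        bounds.append((max(0.0, forecast - margin), forecast + margin))
    return bounds

history = [100, 110, 105, 120, 118]           # made-up daily case counts
errors = one_step_errors(history, alpha=0.5)
bounds = forecast_bounds([115.25] * 180, errors)  # flat 180-day point forecast
```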
Why 6 Months?
• Short enough - Patterns don't change dramatically over this timeframe
• Long enough - Useful for planning and resource allocation
• Evidence-based - Research shows forecasting accuracy diminishes significantly beyond 6 months
• Practical balance - Provides actionable insights without overconfident long-term predictions
Higher Reliability (7-8/10):
• Trend Direction - Is the epidemic growing, declining, or staying stable?
• Pattern Recognition - How similar is this outbreak to historical ones?
• Relative Comparison - Which countries/regions are most affected?
• Resource Planning - Approximate healthcare capacity needs
Medium Reliability (5-6/10):
• Approximate Numbers - General magnitude of case counts (within broad ranges)
• Timeline Estimates - Rough timing of peaks or declines
• 2025 Projections - "What-if" scenario planning only
Lower Reliability (3-4/10):
• Exact Case Counts - Precise daily numbers
• External Factor Prediction - Policy changes, new variants, behavioral shifts
• Long-term Predictions - Anything beyond 6 months
• Individual Decision Making - Personal health or medical decisions
We Could Use Machine Learning, But:
• Data Requirements - Need massive datasets with consistent quality
• Overfitting Risk - Complex models can memorize noise instead of real patterns
• Interpretability - Harder to explain and understand how predictions are made
• Epidemic Context - General AI models don't understand epidemic-specific patterns
• Educational Value - Simpler models are better for learning and transparency
Exponential Smoothing Advantages:
• Proven Track Record - Used successfully in epidemiology for decades
• Transparent - Easy to understand how predictions are made
• Robust - Works well with limited or imperfect data
• Fast - Generates predictions quickly for multiple countries
• Educational - Great for understanding forecasting principles
Healthcare access clustering groups countries based on similar patterns of healthcare spending and economic capacity. Think of it like organizing countries into "healthcare neighborhoods" - countries with similar healthcare resources and priorities end up in the same group.
Instead of making subjective judgments about which countries have "good" or "poor" healthcare access, we let the data speak for itself. The clustering algorithm finds natural patterns in the data that might not be obvious at first glance.
Real-World Example:
• You might think all wealthy countries have great healthcare access
• But some wealthy countries spend relatively little on healthcare (low priority)
• While some developing countries spend a high percentage of their GDP on health (high priority)
• Clustering reveals these nuanced patterns automatically
We analyze countries using three key healthcare indicators:
1. Health Expenditure per Capita (USD)
• What it shows: How much each person's healthcare costs on average
• Why it matters: Higher spending often means better access to care
• Examples: USA ~$12,000/person, India ~$75/person
2. Health Expenditure as % of GDP
• What it shows: How much of a country's economy goes to healthcare
• Why it matters: Shows national healthcare priority level
• Examples: USA ~18% of GDP, Bangladesh ~2.5% of GDP
3. Economic Capacity (Total GDP)
• What it shows: Size of the country's overall economy
• Why it matters: Larger economies can afford more healthcare infrastructure
• Examples: USA ~$25 trillion, Luxembourg ~$85 billion
K-means is like organizing a scattered group of people into 4 distinct circles:
Step-by-Step Process:
- Start with Raw Data - Plot all 175 countries in 3D space (spending, priority, capacity)
- Standardize Measurements - Make sure no single metric dominates (like converting feet and meters to same scale)
- Find Natural Groups - Algorithm finds 4 cluster centers that minimize within-group differences
- Assign Countries - Each country goes to its nearest cluster center
- Label Clusters - We interpret what each cluster represents in real-world terms
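The "find groups, then assign countries" loop above is the core of K-means. The sketch below writes it out in plain Python so each step is visible. It assumes the three features are already scaled, and it picks random starting centers for brevity rather than the K-means++ initialization; the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k=4, max_iter=300, seed=42):
    """Plain K-means (Lloyd's algorithm) on tuples of equal length."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting centers (simplification)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return labels, centers

# Hypothetical scaled values: (spending per capita, % of GDP, total GDP)
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.1),
    (1.0, 1.0, 1.0), (0.9, 1.0, 0.9),
    (0.0, 1.0, 0.0), (0.1, 0.9, 0.0),
    (1.0, 0.0, 1.0), (0.9, 0.1, 1.0),
]
labels, centers = kmeans(points, k=4)
```

The cluster numbers that come out are arbitrary; the real-world names are attached afterwards by inspecting each center.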
Why 4 Clusters?
• Simple enough - Easy to understand and interpret
• Detailed enough - Captures major healthcare access patterns
• Research-backed - Studies show 4-5 clusters work well for country-level health analysis
🟢 High Access - Advanced Economy
• Profile: Wealthy countries with high absolute spending
• Characteristics: >$1,500 per capita AND >$100B GDP
• Examples: USA, Germany, Japan, France
• Healthcare Reality: Advanced facilities, latest technology, comprehensive coverage
🟡 Medium-High Access - Developing
• Profile: Growing economies with good healthcare priority
• Characteristics: >$300 per capita AND >6% of GDP
• Examples: Brazil, Thailand, South Africa, Malaysia
• Healthcare Reality: Expanding systems, improving access, mixed public-private
🔵 High Priority - Limited Resources
• Profile: Countries prioritizing health despite economic constraints
• Characteristics: >6.5% of GDP AND <$400 per capita
• Examples: Sierra Leone, Malawi, Nepal, Cambodia
• Healthcare Reality: Strong political commitment, efficient use of limited resources
🔴 Low Access - Resource Constrained
• Profile: Countries with limited healthcare spending and capacity
• Characteristics: Lower across all metrics
• Examples: Afghanistan, Chad, Central African Republic
• Healthcare Reality: Basic care only, significant infrastructure needs
Different clusters need different types of international health support:
• High Access: Technology sharing, research collaboration
• Medium-High: Infrastructure investment, specialist training
• High Priority: Financial support, capacity building
• Low Access: Basic infrastructure, emergency health systems
Data Preprocessing:
• Missing Data Handling: Use 2020-2022 averages for stability
• Outlier Management: Cap extreme values to prevent distortion
• Feature Scaling: Rescale all metrics to a common 0-1 range (min-max scaling)
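These three preprocessing steps can be sketched in plain Python. The function names, the use of `None` for a missing year, and the percentile caps are assumptions for illustration; the source does not say which cap thresholds it applies.

```python
def three_year_average(yearly_values):
    """Average the available 2020-2022 values, skipping missing years (None)."""
    present = [v for v in yearly_values if v is not None]
    return sum(present) / len(present) if present else None

def cap_outliers(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to the given lower and upper percentile of the data."""
    ordered = sorted(values)
    low = ordered[int(lower_pct * (len(ordered) - 1))]
    high = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, low), high) for v in values]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

average = three_year_average([100, None, 200])          # one missing year
capped = cap_outliers([1, 2, 3, 4, 1000], 0.0, 0.75)     # extreme value clipped
scaled = min_max_scale([10, 20, 30])
```

Capping before scaling matters: a single extreme value would otherwise squeeze every other country into a narrow band near zero.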
Clustering Algorithm:
• Method: K-means with k=4 clusters
• Initialization: K-means++ for better starting points
• Iterations: Up to 300 iterations with early stopping
• Random State: Fixed seed (42) for reproducible results
Validation Methods:
• Silhouette Analysis: Measures how well-separated clusters are
• Elbow Method: Checks that 4 clusters is a reasonable choice
• Domain Expert Review: Healthcare professionals validate cluster interpretations
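The settings listed above map directly onto scikit-learn parameters. The sketch below shows one possible configuration with the two numerical validation checks. It runs on randomly generated placeholder data, not the World Bank figures, and `n_init=10` and the range of k values tried are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Placeholder data for 175 "countries" (synthetic, for illustration only)
rng = np.random.default_rng(42)
features = np.column_stack([
    rng.lognormal(mean=6.0, sigma=1.5, size=175),   # spending per capita (USD)
    rng.uniform(2.0, 12.0, size=175),               # spending as % of GDP
    rng.lognormal(mean=25.0, sigma=2.0, size=175),  # total GDP (USD)
])

scaled = MinMaxScaler().fit_transform(features)  # 0-1 range per feature

model = KMeans(n_clusters=4, init="k-means++", n_init=10,
               max_iter=300, random_state=42)
labels = model.fit_predict(scaled)

# Silhouette: closer to 1 means tighter, better-separated clusters
score = silhouette_score(scaled, labels)

# Elbow method: inertia for several k values, to look for the bend in the curve
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled).inertia_
    for k in range(2, 8)
]
```

Neither check picks k automatically: the silhouette score and the inertia curve are read alongside the domain review.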
• Unified Schema: Converts different dataset formats into consistent structure
• Country Standardization: Maps various country names to standard format
• Date Normalization: Handles different date formats across datasets
• Missing Data Handling: Fills gaps and handles inconsistencies
• Performance Optimization: Processes 63k+ records efficiently
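A unified schema of this kind usually comes down to three small mappings: column names, country aliases, and date formats. The sketch below is a hypothetical illustration. The alias table, the list of date formats, the output field names, and the source column names in the example are all assumptions, not the project's actual `data_processor.py`.

```python
from datetime import datetime

# Hypothetical alias table: maps name variants to one standard name
COUNTRY_ALIASES = {
    "US": "United States",
    "USA": "United States",
    "Korea, South": "South Korea",
}

# Hypothetical list of date formats seen across the source files
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")

def parse_date(text):
    """Try each known format in turn and return a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

def unify_record(raw, disease, column_map):
    """Convert one source row into the shared (disease, country, date, cases) shape."""
    country = raw[column_map["country"]].strip()
    return {
        "disease": disease,
        "country": COUNTRY_ALIASES.get(country, country),
        "date": parse_date(raw[column_map["date"]]),
        # Empty case counts are treated as zero
        "cases": int(float(raw[column_map["cases"]] or 0)),
    }

row = {"Country/Region": " US ", "ObservationDate": "03/17/2003", "Confirmed": "167.0"}
columns = {"country": "Country/Region", "date": "ObservationDate", "cases": "Confirmed"}
record = unify_record(row, "SARS", columns)
```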
• EpidemicForecaster Class: Main forecasting logic with epidemic-specific adjustments
• Exponential Smoothing: Proven time series method adapted for epidemic curves
• EpidemicTimeSeriesModel: PyTorch-based neural network for complex pattern recognition
• Trend Detection: Automatically identifies growth/decline patterns
• Batch Processing: Handles multiple countries simultaneously
• Error Handling: Graceful fallbacks for insufficient data
• Model Selection: User choice between traditional and machine learning approaches
• K-means Implementation: Scikit-learn clustering with 4 optimized clusters
• Data Preprocessing: 3-year averaging (2020-2022) for pandemic stability
• Interactive Visualizations: Enhanced scatter plots with GDP-based point sizing
• Statistical Analysis: Correlation analysis and efficiency ratio calculations
• Multi-Tab Interface: Health spending, global distribution, and economic patterns
• InsightGenerator Class: Converts numerical forecasts to human-readable text
• Trend Metrics: Calculates 1-month, 3-month, and 6-month percentage changes
• Confidence Assessment: Evaluates reliability based on historical data quality
• Natural Language: Generates insights in conversational English
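Turning a percentage change into a sentence can be sketched as below. The thresholds (a 5% band for "stable", 60 and 180 data points for medium and high confidence) and the function names are assumptions chosen for illustration; only the output wording follows the example summary quoted earlier in this README.

```python
def percent_change(baseline, value):
    """Percentage change from baseline to value, or None if the baseline is zero."""
    if baseline == 0:
        return None
    return (value - baseline) / baseline * 100

def describe_trend(country, baseline, value, data_points, stable_band=5.0):
    """Build a plain-English summary with a coarse confidence label."""
    change = percent_change(baseline, value)
    if change is None:
        return f"{country}: not enough recent cases to compute a trend"

    # More history means more confidence (hypothetical cut-offs)
    if data_points >= 180:
        confidence = "high"
    elif data_points >= 60:
        confidence = "medium"
    else:
        confidence = "low"

    if abs(change) <= stable_band:
        return f"{country}: stable trend observed over recent weeks ({confidence} confidence)"
    direction = "increase" if change > 0 else "decrease"
    return (f"{country}: {abs(change):.0f}% {direction} trend observed "
            f"over recent weeks ({confidence} confidence)")

summary = describe_trend("Brazil", baseline=1000, value=1150, data_points=90)
```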
• Header Section: Title, navigation, and key metrics cards
• Filter Panel: Disease, country, date, and metric selection
• Main Chart Area: Interactive time series with forecasting
• Insights Panel: Real-time generated summaries and trend analysis
• Comparison Charts: Bar charts for country-to-country analysis
• Three-Tab Design: Health spending, global distribution, and economic patterns
• Interactive Controls: Point size adjustment, jitter options, log scaling, trend lines
• Smart Color Coding: Consistent four-cluster color scheme across all visualizations
• Comprehensive Statistics: Summary tables, correlation analysis, and efficiency metrics
• Country Listings: Organized by cluster with spending averages and key metrics
When using "Project to 2025" mode, users see:
• Prominent warnings about scenario planning nature
• Reliability scores (5-6/10) throughout the interface
• Clear guidance on appropriate vs inappropriate uses
• Educational tooltips explaining assumptions and limitations
Unlike many educational projects that use simulated data, EpiAccess analyzes authentic epidemic records from three major global outbreaks. This provides genuine insights into how diseases spread and how healthcare systems respond.
The healthcare access clustering isn't just academic - it identifies real patterns that could inform international health policy and resource allocation decisions.
We use proven epidemiological methods (exponential smoothing) but present results in plain English that non-experts can understand and act upon.
Rather than overselling capabilities, we're explicit about reliability levels, appropriate use cases, and methodology limitations.
• Primary Use: Educational analysis and learning about epidemic patterns
• NOT for: Real-time health decisions, clinical guidance, or policy making
• Data Limitations: Historical data may not reflect current healthcare improvements or changes
• 6-Month Forecasts: Medium reliability (6-7/10) for trend analysis and educational purposes
• Best for: Understanding epidemic patterns, comparative analysis, learning forecasting concepts
• Limitations: Cannot predict policy changes, new variants, external shocks, or healthcare improvements
• Educational Value: Excellent for learning about epidemic curves and forecasting principles
• Scenario Planning Only: Medium-low reliability (4-5/10) for educational "what-if" exercises
• Good for: Understanding outbreak patterns, emergency planning concepts, comparative analysis
• NOT for: Actual predictions, policy decisions, economic planning, or real emergency planning
• Major assumptions: 2025 conditions identical to historical conditions (highly unrealistic)
• Educational Purpose: Demonstrates forecasting concepts and pattern recognition
• Pattern Recognition: High reliability (8-9/10) for identifying general access categories
• Good for: Understanding global health patterns, educational analysis, research concepts
• Limitations: Based on spending data only; doesn't capture healthcare quality, outcomes, or accessibility
• NOT for: Individual country detailed assessments, policy recommendations, or funding decisions
• Educational Focus: Great for learning about health economics and clustering analysis
• Python 3.9+ (the pandas 2.2 and NumPy 1.26 versions listed below do not support Python 3.8)
• Package Manager: pip or conda
• Memory: 4GB+ RAM recommended for clustering analysis
• Storage: 100MB+ for data files
# Clone the repository
git clone [repository-url]
cd EpiAccess
# Install dependencies
pip install -r requirements.txt
# Run the application
streamlit run Home.py

# Create virtual environment (recommended)
python -m venv epiaccess_env
source epiaccess_env/bin/activate # On Windows: epiaccess_env\Scripts\activate
# Install specific packages if not using requirements.txt
# (the quotes stop the shell from treating ">=" as output redirection)
pip install "streamlit>=1.40.0" "pandas>=2.2.0" "numpy>=1.26.4"
pip install "plotly>=5.22.0" "scikit-learn>=1.4.0" "openpyxl>=3.1.0"
pip install "matplotlib>=3.8.4" "seaborn>=0.13.2"
# Verify installation
python -c "import streamlit, pandas, sklearn, plotly; print('All dependencies installed successfully!')"

The app expects these files in the data/ directory:
Epidemic Data:
• cleaned_covid_data.csv - COVID-19 time series data
• sars_2003_complete_dataset_clean.csv - SARS outbreak data
• Daily_Country_Monkeypox_Confirmed_Cases.csv - Monkeypox case data
Healthcare Access Data:
• cleaned_health_expenditure.xlsx - World Bank health spending data (2015-2022)
- Download Data: Ensure all required data files are in the data/ directory
- Test Access: Visit http://localhost:8501 after running the app
- Performance Check: Allow 10-15 seconds for initial clustering calculations
- Browser Compatibility: Works best with Chrome, Firefox, or Safari
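Before launching the app, a quick check that the four data files are in place can save a confusing startup error. The helper below is a hypothetical convenience script, not part of the repository; the file names are the ones listed above.

```python
from pathlib import Path

REQUIRED_FILES = [
    "cleaned_covid_data.csv",
    "sars_2003_complete_dataset_clean.csv",
    "Daily_Country_Monkeypox_Confirmed_Cases.csv",
    "cleaned_health_expenditure.xlsx",
]

def missing_data_files(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files found.")
```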
├── Home.py # Main landing page with navigation
├── pages/
│ ├── Disease Trends.py # Main forecasting dashboard
│ ├── Disease Map.py # Interactive disease mapping
│ └── Healthcare Access.py # Healthcare access clustering analysis
├── utils/
│ ├── data_processor.py # Data cleaning and unification
│ └── forecast_engine.py # Forecasting and insights engine
├── data/
│ ├── cleaned_covid_data.csv # COVID-19 epidemic data
│ ├── sars_2003_complete_dataset_clean.csv # SARS outbreak data
│ ├── Daily_Country_Monkeypox_Confirmed_Cases.csv # Monkeypox data
│ └── cleaned_health_expenditure.xlsx # World Bank health spending data
├── requirements.txt # Python package dependencies
└── README.md # This comprehensive documentation
• Healthcare Access Clustering: Advanced K-means analysis with interactive visualizations
• Enhanced Data Processing: World Bank health expenditure integration
• Interactive Controls: Point size adjustment, jitter options, log scaling
• Statistical Analysis: Correlation analysis and efficiency metrics
• Multi-Tab Visualizations: Health spending, global distribution, and economic patterns
• Streamlined Navigation: Clean, three-page application structure
• PyTorch Integration: Added neural network forecasting as an alternative to exponential smoothing
• Marcos Research Implementation: Incorporated research-backed visualization and analysis techniques
• Predictive Access Models: Forecasting healthcare access trends over time
• Enhanced Forecasting: Additional statistical models and ensemble methods
• Real-time Data: Integration with live epidemic monitoring systems
• Mobile Optimization: Responsive design for mobile devices
• Cluster Validation: Advanced statistical validation of healthcare access clusters
• Performance Optimization: Faster clustering calculations and data loading
• Additional Diseases: Expand to include more epidemic datasets (influenza, dengue, etc.)
• Advanced Analytics: Deep learning models for complex pattern recognition
• Export Capabilities: PDF reports, Excel exports, and data download options
• User Experience: Enhanced tooltips, better error handling, and guided tours