Automated SRAG (Severe Acute Respiratory Syndrome) analytics using LangGraph agents, OpenAI, and DATASUS data
- Overview
- Architecture
- Features
- Prerequisites
- Installation
- Configuration
- Usage
- API Documentation
- Governance & Security
- Development
- Project Structure
- License
This project implements an AI-powered analytics system for SRAG (Síndrome Respiratória Aguda Grave) data from Open DATASUS. It uses LangGraph agents to orchestrate:
- SQL queries (with safety guardrails)
- Real-time news retrieval (Tavily Search)
- RAG over data dictionary (pgvector)
- Automated report generation
- Taxa de aumento de casos (Case increase rate)
- Taxa de mortalidade (Mortality rate)
- Taxa de ocupação de UTI (ICU occupancy rate)
- Taxa de vacinação (Vaccination rate)
- Daily cases (last 30 days)
- Monthly cases (last 12 months)
The system follows a microservices architecture with:
- PostgreSQL (with pgvector) for data storage and vector search
- FastAPI backend exposing LangGraph agents via REST API
- Streamlit frontend for interactive dashboards
- Docker containers for backend and database
The system uses LangGraph to orchestrate a multi-step agentic workflow with parallel execution for improved performance. The first three data-gathering nodes run concurrently (fan-out), then converge (fan-in) at the report writing stage.
```mermaid
graph TD
    A[User Request] --> B[SRAGReportAgent];
    B --> C[Node 1: calculate_metrics];
    B --> D[Node 2: fetch_news];
    B --> E[Node 3: generate_charts];
    C --> F[Node 4: write_report];
    D --> F;
    E --> F;
    F --> G[Node 5: create_audit];
    G --> H[Return: Report + Audit Trail];

    C -.SQL.-> DB[(PostgreSQL)];
    D -.Tavily API.-> News[News Sources];
    D -.OpenAI.-> LLM1[gpt-4o-mini];
    E -.SQL.-> DB;
    F -.OpenAI.-> LLM2[gpt-5];

    subgraph parallel [" ⚡ PARALLEL EXECUTION"]
        C
        D
        E
    end

    style C fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e9
    style F fill:#fff3e0
    style G fill:#fce4ec
    style parallel fill:#fff3e0,stroke:#ff9800,stroke-width:2px
```
1. calculate_metrics Node — runs in parallel
- Purpose: Computes 4 core SRAG metrics from DATASUS database
- Calls:
  - `metrics_tool.calculate_all_metrics(days, state_filter)`
- Operations:
- Case increase rate (current vs previous period)
- Mortality rate (deaths / total cases)
- ICU occupancy rate (ICU admissions / hospitalizations)
- Vaccination rate (vaccinated cases / total cases)
- Output: Dictionary with all 4 metrics + metadata
2. fetch_news Node — runs in parallel
- Purpose: Retrieves recent Portuguese news about SRAG
- Calls:
  - `news_tool.search_srag_news(days, max_results=10)` - Tavily search API
  - `news_tool._extract_date_with_llm(title, content)` - gpt-4o-mini for date extraction
- Operations:
- Searches Brazilian news domains (G1, Folha, CNN Brasil, Fiocruz, etc.)
- Filters by SRAG-related keywords
- Extracts publication dates using LLM when Tavily doesn't provide them
- Validates dates are within requested time window
- Output: News citations with title, URL, date, and content preview
3. generate_charts Node — runs in parallel
- Purpose: Prepares time-series data for frontend visualization
- Calls:
  - `metrics_tool.get_daily_cases_chart_data(days)`
  - `metrics_tool.get_monthly_cases_chart_data(months=12)`
- Operations:
- Queries daily metrics for trend analysis
- Aggregates monthly data for 12-month overview
- Output: Chart data arrays for daily and monthly visualizations
4. write_report Node — waits for all parallel nodes
- Purpose: Generates human-readable report in Portuguese
- Calls:
  - `ChatOpenAI(model="gpt-5").invoke(messages)`
- Operations:
- Receives merged state from all 3 parallel nodes
- Synthesizes metrics and news context
- Produces structured markdown report (~500 words)
- Sections: Executive Summary, Metrics Analysis, News Context, Conclusion
- Output: Complete Portuguese SRAG report
5. create_audit Node
- Purpose: Creates audit trail for transparency and debugging
- Operations:
- Captures all state transitions and messages
- Logs SQL queries executed
- Saves full execution log to the `/logs` directory
- Filters messages for user-facing audit trail
- Output: JSON audit trail with execution metadata
The workflow uses custom reducers to support parallel state updates. When multiple nodes run concurrently, LangGraph merges their outputs using these reducers:
```python
class ReportState(TypedDict):
    # Message accumulation (all nodes append)
    messages: Annotated[Sequence[BaseMessage], add]

    # Read-only config (keep first value from initialization)
    days: Annotated[int, keep_first]
    state_filter: Annotated[Optional[str], keep_first]

    # Parallel node outputs (merge dictionaries/lists)
    metrics: Annotated[Optional[Dict[str, Any]], merge_dicts]
    news_citations: Annotated[Optional[list], merge_lists]
    chart_data: Annotated[Optional[Dict[str, Any]], merge_dicts]

    # Sequential node outputs (keep latest)
    final_report: Annotated[Optional[str], keep_latest]
    audit_trail: Annotated[Optional[Dict[str, Any]], keep_latest]
```

Fan-out/Fan-in Pattern:
- Fan-out: `START` branches to 3 parallel nodes simultaneously
- Parallel execution: Each node returns only the fields it updates
- Fan-in: LangGraph merges all outputs using reducers before `write_report`
- Sequential: Report writing and audit creation run sequentially
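The reducer names in the state definition above suggest merge semantics along these lines. This is an illustrative sketch, not the project's actual reducer code; only the names (`keep_first`, `keep_latest`, `merge_dicts`, `merge_lists`) come from the snippet.

```python
from typing import Any, Dict, List, Optional

# Hypothetical implementations of the custom reducers named in ReportState.
# LangGraph calls a reducer with (current_value, new_value) when merging
# the outputs of parallel nodes at the fan-in point.

def keep_first(current: Any, new: Any) -> Any:
    """Read-only config: keep the value set at initialization."""
    return current if current is not None else new

def keep_latest(current: Any, new: Any) -> Any:
    """Sequential outputs: the most recent write wins."""
    return new if new is not None else current

def merge_dicts(current: Optional[Dict], new: Optional[Dict]) -> Dict:
    """Parallel dict outputs: shallow-merge, later keys override."""
    return {**(current or {}), **(new or {})}

def merge_lists(current: Optional[List], new: Optional[List]) -> List:
    """Parallel list outputs: concatenate."""
    return (current or []) + (new or [])

# Fan-in: two parallel nodes each return only their own fields,
# and the merge preserves both instead of clobbering one.
metrics = merge_dicts({"mortality_rate": 4.2}, {"icu_rate": 31.0})
citations = merge_lists([{"title": "G1"}], [{"title": "Fiocruz"}])
```

Without such reducers, LangGraph would raise a conflict when two concurrent nodes write to the same state key.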
The system also includes a conversational chat agent (chat_agent.py) that uses the ReAct pattern for interactive Q&A. Users can ask natural language questions and the agent autonomously decides which tools to use:
- `query_database` - Executes validated SQL queries via `sql_tool.py`
- `search_news` - Searches recent SRAG news
- `lookup_field` - Looks up data dictionary definitions
- `get_metrics` - Retrieves current SRAG metrics
- `get_table_schema` - Gets database table schemas
The SQL tool includes safety guardrails (SELECT-only, allowed tables whitelist, query validation). See Architecture Documentation for details.
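The guardrails described above (SELECT-only, table whitelist, query validation) are commonly implemented as a pre-execution check. The sketch below illustrates the idea; it is not the actual `sql_tool.py` code, and the regexes are simplified assumptions.

```python
import re

# Tables the agent is allowed to read (from this project's schema)
ALLOWED_TABLES = {"srag_cases", "data_dictionary", "daily_metrics", "monthly_metrics"}

# Any write/DDL keyword rejects the query outright
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|create|grant|truncate)\b", re.I)

def validate_query(sql: str) -> bool:
    """Accept only a single SELECT statement over whitelisted tables."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                          # no stacked statements
        return False
    if not stmt.lower().startswith("select"):
        return False
    if FORBIDDEN.search(stmt):
        return False
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", stmt, re.I)
    return all(t.lower() in ALLOWED_TABLES for t in tables)
```

Running the query through a read-only database user, as this project does, adds a second independent layer even if the validator is bypassed.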
Before starting, ensure you have:
- Python 3.11+ - Download here
- uv - Fast Python package manager - Install guide
- Docker & Docker Compose - Install Docker Desktop
- OpenAI API key - Get from OpenAI Platform
- Tavily API key - Register at Tavily
```bash
# 1. Clone the repository
git clone <your-repo-url>
cd srag-analytics

# 2. Install uv (if not already installed)
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# 3. Configure API keys
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY and TAVILY_API_KEY

# 4. Start Docker services (backend + PostgreSQL)
docker-compose up -d

# 5. Install Python dependencies with uv
uv sync

# 6. Download the SRAG datasets and the PDF data dictionary from
#    https://opendatasus.saude.gov.br/dataset/srag-2021-a-2024
#    and place them in the /data folder

# 7. Ingest SRAG data (runs inside the Docker container, takes 5-10 minutes)
docker-compose exec backend python -m backend.db.ingestion

# (optional) Parser for the future SQL generator agent
docker-compose exec backend python -m backend.db.dictionary_parser

# 8. Launch the Streamlit frontend (runs outside the container with uv)
uv run streamlit run frontend/app.py
```

Open http://localhost:8501 to access the dashboard.
Note: The frontend (Streamlit) runs on your local machine using uv, while the backend and database run in Docker containers.
For step-by-step instructions with troubleshooting, see:
- Quick Start: QUICKSTART.md - Get running in under 10 minutes
- Complete Setup: SETUP_INSTRUCTIONS.md - Detailed installation guide
Create a .env file in the project root with the following configuration:
```bash
# Required API Keys
OPENAI_API_KEY=sk-your-openai-key-here
TAVILY_API_KEY=tvly-your-tavily-key-here

# Database Configuration (default values shown)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=srag_db
DB_USER=srag_user
DB_PASSWORD=srag_password

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
ENVIRONMENT=development

# LLM Configuration
LLM_MODEL=gpt-5           # Main model for report generation
LLM_MINI_MODEL=gpt-4o-mini # Auxiliary model for date extraction (fast, no extended thinking)
LLM_CHAT_MODEL=gpt-5-mini  # Chat agent model
LLM_TEMPERATURE=0.3        # Slight creativity for natural report writing
LLM_MAX_TOKENS=2000

# News Search Configuration
TAVILY_MAX_RESULTS=10
NEWS_DAYS_DEFAULT=7
```

Note: Database tables are created automatically when the backend container starts. No manual initialization is required.

The system creates:
- `srag_cases` - Main SRAG data table
- `data_dictionary` - Data dictionary with embeddings
- `daily_metrics` - Materialized view for daily aggregates
- `monthly_metrics` - Materialized view for monthly aggregates
- Open http://localhost:8501
- Select a state filter (optional)
- Adjust time period (default: 30 days)
- Click "Generate Report" in the sidebar
- Wait 30-60 seconds for AI processing
- View:
- Metrics: 4 key SRAG indicators
- Charts: Daily and monthly case trends
- News: Recent SRAG-related news with dates
- Report: AI-generated analysis in Portuguese
- Download audit trail JSON for transparency
```bash
# Generate a report for the last 30 days
curl -X POST http://localhost:8000/generate_report \
  -H "Content-Type: application/json" \
  -d '{"days": 30, "state_filter": null}'

# Generate a state-specific report
curl -X POST http://localhost:8000/generate_report \
  -H "Content-Type: application/json" \
  -d '{"days": 30, "state_filter": "SP"}'
```

Endpoints:
- `GET /health` - Health check
- `POST /generate_report` - Generate AI report
- `POST /metrics` - Get calculated metrics
- `GET /sql/tables` - List database tables
- `GET /docs` - Interactive API documentation (Swagger)
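The same request can be made from Python with only the standard library. This is a sketch assuming the default compose setup (backend on `localhost:8000`); the response schema is whatever `/generate_report` returns.

```python
import json
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: default docker-compose setup

def build_report_request(days: int = 30, state_filter: Optional[str] = None):
    """Build the URL, headers, and JSON body for POST /generate_report."""
    if days <= 0:
        raise ValueError("days must be positive")
    body = json.dumps({"days": days, "state_filter": state_filter})
    return (f"{BASE_URL}/generate_report",
            {"Content-Type": "application/json"},
            body)

# To actually send it (requires the backend to be running):
# import urllib.request
# url, headers, body = build_report_request(30, "SP")
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# with urllib.request.urlopen(req) as resp:
#     report = json.load(resp)
```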
All metrics are calculated using DATASUS field definitions from the official data dictionary. Below are the precise calculation methods and denominators used.
Definition: Percentage change in cases between current and previous period.
Formula:
Δ% = ((casos_período_atual - casos_período_anterior) / casos_período_anterior) * 100
Fields Used:
- `DT_SIN_PRI` (Date of first symptoms) - Used to determine case period
- Count all cases where `DT_SIN_PRI` falls within the specified time window
Denominator: Total cases in the previous period (N days before current period)
Example: For a 30-day report:
- Current period: Last 30 days → 1,000 cases
- Previous period: 30 days before that → 800 cases
- Increase rate: `((1000 - 800) / 800) * 100 = 25%`
UI Label: Taxa de aumento de casos — Δ% (últimos N dias vs N anteriores) por DT_SIN_PRI
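The worked example above can be reproduced with a few lines of Python (an illustrative sketch; the production calculation runs in SQL inside `metrics_tool.py`):

```python
def case_increase_rate(current_cases: int, previous_cases: int) -> float:
    """Delta% between the current period and the previous period of equal length."""
    if previous_cases == 0:
        raise ValueError("previous period has no cases; rate is undefined")
    return (current_cases - previous_cases) / previous_cases * 100

# The example above: 1,000 cases in the last 30 days vs 800 in the 30 days before
rate = case_increase_rate(1000, 800)  # 25.0
```

Note the guard for an empty previous period: division by zero must be handled explicitly, since a region may have no cases in the comparison window.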
Definition: Percentage of deaths among cases in the period with a recorded outcome.
Formula:
Taxa = (óbitos / casos_com_evolução_registrada) * 100
Fields Used:
- `EVOLUCAO` (Case outcome):
  - `1` = Cura (Recovery)
  - `2` = Óbito (Death)
  - `3` = Óbito por outras causas (Death by other causes)
  - `9` = Ignorado (Unknown)
- `DT_SIN_PRI` (Date of first symptoms) - For period filtering
Calculation (from pre-computed daily_metrics table):
```sql
SELECT
  SUM(new_deaths) AS total_deaths,
  SUM(cases_with_outcome) AS total_cases_with_outcome,
  (SUM(new_deaths)::float / SUM(cases_with_outcome)) * 100 AS mortality_rate
FROM daily_metrics
WHERE metric_date >= CURRENT_DATE - INTERVAL '30 days'
  AND state IS NULL  -- National totals (or filter by state)
```

Where `daily_metrics.cases_with_outcome` is pre-computed as:

```sql
-- During ingestion
SUM(CASE WHEN evolucao IN (1, 2) THEN 1 ELSE 0 END) AS cases_with_outcome
```

Denominator: Cases with recorded outcome (`EVOLUCAO IN (1, 2)`) - not all cases in the period
Rationale: Using all cases (including those still hospitalized or with unknown outcomes) would artificially lower the mortality rate. We only include cases with a final outcome (recovery or death) excluding unknown outcomes.
UI Label: Taxa de mortalidade — óbitos (evolucao=2) ÷ casos com evolucao IN (1,2)
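The denominator rule can be made concrete with a small sketch (illustrative only; the real computation is the SQL above):

```python
# DATASUS outcome codes: 1 = recovery, 2 = death,
# 3 = death by other causes, 9/None = unknown

def mortality_rate(outcomes: list) -> float:
    """Deaths (evolucao=2) over cases with a final outcome (evolucao in 1, 2)."""
    with_outcome = [o for o in outcomes if o in (1, 2)]
    if not with_outcome:
        raise ValueError("no cases with a recorded outcome in the period")
    deaths = sum(1 for o in with_outcome if o == 2)
    return deaths / len(with_outcome) * 100

# 2 deaths, 6 recoveries, 2 unknown: the denominator is 8, not 10
rate = mortality_rate([1, 1, 1, 1, 1, 1, 2, 2, 9, None])  # 25.0
```

Using all 10 cases as the denominator would have yielded 20% instead of 25%, which is exactly the dilution effect the rationale above warns about.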
Definition: Percentage of hospitalized cases that required ICU admission.
Formula:
Taxa = (casos_internados_em_UTI / casos_hospitalizados) * 100
Fields Used:
- `HOSPITAL` (Hospitalization status):
  - `1` = Sim (Yes)
  - `2` = Não (No)
  - `9` = Ignorado (Unknown)
- `UTI` (ICU admission):
  - `1` = Sim (Yes)
  - `2` = Não (No)
  - `9` = Ignorado (Unknown)
- `DT_SIN_PRI` (Date of first symptoms) - For period filtering
Calculation:
```sql
SELECT
  COUNT(*) FILTER (WHERE uti = 1) AS icu_cases,
  COUNT(*) AS hospitalized_cases,
  (COUNT(*) FILTER (WHERE uti = 1)::float / COUNT(*)) * 100 AS icu_rate
FROM srag_cases
WHERE hospital = 1  -- Only hospitalized cases
  AND dt_sin_pri >= CURRENT_DATE - INTERVAL '30 days'
```

Denominator: Hospitalized cases (`HOSPITAL = 1`) - not all cases in the period
Rationale: ICU occupancy is a proxy for disease severity among hospitalized patients, not the general case population.
UI Label: Taxa de UTI entre internados (proxy) — UTI=1 ÷ HOSPITAL=1
Definition: Percentage of cases that received at least one COVID-19 vaccine dose.
Formula:
Taxa = (casos_vacinados / total_casos_no_período) * 100
Fields Used:
- `VACINA_COV` (Received COVID-19 vaccine):
  - `1` = Sim (Yes)
  - `2` = Não (No)
  - `9` = Ignorado (Unknown)
- `DT_SIN_PRI` (Date of first symptoms) - For period filtering
Calculation:
```sql
SELECT
  COUNT(*) FILTER (WHERE vacina_cov = 1) AS vaccinated,
  COUNT(*) AS total_cases,
  (COUNT(*) FILTER (WHERE vacina_cov = 1)::float / COUNT(*)) * 100 AS vaccination_rate
FROM srag_cases
WHERE dt_sin_pri IS NOT NULL
  AND dt_sin_pri >= CURRENT_DATE - INTERVAL '30 days'
```

Denominator: All cases in the period with a symptom date (`DT_SIN_PRI IS NOT NULL`)
Rationale: Missing vaccination data (VACINA_COV IS NULL or = 9) is treated as unvaccinated, following standard epidemiological practice. This provides a conservative estimate of vaccination coverage.
Alternative Denominator (not currently used): If you want vaccination rate "among cases with known vaccination status", add WHERE vacina_cov IN (1, 2) to both numerator and denominator.
UI Label: % de casos com registro de vacinação — vacina_cov=1 ÷ casos (todos os casos no período)
Definition: Percentage of cases that received 2 or more COVID-19 vaccine doses.
Formula:
Taxa = (casos_com_2+_doses / total_casos_no_período) * 100
Fields Used:
- `DOSE_2_COV` (2nd dose date) - DATE field, not NULL if vaccinated
- `DOSE_REF` (Booster dose date) - DATE field, not NULL if vaccinated
- `DOSE_2REF` (2nd booster dose date) - DATE field, not NULL if vaccinated
- `DT_SIN_PRI` (Date of first symptoms) - For period filtering
Calculation:
```sql
SELECT
  COUNT(*) FILTER (WHERE dose_2_cov IS NOT NULL
                      OR dose_ref IS NOT NULL
                      OR dose_2ref IS NOT NULL) AS fully_vaccinated,
  COUNT(*) AS total_cases,
  (COUNT(*) FILTER (WHERE dose_2_cov IS NOT NULL
                       OR dose_ref IS NOT NULL
                       OR dose_2ref IS NOT NULL)::float / COUNT(*)) * 100 AS full_vaccination_rate
FROM srag_cases
WHERE dt_sin_pri IS NOT NULL
  AND dt_sin_pri >= CURRENT_DATE - INTERVAL '30 days'
```

Denominator: All cases in the period with a symptom date
Rationale:
- A person is considered "fully vaccinated" if they have a recorded date for:
  - 2nd dose (`DOSE_2_COV IS NOT NULL`), OR
  - Booster dose (`DOSE_REF IS NOT NULL`), OR
  - 2nd booster dose (`DOSE_2REF IS NOT NULL`)
- Anyone with booster doses is automatically ≥2 doses
- Missing dose dates (NULL values) are treated as unvaccinated
UI Label: % de casos "2+ doses" — (dose_2_cov IS NOT NULL OR dose_ref IS NOT NULL OR dose_2ref IS NOT NULL) ÷ todos os casos no período
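The NULL-as-unvaccinated convention is the subtle part of this metric. A small sketch of the same logic (illustrative; production uses the SQL above):

```python
from datetime import date
from typing import Optional

def has_two_plus_doses(dose_2_cov: Optional[date],
                       dose_ref: Optional[date],
                       dose_2ref: Optional[date]) -> bool:
    """Any recorded 2nd-dose or booster date implies >= 2 doses; NULL counts as no."""
    return any(d is not None for d in (dose_2_cov, dose_ref, dose_2ref))

def full_vaccination_rate(cases: list) -> float:
    """% over ALL cases in the period (the conservative denominator above)."""
    vaccinated = sum(1 for c in cases if has_two_plus_doses(*c))
    return vaccinated / len(cases) * 100

cases = [
    (date(2024, 1, 5), None, None),   # 2nd dose recorded
    (None, date(2024, 3, 1), None),   # booster recorded
    (None, None, None),               # no dose dates -> treated as unvaccinated
    (None, None, None),
]
rate = full_vaccination_rate(cases)  # 50.0
```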
The choice of denominator is critical for interpreting metrics correctly:
| Metric | Denominator | Why? |
|---|---|---|
| Case Increase Rate | Cases in previous period | To measure percentage change |
| Mortality Rate | Cases with known outcome (`evolucao IN (1,2)`) | Pending/unknown cases would dilute the rate |
| ICU Occupancy | Hospitalized cases only (`hospital = 1`) | ICU admission only applies to hospitalized patients |
| Vaccination Rate | All cases in period | Missing vaccination data = unvaccinated (conservative) |
| Full Vaccination Rate | All cases in period | Conservative estimate of immunization coverage |
Transparency: All metric calculations use parameterized SQL queries visible in backend/tools/metrics_tool.py. No LLM-generated SQL is used in production.
Interactive API documentation is available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
The API follows RESTful conventions and returns JSON responses.
The system implements multiple layers of safety and governance:
- SQL Injection Protection
  - All database queries use parameterized SQL
  - No LLM-generated SQL in production
  - Pre-defined queries with type validation
  - Read-only database user for query operations
- Audit Trails
  - Every report generation is logged to the `/logs` directory
  - Audit trail includes:
    - All state transitions
    - SQL queries executed
    - LLM calls with prompts and responses
    - Execution timestamps
  - JSON format for easy parsing and analysis
- Prompt Management
  - Centralized prompts in backend/agents/prompts.py
  - XML-tagged prompts for better LLM instruction clarity
  - Version-controlled for tracking changes
  - Supports A/B testing and prompt optimization
- LLM Safety
  - Temperature set to 0.3 for balanced creativity in report writing (0 used for date extraction)
  - Token limits enforced
  - Retries with exponential backoff
  - Error handling and graceful degradation
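Retry with exponential backoff is typically a small wrapper around the LLM call. A minimal sketch of the pattern (not the project's actual implementation; the delays and attempt count are assumptions):

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: propagate the error
            # 0.5s, 1s, 2s, ... plus up to 100ms of jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage sketch: report = with_retries(lambda: llm.invoke(messages))
```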
- Data Privacy
  - No PII (Personally Identifiable Information) stored
  - DATASUS data is anonymized public health data
- Audit Trail Download: Users can download full execution logs as JSON
- Source Attribution: All news articles include URLs and publication dates
- Metric Calculations: Formulas are documented and visible
- Open Source: Full codebase available for inspection
- Rate limiting for API endpoints
- User authentication and authorization
- SQL query whitelisting for `sql_tool`
- Input validation and sanitization
- HTTPS/TLS for production deployments
```
srag-analytics/
├── backend/
│   ├── agents/                  # LangGraph agents
│   │   ├── report_agent.py      # Main orchestrator (fan-out/fan-in)
│   │   ├── chat_agent.py        # ReAct chat agent for Q&A
│   │   ├── prompts.py           # Centralized prompts
│   │   └── guardrails.py        # Safety checks
│   ├── tools/                   # Agent tools
│   │   ├── metrics_tool.py      # Database metrics
│   │   ├── news_tool.py         # Tavily news search
│   │   ├── sql_tool.py          # Safe SQL execution
│   │   └── rag_tool.py          # Data dictionary RAG
│   ├── db/                      # Database setup
│   │   ├── init_database.py     # Schema creation
│   │   ├── ingestion.py         # CSV ingestion
│   │   └── dictionary_parser.py # PDF parsing
│   └── main.py                  # FastAPI application
├── frontend/
│   └── app.py                   # Streamlit UI (runs with uv)
├── data/                        # SRAG CSV files
├── docs/                        # Documentation
│   ├── architecture.md          # System architecture docs
│   ├── architecture_diagram.pdf # Visual architecture
│   └── workflow_graph.png       # LangGraph workflow
├── logs/                        # Execution logs (auto-generated, Docker volume)
├── docker-compose.yml           # Docker orchestration
├── Dockerfile.backend           # Backend container
├── pyproject.toml               # Python dependencies
└── uv.lock                      # Lockfile for reproducible builds
```
- Edit backend/tools/metrics_tool.py
- Add SQL query to calculate the metric
- Update `calculate_all_metrics()` to include it
- Update report prompt in backend/agents/prompts.py
- Add chart data function in backend/tools/metrics_tool.py
- Call it in `generate_charts_node` in backend/agents/report_agent.py
- Create visualization in frontend/app.py
- Edit backend/agents/prompts.py
- Use XML tags for structure: `<task>`, `<guidelines>`, `<format>`, etc.
- Restart backend: `docker-compose restart backend`
- Test with sample reports
```bash
# Test specific endpoint
curl -X POST http://localhost:8000/generate_report \
  -H "Content-Type: application/json" \
  -d '{"days": 7}'

# Check logs
tail -f logs/report_*.json
```

For production deployment:
- Environment Variables: Update `.env` with production credentials
- Database: Use managed PostgreSQL (AWS RDS, Google Cloud SQL, etc.)
- Scaling: Deploy backend with load balancer (multiple FastAPI instances)
- Monitoring: Add application monitoring (Datadog, New Relic, etc.)
- Logging: Centralize logs (ELK stack, CloudWatch, etc.)
- Security: Enable HTTPS, add authentication, rate limiting
This project is licensed under the MIT License. See LICENSE file for details.
- Architecture Documentation: docs/architecture.md
- Architecture Diagram: docs/architecture_diagram.pdf
- Workflow Graph: docs/workflow_graph.png
- Quick Start: QUICKSTART.md
- Setup Guide: SETUP_INSTRUCTIONS.md
Built using LangGraph, OpenAI, and DATASUS data