Issue 5: Health Monitoring & Auto-Recovery
Labels: enhancement, monitoring, reliability
Problem
Users create tunnels but have no visibility into whether they're actually working. When tunnels fail, they stay broken until manual intervention.
Current State
Tunnels show a "connected" status based on database state, not actual connectivity. There is no way to know whether:
- rathole client crashed
- Service stopped responding
- Network connection lost
- Caddy routing broken
Proposed Solution
1. Background Health Check Service
Runs every 60 seconds and checks each tunnel:
- TCP connection test to VPS rathole port
- HTTP request to subdomain (for HTTP tunnels)
- Update tunnel status in database
2. Status Indicators
- 🟢 Healthy: Last check successful (<2 min ago)
- 🟡 Degraded: Check failed, attempting recovery
- 🔴 Down: Multiple failures, manual intervention needed
- ⚪ Unknown: Never checked, or the last check is too old
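These badges can be derived mechanically from the check bookkeeping. A minimal sketch, assuming the two-minute freshness window above and a hypothetical consecutive-failure counter (the threshold of 3 is an assumption, not spec):

```go
import "time"

// TunnelStatus maps health-check bookkeeping onto the four badges above.
type TunnelStatus string

const (
	StatusHealthy  TunnelStatus = "healthy"  // 🟢 recent successful check
	StatusDegraded TunnelStatus = "degraded" // 🟡 failing, recovery running
	StatusDown     TunnelStatus = "down"     // 🔴 repeated failures
	StatusUnknown  TunnelStatus = "unknown"  // ⚪ never checked or stale
)

// statusFor derives a badge from the last check time and a consecutive
// failure counter.
func statusFor(lastChecked time.Time, consecutiveFailures int) TunnelStatus {
	switch {
	case lastChecked.IsZero() || time.Since(lastChecked) > 2*time.Minute:
		return StatusUnknown
	case consecutiveFailures == 0:
		return StatusHealthy
	case consecutiveFailures >= 3:
		return StatusDown
	default:
		return StatusDegraded
	}
}
```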
3. Auto-Recovery Actions
When a health check fails:
- Log the failure with a timestamp
- Attempt automatic recovery (see the sketch after this list):
  - SSH to the machine
  - Restart the rathole client: `sudo systemctl restart rathole`
  - Wait 10 seconds
  - Re-check health
- If recovery succeeds: log it and mark the tunnel healthy
- If recovery fails: mark it down and alert the user
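A minimal sketch of the recovery step, assuming key-based SSH auth via golang.org/x/crypto/ssh; MachineUser, MachineAddr, s.signer, s.hostKey, and checkOnce are hypothetical names for illustration, not existing APIs of this project:

```go
import (
	"fmt"
	"time"

	"golang.org/x/crypto/ssh"
)

// attemptRecovery SSHes to the tunnel's machine, restarts the rathole
// client, waits, and re-checks health.
func (s *HealthService) attemptRecovery(tunnel *Tunnel) error {
	config := &ssh.ClientConfig{
		User:            tunnel.MachineUser,
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(s.signer)},
		HostKeyCallback: ssh.FixedHostKey(s.hostKey), // pin the expected host key
		Timeout:         10 * time.Second,
	}
	client, err := ssh.Dial("tcp", tunnel.MachineAddr+":22", config)
	if err != nil {
		return fmt.Errorf("ssh dial: %w", err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("ssh session: %w", err)
	}
	defer session.Close()

	// Restart the rathole client on the remote machine.
	if err := session.Run("sudo systemctl restart rathole"); err != nil {
		return fmt.Errorf("restart rathole: %w", err)
	}

	time.Sleep(10 * time.Second) // give the client time to reconnect
	return s.checkOnce(tunnel)   // hypothetical single-shot health probe
}
```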
4. Status Dashboard Updates
- Show health status on tunnel list (color-coded badges)
- Show "Last checked" timestamp
- Show "Last successful" timestamp
- Show recent health check history (sparkline graph)
- Manual "Test Now" button per tunnel
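The "Test Now" button could hit a small endpoint that runs one synchronous check; the route, query parameter, and GetTunnelByID helper here are hypothetical:

```go
import "net/http"

// handleTestNow runs an immediate health check for one tunnel on demand.
func (s *HealthService) handleTestNow(w http.ResponseWriter, r *http.Request) {
	tunnel, err := s.db.GetTunnelByID(r.URL.Query().Get("id"))
	if err != nil {
		http.Error(w, "unknown tunnel", http.StatusNotFound)
		return
	}
	s.CheckTunnel(tunnel) // same probe the background service uses
	w.WriteHeader(http.StatusNoContent)
}
```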
5. Health Check History
Store in database:
health_checks:
- tunnel_id
- checked_at (timestamp)
- result (success/failure)
- latency_ms
- error_message (if failed)
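In Go, each row might map to a small struct (a sketch; field types are assumptions):

```go
import "time"

// HealthCheck is one row in the health_checks table.
type HealthCheck struct {
	TunnelID     int64     // tunnel being checked
	CheckedAt    time.Time // when the probe ran
	Result       string    // "success" or "failure"
	LatencyMS    int64     // probe round-trip time
	ErrorMessage string    // populated only on failure
}
```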
Display in Status tab:
- Overall uptime percentage per tunnel
- Downtime incidents (when, duration, cause)
- Performance graph (latency over time)
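Overall uptime then falls out of the stored rows directly, e.g. as the success ratio over the retained window (using the HealthCheck struct sketched above):

```go
// uptimePercent returns the share of successful checks in the window.
func uptimePercent(checks []HealthCheck) float64 {
	if len(checks) == 0 {
		return 0 // no data yet; the UI would show "Unknown"
	}
	ok := 0
	for _, c := range checks {
		if c.Result == "success" {
			ok++
		}
	}
	return 100 * float64(ok) / float64(len(checks))
}
```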
Implementation
Background Service:
```go
import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Start runs health checks for every tunnel once per minute.
func (s *HealthService) Start() {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		tunnels := s.db.GetAllTunnels()
		for _, tunnel := range tunnels {
			go s.CheckTunnel(tunnel) // parallel checks
		}
	}
}

// CheckTunnel probes a single tunnel and records the result.
func (s *HealthService) CheckTunnel(tunnel *Tunnel) {
	// Test TCP connection to the rathole port.
	conn, err := net.DialTimeout("tcp",
		fmt.Sprintf("localhost:%d", tunnel.VPSPort),
		5*time.Second)
	if err != nil {
		s.handleFailure(tunnel, err)
		return
	}
	conn.Close()

	// For HTTP tunnels, also test an end-to-end HTTP request.
	if tunnel.Protocol == "http" {
		client := &http.Client{Timeout: 10 * time.Second}
		resp, err := client.Get(fmt.Sprintf("https://%s.%s",
			tunnel.Subdomain, tunnel.Domain))
		if err != nil {
			s.handleFailure(tunnel, err)
			return
		}
		resp.Body.Close()
		if resp.StatusCode >= 500 {
			s.handleFailure(tunnel, fmt.Errorf("HTTP %d from tunnel endpoint",
				resp.StatusCode))
			return
		}
	}
	s.recordSuccess(tunnel)
}
```
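For completeness, the handleFailure / recordSuccess paths referenced above might look like the following sketch, tying the probe into the health_checks table (Section 5) and the recovery flow (Section 3); SaveHealthCheck, SetTunnelStatus, and notifyUser are hypothetical helpers:

```go
import "time"

// handleFailure records the failed probe and triggers auto-recovery.
func (s *HealthService) handleFailure(tunnel *Tunnel, err error) {
	s.db.SaveHealthCheck(&HealthCheck{
		TunnelID:     tunnel.ID,
		CheckedAt:    time.Now(),
		Result:       "failure",
		ErrorMessage: err.Error(),
	})
	s.db.SetTunnelStatus(tunnel.ID, "degraded")

	if recoveryErr := s.attemptRecovery(tunnel); recoveryErr != nil {
		s.db.SetTunnelStatus(tunnel.ID, "down")
		s.notifyUser(tunnel, recoveryErr) // manual intervention needed
		return
	}
	s.db.SetTunnelStatus(tunnel.ID, "healthy") // recovery succeeded
}

// recordSuccess stores a successful probe (latency capture omitted here).
func (s *HealthService) recordSuccess(tunnel *Tunnel) {
	s.db.SaveHealthCheck(&HealthCheck{
		TunnelID:  tunnel.ID,
		CheckedAt: time.Now(),
		Result:    "success",
	})
	s.db.SetTunnelStatus(tunnel.ID, "healthy")
}
```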
Acceptance Criteria
Future Enhancements:
- Configurable check intervals
- Webhook notifications on failures
- Uptime SLA tracking
- Historical performance analytics