Health monitoring & Auto-recovery #12

@smalex-z

Description

Issue 5: Health Monitoring & Auto-Recovery

Labels: enhancement, monitoring, reliability

Problem

Users create tunnels but have no visibility into whether they're actually working. When tunnels fail, they stay broken until manual intervention.

Current State

Tunnels show a "connected" status based on database state, not actual connectivity. There is no way to know whether:

  • rathole client crashed
  • Service stopped responding
  • Network connection lost
  • Caddy routing broken

Proposed Solution

1. Background Health Check Service

Runs every 60 seconds and checks each tunnel:

  • TCP connection test to VPS rathole port
  • HTTP request to subdomain (for HTTP tunnels)
  • Update tunnel status in database

2. Status Indicators

  • 🟢 Healthy: Last check successful (<2 min ago)
  • 🟡 Degraded: Check failed, attempting recovery
  • 🔴 Down: Multiple failures, manual intervention needed
  • ⚪ Unknown: Never checked, or last check too old to trust

3. Auto-Recovery Actions

When a health check fails:

  1. Log failure with timestamp
  2. Attempt automatic recovery:
    • SSH to machine
    • Restart rathole client: sudo systemctl restart rathole
    • Wait 10 seconds
    • Re-check health
  3. If recovery succeeds: log and mark healthy
  4. If recovery fails: mark down, alert user

4. Status Dashboard Updates

  • Show health status on tunnel list (color-coded badges)
  • Show "Last checked" timestamp
  • Show "Last successful" timestamp
  • Show recent health check history (sparkline graph)
  • Manual "Test Now" button per tunnel

5. Health Check History

Store in database:

health_checks:
  - tunnel_id
  - checked_at (timestamp)
  - result (success/failure)
  - latency_ms
  - error_message (if failed)
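The table above, written out as SQLite-style DDL plus the Go row type a scanner might use. Column types, the `id` primary key, and the `tunnels(id)` foreign key are assumptions; the issue only lists the column names:

```go
package main

import (
	"fmt"
	"time"
)

// DDL for the health_checks table described in the issue.
// Types and constraints are assumed, not specified.
const createHealthChecks = `
CREATE TABLE IF NOT EXISTS health_checks (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    tunnel_id     INTEGER NOT NULL REFERENCES tunnels(id),
    checked_at    TIMESTAMP NOT NULL,
    result        TEXT NOT NULL CHECK (result IN ('success', 'failure')),
    latency_ms    INTEGER,
    error_message TEXT
);`

// HealthCheck mirrors one health_checks row.
type HealthCheck struct {
	TunnelID     int64
	CheckedAt    time.Time
	Result       string // "success" or "failure"
	LatencyMS    int64
	ErrorMessage string // empty unless Result == "failure"
}

func main() {
	fmt.Println(createHealthChecks)
	_ = HealthCheck{CheckedAt: time.Now(), Result: "success"}
}
```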

Display in Status tab:

  • Overall uptime percentage per tunnel
  • Downtime incidents (when, duration, cause)
  • Performance graph (latency over time)
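The uptime percentage and incident count above reduce to simple aggregation over stored check results. A minimal sketch using bools (oldest first) in place of full health_checks rows:

```go
package main

import "fmt"

// uptimePercent returns the share of successful checks, 0–100.
func uptimePercent(results []bool) float64 {
	if len(results) == 0 {
		return 0
	}
	ok := 0
	for _, success := range results {
		if success {
			ok++
		}
	}
	return 100 * float64(ok) / float64(len(results))
}

// incidents counts runs of consecutive failures; each run is one
// downtime incident.
func incidents(results []bool) int {
	n, inFailure := 0, false
	for _, success := range results {
		if !success && !inFailure {
			n++
		}
		inFailure = !success
	}
	return n
}

func main() {
	checks := []bool{true, true, false, false, true, false, true, true}
	fmt.Printf("uptime: %.1f%%\n", uptimePercent(checks)) // uptime: 62.5%
	fmt.Println("incidents:", incidents(checks))          // incidents: 2
}
```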

Implementation

Background Service:

func (s *HealthService) Start() {
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        tunnels := s.db.GetAllTunnels()
        for _, tunnel := range tunnels {
            go s.CheckTunnel(tunnel) // parallel checks, one goroutine per tunnel
        }
    }
}

func (s *HealthService) CheckTunnel(tunnel *Tunnel) {
    // Test TCP connection to the rathole port on the VPS
    conn, err := net.DialTimeout("tcp",
        fmt.Sprintf("localhost:%d", tunnel.VPSPort),
        5*time.Second)
    if err != nil {
        s.handleFailure(tunnel, err)
        return
    }
    conn.Close()

    // For HTTP tunnels, also test an end-to-end HTTP request
    if tunnel.Protocol == "http" {
        client := &http.Client{Timeout: 10 * time.Second} // don't let checks hang
        resp, err := client.Get(fmt.Sprintf("https://%s.%s",
            tunnel.Subdomain, tunnel.Domain))
        if err != nil {
            s.handleFailure(tunnel, err)
            return
        }
        resp.Body.Close() // always release the connection
        if resp.StatusCode >= 500 {
            s.handleFailure(tunnel, fmt.Errorf("HTTP %d from %s.%s",
                resp.StatusCode, tunnel.Subdomain, tunnel.Domain))
            return
        }
    }

    s.recordSuccess(tunnel)
}
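The handleFailure and recordSuccess calls above are not shown in the issue. A minimal sketch of the bookkeeping behind them, assuming a per-tunnel consecutive-failure counter and a configurable threshold for giving up on auto-recovery:

```go
package main

import "fmt"

// checkState tracks consecutive failures for one tunnel; the field name
// and threshold handling are assumptions.
type checkState struct {
	ConsecutiveFailures int
}

// onResult updates the counter and returns the next action:
// "ok" (record success), "recover" (attempt auto-recovery), or
// "down" (mark down and alert the user).
func (c *checkState) onResult(success bool, maxFailures int) string {
	if success {
		c.ConsecutiveFailures = 0
		return "ok"
	}
	c.ConsecutiveFailures++
	if c.ConsecutiveFailures >= maxFailures {
		return "down"
	}
	return "recover"
}

func main() {
	var st checkState
	fmt.Println(st.onResult(false, 3)) // recover
	fmt.Println(st.onResult(false, 3)) // recover
	fmt.Println(st.onResult(false, 3)) // down
	fmt.Println(st.onResult(true, 3))  // ok
}
```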

Acceptance Criteria

  • Health checks run automatically every 60 seconds
  • Tunnel status reflects actual connectivity
  • Failed tunnels trigger automatic recovery
  • Status dashboard shows real-time health
  • Health check history stored and displayed
  • Manual "Test Now" button works
  • Performance impact minimal (<5% CPU)

Future Enhancements:

  • Configurable check intervals
  • Webhook notifications on failures
  • Uptime SLA tracking
  • Historical performance analytics

Metadata

Assignees: no one assigned
Labels: beta (Required for beta release), feature (New feature or request)
Status: Done
Development: no branches or pull requests