Health monitoring & Auto-recovery #12

@smalex-z

Description

Issue 5: Health Monitoring & Auto-Recovery

Labels: enhancement, monitoring, reliability

Problem

Users create tunnels but have no visibility into whether they're actually working. When tunnels fail, they stay broken until manual intervention.

Current State

Tunnels show a "connected" status based on database state, not actual connectivity. There is no way to know whether:

  • rathole client crashed
  • Service stopped responding
  • Network connection lost
  • Caddy routing broken

Proposed Solution

1. Background Health Check Service

Runs every 60 seconds and checks each tunnel:

  • TCP connection test to VPS rathole port
  • HTTP request to subdomain (for HTTP tunnels)
  • Update tunnel status in database

2. Status Indicators

  • 🟢 Healthy: Last check successful (<2 min ago)
  • 🟡 Degraded: Check failed, attempting recovery
  • 🔴 Down: Multiple failures, manual intervention needed
  • ⚪ Unknown: Never checked, or last check too old to trust

3. Auto-Recovery Actions

When a health check fails:

  1. Log failure with timestamp
  2. Attempt automatic recovery:
    • SSH to machine
    • Restart rathole client: sudo systemctl restart rathole
    • Wait 10 seconds
    • Re-check health
  3. If recovery succeeds: log and mark healthy
  4. If recovery fails: mark down, alert user

4. Status Dashboard Updates

  • Show health status on tunnel list (color-coded badges)
  • Show "Last checked" timestamp
  • Show "Last successful" timestamp
  • Show recent health check history (sparkline graph)
  • Manual "Test Now" button per tunnel

5. Health Check History

Store in database:

health_checks:
  - tunnel_id
  - checked_at (timestamp)
  - result (success/failure)
  - latency_ms
  - error_message (if failed)
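The table above, written out as SQLite-style DDL plus the Go row type a scanner might use. Column types, the `id` primary key, and the `tunnels(id)` foreign key are assumptions; the issue only lists the column names:

```go
package main

import (
	"fmt"
	"time"
)

// DDL for the health_checks table described in the issue.
// Types and constraints are assumed, not specified.
const createHealthChecks = `
CREATE TABLE IF NOT EXISTS health_checks (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    tunnel_id     INTEGER NOT NULL REFERENCES tunnels(id),
    checked_at    TIMESTAMP NOT NULL,
    result        TEXT NOT NULL CHECK (result IN ('success', 'failure')),
    latency_ms    INTEGER,
    error_message TEXT
);`

// HealthCheck mirrors one health_checks row.
type HealthCheck struct {
	TunnelID     int64
	CheckedAt    time.Time
	Result       string // "success" or "failure"
	LatencyMS    int64
	ErrorMessage string // empty unless Result == "failure"
}

func main() {
	fmt.Println(createHealthChecks)
	_ = HealthCheck{CheckedAt: time.Now(), Result: "success"}
}
```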

Display in Status tab:

  • Overall uptime percentage per tunnel
  • Downtime incidents (when, duration, cause)
  • Performance graph (latency over time)
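The uptime percentage and incident count above reduce to simple aggregation over stored check results. A minimal sketch using bools (oldest first) in place of full health_checks rows:

```go
package main

import "fmt"

// uptimePercent returns the share of successful checks, 0–100.
func uptimePercent(results []bool) float64 {
	if len(results) == 0 {
		return 0
	}
	ok := 0
	for _, success := range results {
		if success {
			ok++
		}
	}
	return 100 * float64(ok) / float64(len(results))
}

// incidents counts runs of consecutive failures; each run is one
// downtime incident.
func incidents(results []bool) int {
	n, inFailure := 0, false
	for _, success := range results {
		if !success && !inFailure {
			n++
		}
		inFailure = !success
	}
	return n
}

func main() {
	checks := []bool{true, true, false, false, true, false, true, true}
	fmt.Printf("uptime: %.1f%%\n", uptimePercent(checks)) // uptime: 62.5%
	fmt.Println("incidents:", incidents(checks))          // incidents: 2
}
```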

Implementation

Background Service:

func (s *HealthService) Start() {
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        tunnels := s.db.GetAllTunnels()
        for _, tunnel := range tunnels {
            go s.CheckTunnel(tunnel) // parallel checks, one goroutine per tunnel
        }
    }
}

func (s *HealthService) CheckTunnel(tunnel *Tunnel) {
    // Test TCP connection to the rathole port on the VPS
    conn, err := net.DialTimeout("tcp",
        fmt.Sprintf("localhost:%d", tunnel.VPSPort),
        5*time.Second)
    if err != nil {
        s.handleFailure(tunnel, err)
        return
    }
    conn.Close()

    // For HTTP tunnels, also test an end-to-end HTTP request
    if tunnel.Protocol == "http" {
        client := &http.Client{Timeout: 10 * time.Second} // don't let checks hang
        resp, err := client.Get(fmt.Sprintf("https://%s.%s",
            tunnel.Subdomain, tunnel.Domain))
        if err != nil {
            s.handleFailure(tunnel, err)
            return
        }
        resp.Body.Close() // always release the connection
        if resp.StatusCode >= 500 {
            s.handleFailure(tunnel, fmt.Errorf("HTTP %d from %s.%s",
                resp.StatusCode, tunnel.Subdomain, tunnel.Domain))
            return
        }
    }

    s.recordSuccess(tunnel)
}
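The handleFailure and recordSuccess calls above are not shown in the issue. A minimal sketch of the bookkeeping behind them, assuming a per-tunnel consecutive-failure counter and a configurable threshold for giving up on auto-recovery:

```go
package main

import "fmt"

// checkState tracks consecutive failures for one tunnel; the field name
// and threshold handling are assumptions.
type checkState struct {
	ConsecutiveFailures int
}

// onResult updates the counter and returns the next action:
// "ok" (record success), "recover" (attempt auto-recovery), or
// "down" (mark down and alert the user).
func (c *checkState) onResult(success bool, maxFailures int) string {
	if success {
		c.ConsecutiveFailures = 0
		return "ok"
	}
	c.ConsecutiveFailures++
	if c.ConsecutiveFailures >= maxFailures {
		return "down"
	}
	return "recover"
}

func main() {
	var st checkState
	fmt.Println(st.onResult(false, 3)) // recover
	fmt.Println(st.onResult(false, 3)) // recover
	fmt.Println(st.onResult(false, 3)) // down
	fmt.Println(st.onResult(true, 3))  // ok
}
```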

Acceptance Criteria

  • Health checks run automatically every 60 seconds
  • Tunnel status reflects actual connectivity
  • Failed tunnels trigger automatic recovery
  • Status dashboard shows real-time health
  • Health check history stored and displayed
  • Manual "Test Now" button works
  • Performance impact minimal (<5% CPU)

Future Enhancements:

  • Configurable check intervals
  • Webhook notifications on failures
  • Uptime SLA tracking
  • Historical performance analytics

Metadata

Assignees: no one assigned
Labels: beta (Required for beta release), feature (New feature or request)
Status: Done
Development: no branches or pull requests