hrygo/openclaw-watchdog

OpenClaw Gateway Watchdog

A Go-based watchdog daemon that monitors OpenClaw Gateway health and automatically restarts it when application-level failures are detected.

Features

  • Cross-Platform — Supports Linux, macOS, and Windows
  • HTTP Health Monitoring — Polls GET /health endpoint every 30s (configurable)
  • Multi-Format Health Checks — Supports {"ok":true}, {"ready":true}, {"mcpReady":true}, and {"status":"live"|"ready"|"ok"|"healthy"}
  • Application-Level Supervision — Detects application failures, timeouts, and degraded states that systemd cannot see
  • Exponential Backoff with Jitter — Prevents restart storms on persistent failures (30s → 60s → 120s... max 5m)
  • Single Instance — Uses platform-native file locking to prevent duplicate instances
  • Graceful Shutdown — Handles SIGTERM/SIGINT with proper cleanup
  • Dual Output Logging — JSON to file + human-readable text to stderr
  • Prometheus Metrics — Optional /metrics endpoint for monitoring
  • Flexible Configuration — TOML, environment variables, or CLI flags
  • Multi-Service Guardian Mode — Orchestrates startup/shutdown of dependent services with topological ordering
  • Monitor-Only Services — Services can be monitored without restart commands (Guardian mode)
  • Child Process Cleanup — Prevents mcp-server process leaks during gateway restarts
  • Deadlock Circuit Breaker — Prevents restart loops when failures persist
  • Health Check Retries — Automatic retry with exponential backoff for transient failures (PR #4)
  • Startup Grace Period — Configurable grace period after restart prevents false positives (PR #4)
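The multi-format health check listed above can be sketched as a small JSON classifier. This is a minimal illustration, not the watchdog's actual parser: the `healthBody` type and `isHealthy` function are hypothetical names, and it assumes that any single positive signal is enough to count the gateway as healthy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthBody mirrors the response shapes named in the feature list.
// Pointer fields distinguish "field absent" from "field is false".
type healthBody struct {
	OK       *bool  `json:"ok"`
	Ready    *bool  `json:"ready"`
	MCPReady *bool  `json:"mcpReady"`
	Status   string `json:"status"`
}

// isHealthy reports whether body matches one of the accepted shapes.
// Assumption: one positive signal is sufficient.
func isHealthy(body []byte) bool {
	var h healthBody
	if err := json.Unmarshal(body, &h); err != nil {
		return false // not JSON, or a field has an unexpected type
	}
	for _, flag := range []*bool{h.OK, h.Ready, h.MCPReady} {
		if flag != nil && *flag {
			return true
		}
	}
	switch h.Status {
	case "live", "ready", "ok", "healthy":
		return true
	}
	return false
}

func main() {
	bodies := []string{`{"ok":true}`, `{"status":"healthy"}`, `{"status":"degraded"}`, `not json`}
	for _, body := range bodies {
		fmt.Printf("%-22s healthy=%v\n", body, isHealthy([]byte(body)))
	}
}
```

A body such as `{"ok":false}` or an unknown status string is treated as unhealthy in this sketch, which is what lets the watchdog notice a process that is running but not serving.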

Supported Platforms

| Platform | Service Manager | File Locking |
|----------|-----------------|--------------|
| Linux    | systemd         | flock(2)     |
| macOS    | launchd         | flock(2)     |
| Windows  | Windows Service | LockFile API |
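For the Linux and macOS rows, single-instance locking with flock(2) might look like the sketch below. It is an assumption about the general technique, not the project's code: `acquireLock` and `exampleLockPath` are hypothetical names, the lock file location is made up, and the Windows LockFile path is not covered.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// exampleLockPath returns a throwaway lock file location for this demo only.
func exampleLockPath() string {
	return filepath.Join(os.TempDir(), "openclaw-watchdog-example.lock")
}

// acquireLock takes an exclusive, non-blocking flock(2) on path.
// The returned file must stay open for as long as the lock is held;
// closing it (or process exit) releases the lock.
func acquireLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("another instance appears to hold %s: %w", path, err)
	}
	return f, nil
}

func main() {
	lock, err := acquireLock(exampleLockPath())
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	defer lock.Close()
	fmt.Println("lock acquired; a second instance would exit at this point")
}
```

Because the kernel drops the lock when the holding process dies, a crashed watchdog does not leave a stale lock behind, which is the usual reason to prefer this over a PID file.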

Quick Start

Linux

# Clone and build
git clone https://github.com/hrygo/openclaw-watchdog.git
cd openclaw-watchdog
make build

# Install
sudo make install

# Install systemd service (optional)
make install-service

# Start
systemctl --user start openclaw-watchdog

macOS

# Build
make build-darwin

# Install
make install-darwin

# Install launchd service
make install-launchd

# Start
launchctl load ~/Library/LaunchAgents/openclaw-watchdog.plist

Windows

# Build
go build -o openclaw-watchdog.exe ./cmd/watchdog

# Install (run as Administrator)
.\scripts\install-windows.bat

Configuration

Legacy Mode (Single Service)

# ~/.config/openclaw-watchdog/watchdog.toml
health_url = "http://localhost:18789/health"
check_interval = "30s"
timeout = "10s"
failure_threshold = 3
restart_timeout = "60s"
base_backoff = "30s"
max_backoff = "5m"
cooldown_checks = 5
metrics_addr = ":9090"
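To make the interaction between base_backoff and max_backoff concrete, here is a minimal sketch of a doubling schedule with a cap and jitter. The function names are hypothetical, and the 10% jitter range is an assumption; the README does not state how much jitter the watchdog applies.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Values copied from the sample watchdog.toml above.
const (
	baseBackoff = 30 * time.Second
	maxBackoff  = 5 * time.Minute
)

// backoff returns the un-jittered delay before restart attempt n (0-based):
// base, 2*base, 4*base, ... capped at max.
func backoff(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// withJitter adds up to 10% of random extra delay so that several
// watchdogs do not restart in lockstep. The 10% figure is an assumption.
func withJitter(d time.Duration) time.Duration {
	if d <= 0 {
		return d
	}
	return d + time.Duration(rand.Int63n(int64(d)/10+1))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		d := backoff(baseBackoff, maxBackoff, attempt)
		fmt.Printf("attempt %d: base delay %v, with jitter %v\n", attempt, d, withJitter(d))
	}
}
```

With these settings the un-jittered delays run 30s, 1m, 2m, 4m and then stay at 5m, matching the progression described in the feature list.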

Guardian Mode (Multi-Service)

# Multi-service orchestration
[[services]]
name = "gateway"
type = "http"
health_url = "http://localhost:18789/health"
restart_command = ["systemctl", "--user", "restart", "openclaw-gateway"]
check_interval = "30s"
failure_threshold = 3

[[services]]
name = "claude-mem-worker"
type = "http"
health_url = "http://localhost:37777/api/readiness"
restart_command = ["systemctl", "--user", "restart", "claude-mem-worker"]
depends_on = ["gateway"]

[deadlock]
max_restarts = 5
window = "10m"
half_open_after = "2m"
success_to_close = 3
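The depends_on field implies a startup order in which each service comes after everything it depends on. A minimal sketch of that ordering using Kahn's algorithm follows; the `service` type and `startOrder` function are hypothetical and only illustrate the idea, not the watchdog's internal scheduler.

```go
package main

import (
	"fmt"
	"sort"
)

// service holds just the fields relevant to ordering.
type service struct {
	Name      string
	DependsOn []string
}

// startOrder returns service names so that every service appears after
// all of its dependencies. It fails on unknown dependencies and on cycles.
func startOrder(services []service) ([]string, error) {
	indegree := make(map[string]int, len(services))
	dependents := make(map[string][]string)
	for _, s := range services {
		indegree[s.Name] = 0
	}
	for _, s := range services {
		for _, dep := range s.DependsOn {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("service %q depends on unknown service %q", s.Name, dep)
			}
			indegree[s.Name]++
			dependents[dep] = append(dependents[dep], s.Name)
		}
	}
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // map iteration is random; sort for a stable result
	var order []string
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		for _, d := range dependents[name] {
			indegree[d]--
			if indegree[d] == 0 {
				ready = append(ready, d)
			}
		}
	}
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := startOrder([]service{
		{Name: "claude-mem-worker", DependsOn: []string{"gateway"}},
		{Name: "gateway"},
	})
	fmt.Println(order, err)
}
```

Shutdown would typically walk the same list in reverse, so that dependents stop before the services they rely on.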

More configuration options: See docs/configuration.md

Auto-Detected Paths

All paths are auto-detected based on platform. No configuration required:

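| Platform | Config Location                                  | Data Location                                    |
|----------|--------------------------------------------------|--------------------------------------------------|
| Linux    | ~/.config/openclaw-watchdog/                     | ~/.local/share/openclaw-watchdog/                |
| macOS    | ~/Library/Application Support/openclaw-watchdog/ | ~/Library/Application Support/openclaw-watchdog/ |
| Windows  | %APPDATA%\openclaw-watchdog\                     | %APPDATA%\openclaw-watchdog\                     |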
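The config locations in the table line up with what Go's standard os.UserConfigDir returns on each platform, so one plausible way to derive them is sketched below. Whether the watchdog actually uses this call is an assumption, and `configDir` is a hypothetical helper. Note that on Linux os.UserConfigDir honors $XDG_CONFIG_HOME when it is set, and that the standard library has no equivalent helper for the Linux data location (~/.local/share).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const appName = "openclaw-watchdog"

// configDir returns the per-user config directory for the app:
// $HOME/.config/<app> on Linux (unless XDG_CONFIG_HOME is set),
// $HOME/Library/Application Support/<app> on macOS,
// %AppData%\<app> on Windows.
func configDir() (string, error) {
	base, err := os.UserConfigDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(base, appName), nil
}

func main() {
	dir, err := configDir()
	if err != nil {
		fmt.Println("could not determine config dir:", err)
		return
	}
	fmt.Println(filepath.Join(dir, "watchdog.toml"))
}
```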

Prometheus Metrics

Enable the metrics server with -metrics-addr :9090, then query the endpoint:

curl http://localhost:9090/metrics

Available metrics:

| Metric                                    | Type    | Description                 |
|-------------------------------------------|---------|-----------------------------|
| openclaw_watchdog_total_checks            | counter | Total health checks         |
| openclaw_watchdog_failed_checks           | counter | Failed health checks        |
| openclaw_watchdog_timeout_checks          | counter | Timeout events (PR #4)      |
| openclaw_watchdog_unreachable_checks      | counter | Unreachable events (PR #4)  |
| openclaw_watchdog_total_restarts          | counter | Gateway restarts triggered  |
| openclaw_watchdog_current_backoff_seconds | gauge   | Current backoff duration    |
| openclaw_watchdog_consecutive_fails       | gauge   | Consecutive failure count   |
| openclaw_watchdog_last_check_latency_ms   | gauge   | Last check latency          |
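If you want to read one of these values from a script rather than a Prometheus server, the text exposition format is simple enough to parse by hand. The sketch below assumes the metrics are exposed without labels; `metricValue` is a hypothetical helper, and `sampleScrape` contains made-up numbers for illustration, not real watchdog output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleScrape is illustrative exposition text with invented values.
const sampleScrape = `# TYPE openclaw_watchdog_total_checks counter
openclaw_watchdog_total_checks 120
# TYPE openclaw_watchdog_consecutive_fails gauge
openclaw_watchdog_consecutive_fails 2
`

// metricValue finds an unlabelled sample named name in a /metrics body.
func metricValue(scrape, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	if v, ok := metricValue(sampleScrape, "openclaw_watchdog_consecutive_fails"); ok {
		fmt.Println("consecutive fails:", v)
	}
}
```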

Service Management

Linux (systemd)

# Start
systemctl --user start openclaw-watchdog

# Stop
systemctl --user stop openclaw-watchdog

# Restart
systemctl --user restart openclaw-watchdog

# View status
systemctl --user status openclaw-watchdog

# View logs
journalctl --user -u openclaw-watchdog -f

# Uninstall
make uninstall-service

macOS (launchd)

# Load service
launchctl load ~/Library/LaunchAgents/openclaw-watchdog.plist

# Unload service
launchctl unload ~/Library/LaunchAgents/openclaw-watchdog.plist

# View logs
tail -f ~/Library/Logs/openclaw-watchdog.log

Documentation

Full Documentation Index: docs/README.md

Troubleshooting

Watchdog not restarting gateway

  1. Check watchdog status: systemctl --user status openclaw-watchdog
  2. Verify restart command works manually
  3. Check logs: journalctl --user -u openclaw-watchdog -n 50

mcp-server processes leaking

Ensure Guardian mode is enabled with [[services]] configured. See docs/troubleshooting.md for details.

Service stuck in restart loop

The deadlock circuit breaker blocks further restarts after too many failures. Search the logs for circuit breaker open, then investigate the root cause.
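To reason about when restarts are blocked, it can help to model the breaker as a three-state machine driven by the [deadlock] settings. The sketch below is one plausible reading of max_restarts, window, half_open_after and success_to_close; the exact semantics in the watchdog may differ, and every type and function name here is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

type breaker struct {
	maxRestarts    int
	window         time.Duration
	halfOpenAfter  time.Duration
	successToClose int

	state     breakerState
	restarts  []time.Time // restarts inside the current window
	openedAt  time.Time
	successes int
}

// epoch is a fixed reference time so the demo is deterministic.
var epoch = time.Unix(0, 0)

// newBreaker uses the values from the sample [deadlock] section.
func newBreaker() *breaker {
	return &breaker{maxRestarts: 5, window: 10 * time.Minute,
		halfOpenAfter: 2 * time.Minute, successToClose: 3}
}

// allowRestart reports whether a restart may run at now, recording it if so.
func (b *breaker) allowRestart(now time.Time) bool {
	switch b.state {
	case stateOpen:
		if now.Sub(b.openedAt) < b.halfOpenAfter {
			return false
		}
		b.state, b.successes = stateHalfOpen, 0 // let one probe restart through
		return true
	case stateHalfOpen:
		return false // probe already issued; wait for health results
	}
	kept := b.restarts[:0]
	for _, t := range b.restarts {
		if now.Sub(t) < b.window {
			kept = append(kept, t)
		}
	}
	b.restarts = kept
	if len(b.restarts) >= b.maxRestarts {
		b.state, b.openedAt = stateOpen, now
		return false
	}
	b.restarts = append(b.restarts, now)
	return true
}

// recordCheck feeds a health check result to a half-open breaker.
func (b *breaker) recordCheck(healthy bool, now time.Time) {
	if b.state != stateHalfOpen {
		return
	}
	if !healthy {
		b.state, b.openedAt = stateOpen, now
		return
	}
	b.successes++
	if b.successes >= b.successToClose {
		b.state, b.restarts = stateClosed, nil
	}
}

func main() {
	b, now := newBreaker(), epoch
	for i := 1; i <= 6; i++ {
		fmt.Printf("restart %d allowed: %v\n", i, b.allowRestart(now))
		now = now.Add(30 * time.Second)
	}
}
```

Under this reading, a breaker that stays open means the restart command itself is not fixing the problem, so the next step is the gateway's own logs rather than the watchdog's.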

More troubleshooting: docs/troubleshooting.md

Build from Source

Prerequisites

  • Go 1.21+
  • GNU Make (optional)

Build Commands

# Build for current platform
make build

# Cross-compile for all platforms
make build-all

# Run tests
make test

# Run tests with the race detector and coverage
go test ./... -race -cover

Contributing

IMPORTANT: All contributions MUST follow the Git Workflow. Do not push directly to the main repository.

  1. Fork the repository
  2. Create a feature branch
  3. Make changes and run tests
  4. Push to your fork (NOT upstream)
  5. Create Pull Request

See docs/git-workflow.md for the complete workflow with real examples.

License

MIT
