
Design note: gopher-agent on client machines (gRPC) — replaces SSH for ongoing ops, post-v1.0 #84

@smalex-z

Description


TL;DR

Gopher's biggest architectural gap isn't tunnel internals — it's that the VPS has no direct channel to client machines for anything beyond bootstrap. Status comes from rathole's own connection, config updates require SSH, logs aren't streamed, and there's no place to put health checks or self-heal. A small gopher-agent running on each client (gRPC) is the natural fix.

This is not a v1.0 deliverable. The actionable part for now is shaping the bootstrap flow + Machine model so an agent can drop in later additively.

What's missing today

The current control path is:

VPS ──SSH (bootstrap only)──► Client
VPS ◄──rathole tunnel data───► Client     (no control plane on top)

Concretely that means:

  • "Is the client healthy?" → only signal is whether the rathole tunnel is up. CPU/disk/memory pressure on the client is invisible.
  • "Push a new rathole config" → requires SSHing back into the client: it reuses the bootstrap key, re-prompts for sudo, and is brittle.
  • "Stream rathole logs to the dashboard" → not possible; logs only exist on the client, dashboard would need to SSH+journalctl -f.
  • "Restart rathole because it crashed" → no remote affordance. User has to log in.

Each of those is its own tracked issue (#12, #53, #42, #79); a gopher-agent is the common substrate that lets all of them be implemented cleanly instead of as four bespoke side-channels.

Proposed shape

gopher-agent is a small daemon that ships alongside rathole-client, exposes a gRPC server on a known local port, and is reached from the VPS through the existing rathole tunnel (so no new firewall rules and no public listener on the client). The VPS opens a gRPC connection over the tunnel back-channel.

A compact proto sketch — full schema is implementation work, this is just the shape:

service Agent {
  rpc GetStatus(GetStatusRequest)              returns (GetStatusResponse);
  rpc StreamMetrics(StreamMetricsRequest)      returns (stream Metric);
  rpc UpdateRatholeConfig(UpdateConfigRequest) returns (UpdateConfigResponse);
  rpc RestartRathole(RestartRequest)           returns (RestartResponse);
  rpc GetLogs(GetLogsRequest)                  returns (stream LogEntry);
  rpc RunDiagnostics(DiagnosticsRequest)       returns (DiagnosticsResponse);
}
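To make the status payload concrete, here is a minimal Go sketch of what the agent could report per GetStatus call. The struct and field names mirror the `{cpu, mem, disk, rathole_running, active_tunnels}` shape described below but are assumptions, not the final proto schema; `Healthy` is a hypothetical helper showing the kind of thresholding the VPS could do once it has real system data.

```go
package main

// Status is a hypothetical mirror of the GetStatusResponse proto message;
// field names and types are assumptions, not the final schema.
type Status struct {
	CPUPercent     float64
	MemPercent     float64
	DiskPercent    float64
	RatholeRunning bool
	ActiveTunnels  int
}

// Healthy applies illustrative thresholds: this is the judgment the VPS
// cannot make today, when the only signal is "is the tunnel up".
func (s Status) Healthy() bool {
	return s.RatholeRunning && s.CPUPercent < 95 && s.MemPercent < 95 && s.DiskPercent < 90
}
```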

The four things that earn their keep:

  1. Status — one round-trip returns {cpu, mem, disk, rathole_running, active_tunnels}. Replaces the rathole-up/down approximation we have today.
  2. Config push — VPS writes a new client.toml directly via the agent + signals reload. Removes SSH from the steady-state hot path; SSH stays only for first-bootstrap.
  3. Streaming logs / metrics — dashboard can subscribe instead of polling /api/tunnels every N seconds. Lower server load, real-time UX.
  4. Diagnostics — typed RunDiagnostics() returns structured pass/fail across "can reach VPS", "rathole config valid", "ports open" — feeds the "Diagnose" affordance instead of asking users to read logs.
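The diagnostics item is the easiest to sketch: a list of named checks, each returning structured pass/fail, which the dashboard renders directly. This is a Go sketch under assumptions — the `Check`/`Result` types and the example check names are illustrative, not a committed API.

```go
package main

import "errors"

// Check is one named diagnostic probe. Real checks would hit the network
// and filesystem; the signature here is an assumption.
type Check struct {
	Name string
	Run  func() error
}

// Result is the structured pass/fail the dashboard can render, instead of
// asking users to read logs.
type Result struct {
	Name string
	Pass bool
	Err  string
}

// RunDiagnostics executes every check in order and never aborts early, so
// the user sees the full picture in one pass.
func RunDiagnostics(checks []Check) []Result {
	out := make([]Result, 0, len(checks))
	for _, c := range checks {
		r := Result{Name: c.Name, Pass: true}
		if err := c.Run(); err != nil {
			r.Pass = false
			r.Err = err.Error()
		}
		out = append(out, r)
	}
	return out
}

// Example checks following the names above; the failure is simulated.
var exampleChecks = []Check{
	{Name: "can reach VPS", Run: func() error { return nil }},
	{Name: "ports open", Run: func() error { return errors.New("port 2333 closed") }},
}
```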

Self-healing falls out for free once status + restart exist:

if !status.Rathole.Running { agent.RestartRathole() }

What not to lock out today

The only near-term cost of this design note is keeping a few doors open so the agent is additive when it lands:

  • Bootstrap script structure — keep templates/bootstrap.sh shaped so adding a "fetch + install gopher-agent" step is a single insertion, not a rewrite. The script already installs rathole-client + a systemd unit; the agent is the same shape.
  • Machine model — leave room for an agent_addr / agent_status column. Don't need to add it now; just don't paint into a corner that would require a destructive migration to introduce it.
  • Status reporting plumbing — when MachineService.Status() is touched, prefer one funnel that returns (rathole_status, optional system_status) rather than scattering "is rathole up" checks across handlers. Makes "now also surface agent-derived data" a one-place change.
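The status-funnel point can be sketched concretely: one return shape where rathole state is always present and agent-derived system data is optional until the agent exists. Type and field names are assumptions, not the actual MachineService code.

```go
package main

import "fmt"

// SystemStatus is the agent-derived half; nil until gopher-agent reports in.
type SystemStatus struct {
	CPUPercent, MemPercent, DiskPercent float64
}

// MachineStatus is the single funnel: handlers consume this instead of
// scattering "is rathole up" checks.
type MachineStatus struct {
	RatholeUp bool
	System    *SystemStatus
}

// Summary renders one line whether or not agent data is available yet, so
// surfacing agent-derived data later is a one-place change.
func (m MachineStatus) Summary() string {
	s := "tunnel down"
	if m.RatholeUp {
		s = "tunnel up"
	}
	if m.System != nil {
		s += fmt.Sprintf(" (cpu %.0f%%, mem %.0f%%, disk %.0f%%)",
			m.System.CPUPercent, m.System.MemPercent, m.System.DiskPercent)
	}
	return s
}
```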

That's it. No protobuf in the tree yet, no agent binary yet.

Out of scope (record + park)

  • Splitting the gopher-server binary into microservices (ConfigService / TunnelService / MonitoringService / etc.). Different conversation; only worth having when scale demands it. Mentioning so future-me doesn't conflate it with this issue.
  • Replacing the public REST API with gRPC. Browsers and integrators expect REST. Public surface stays REST; gRPC is the internal control transport.
  • gRPC-web for the dashboard. Live updates from VPS → browser are the natural next layer once the agent → VPS streaming exists, but the transport choice (gRPC-web vs. WebSocket vs. SSE) is its own decision and the simpler answer (SSE/WebSocket from VPS) is probably right.

Phasing

Soft phasing, not commitments:

  • Now (v1.x) — keep doing things the current way; design the touch points above so they don't preclude an agent.
  • When #12, #53, #79 start coming due — that's the inflection point. Their cleanest implementation is the agent; doing each as a one-off side-channel is throw-away work.
  • Later — agent is required, SSH narrows to first-bootstrap only.

Related

  • #12 — Health monitoring & auto-recovery. Most direct beneficiary; the agent IS the substrate for this.
  • #53 — Tunnel latency & network diagnostics. RunDiagnostics() covers it.
  • #42 — Caddy / rathole logs. GetLogs(stream) covers it.
  • #79 — Performance tracking & resource alerts. StreamMetrics covers it.
  • #78 — QUIC-based tunneling eval. Adjacent transport-layer rethink; coordinate the two so we don't relitigate the back-channel twice.
