
Fix/show management tunnels #86

Merged
smalex-z merged 7 commits into main from fix/show-management-tunnels
May 5, 2026

Conversation

@smalex-z (Owner) commented May 5, 2026

Description

Resolves #__
Supports #84

Changes Made

  • Completely reworks the client-side architecture; a client Go binary is now required.

Testing

  • Tested locally
  • Added/updated unit tests
  • All tests passing

Screenshots (if applicable)

smalex-z added 7 commits May 5, 2026 06:07
The agent control plane runs as a private rathole service entry per
machine — same shape as the existing machine-ssh tunnel but bound to
127.0.0.1 on both sides. Until now it was hidden from the dashboard,
making it confusing when operators noticed an extra port on the VPS.

Synthesizes a Tunnel row with Kind: "machine-agent" alongside the
existing machine-ssh entry whenever the machine has agent fields
populated. Status is derived from AgentLastSeen freshness: active if
the agent has been seen within two missed health polls, otherwise
offline, and pending before the first poll lands.

Tunnels page sort now pins management entries (SSH, then agent) to
the top of each machine's group so user tunnels stay together below
them. Managed=true is already in place so the existing protected-
tunnel guard prevents accidental deletion.
…up + notifies server

Three things I missed in the agent rollout:

1. Bootstrap.sh was still installing the agent as User=$SSH_USER with no
   gopher system user and no /etc/sudoers.d/gopher entry. Migrate.sh on
   existing machines created the gopher user with NOPASSWD: ALL, but
   fresh bootstraps ended up in a different shape — inconsistent.
   Bootstrap now mirrors migrate.sh exactly: creates the gopher user,
   writes the sudoers rule, chowns /etc/rathole/client.toml + the agent
   config to gopher, and installs the systemd unit with User=gopher.

2. gopher-uninstall.sh had leftover "user-mode" cleanup (kill ~/bin/gopher-
   agent, strip user crontab) from an abandoned design path, AND it never
   removed the gopher system user. Stripped the dead code; added userdel
   gopher AFTER the agent service is stopped; sudoers cleanup runs last
   so prior steps could still use sudo.

3. When an operator runs gopher-uninstall locally on a client, the
   dashboard's machine list went stale — there was no callback to delete
   the server-side record. Now the script POSTs to /api/machines/self-
   delete with its agent token before tearing down. The endpoint resolves
   the token to a machine via db.GetMachineByAgentToken and calls
   MachineService.DeleteFromClient — a new variant of Delete that skips
   the remote-uninstall step (we're already running it locally) but
   still does server-side cleanup (tunnels, Caddy, rathole reconcile,
   machine row).

The notification is best-effort: if it fails (no curl, expired DNS,
server unreachable), local cleanup proceeds normally and the operator
can still delete the machine from the dashboard manually.
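The self-delete path in point 3 could look roughly like this, with the HTTP plumbing stripped out. Only GetMachineByAgentToken and DeleteFromClient are named in the PR; the function-parameter shape and status-code returns are an assumption:

```go
package main

import "errors"

// Illustrative machine record; only the token-to-machine lookup is from the PR.
type Machine struct{ ID int64 }

// selfDelete resolves the caller's agent token to a machine and runs the
// client-initiated delete variant, which skips the remote-uninstall step
// (the client is already running it locally) but keeps server-side cleanup.
// It returns the HTTP status the handler should respond with.
func selfDelete(
	token string,
	getMachineByAgentToken func(string) (*Machine, error),
	deleteFromClient func(int64) error,
) (int, error) {
	if token == "" {
		return 401, errors.New("missing agent token")
	}
	m, err := getMachineByAgentToken(token)
	if err != nil || m == nil {
		return 401, errors.New("unknown agent token")
	}
	if err := deleteFromClient(m.ID); err != nil {
		return 500, err
	}
	return 204, nil
}
```

Returning 401 for an unknown token (rather than 404) avoids confirming to an unauthenticated caller whether a machine record exists.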

The dashboard kept reporting machines as "connected" even after
rathole-client was removed, as long as the gopher binary was alive.

The agent's status endpoint already reports rathole.Active separately,
but checkViaAgent's only response to "agent up, rathole down" was to log
a failed health check and try recovery — it never updated machine.Status.
Combined with monitor.go skipping agent-installed machines, nothing flipped
the status off "connected" from the last good poll.

Now when the agent answers but reports rathole inactive, we explicitly
flip the machine to offline via a new SetMachineAgentDegraded helper.
AgentLastSeen still updates (the back-channel works), but Status reflects
the actual tunnel-serving capability. Synthesized SSH tunnel rows derive
from machine.Status and will correctly show offline once this lands.

The tunnel-status path (monitor.go's checkTunnels) already TCP-probes
rathole bind ports independently, so per-tunnel statuses were correct —
this fix is specifically about machine.Status.
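The decision above can be sketched as follows; only SetMachineAgentDegraded is named in the commit, and the AgentStatus and Store shapes are hypothetical:

```go
package main

// Illustrative agent health-poll response; only rathole.Active is from the PR.
type AgentStatus struct {
	RatholeActive bool
}

// Store uses function fields so the sketch is testable without a DB.
type Store struct {
	TouchAgentLastSeen      func(machineID int64) error
	SetMachineAgentDegraded func(machineID int64) error // flips Status to offline
}

// handleAgentPoll runs after the agent answers a health poll. The
// back-channel worked, so AgentLastSeen always updates; but if the agent
// reports rathole inactive, the machine is explicitly flipped offline so
// Status reflects actual tunnel-serving capability.
func handleAgentPoll(s Store, machineID int64, st AgentStatus) error {
	if err := s.TouchAgentLastSeen(machineID); err != nil {
		return err
	}
	if !st.RatholeActive {
		return s.SetMachineAgentDegraded(machineID)
	}
	return nil
}
```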
…ootstrap grace window

Three real bugs from the same agent rollout:

1. The agent's systemd unit defaulted to KillMode=control-group, which
   means systemctl-stopping gopher-agent kills its entire cgroup —
   including the detached gopher-uninstall worker spawned from
   POST /uninstall. The script gets murdered partway through cleanup.
   Both bootstrap.sh and migrate.sh now set KillMode=process so only
   the main agent dies and children continue.

   This also explains why "delete machine" on the dashboard appeared
   to do nothing on the client side: the cleanup STARTED but got
   killed before it could finish (or before it got to the self-rm
   line, which is why gopher-uninstall didn't delete itself either).

2. gopher-uninstall.sh's self-destruct line used plain `rm -f` instead
   of `$SUDO rm -f`. When invoked as root via `sudo gopher-uninstall`
   that worked, but if the script ever ran without sudo elevation
   (or got partially killed before reaching it) the binary survived.
   Added $SUDO and moved it BEFORE the sudoers cleanup so the
   privilege is still in scope.

3. The migration banner showed "agent isn't set up" for machines that
   were freshly bootstrapped — agent_installed=false until the first
   successful health poll (~60s after bootstrap). MachinesWithoutAgent
   now excludes machines under 10 minutes old. Bootstrap inline-installs
   the agent + the health service polls every 60s, so any machine still
   missing the agent flag after 10 minutes is a real problem; before
   that, it's just installation latency.
NextSSHTunnelPort() and NextRatholePort() were line-for-line identical
(both walked allUsedPorts() from 1024 looking for the first gap), and
bootstrap.go called them back-to-back with no DB write in between —
so they returned the same port. The Machine row ended up with
TunnelPort == AgentRemotePort, rathole-server tried to bind two
services to the same address, and the back-channel was permanently
broken on every freshly bootstrapped machine.

NextRatholePort now takes a variadic excluding list. Bootstrap passes
the SSH tunnel port to the second call so it can't be reused for the
agent. NextSSHTunnelPort is removed — it was a duplicate name for the
same function, and consolidating prevents this footgun from coming
back later. Added a regression test that fails on the old behavior.
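A minimal sketch of the consolidated allocator, with the used-port set injected so the logic is testable (the PR names only the variadic exclude list; the map parameter and port ceiling are assumptions):

```go
package main

// nextRatholePort returns the first free port at or above 1024 that is
// neither already used nor in the caller's exclude list. In the real code
// the used set would come from allUsedPorts(); it is injected here.
func nextRatholePort(used map[int]bool, exclude ...int) int {
	ex := make(map[int]bool, len(exclude))
	for _, p := range exclude {
		ex[p] = true
	}
	for p := 1024; p <= 65535; p++ {
		if !used[p] && !ex[p] {
			return p
		}
	}
	return 0 // port space exhausted
}
```

Bootstrap would then call it twice with no DB write in between, passing the first result as an exclusion so the SSH tunnel port and agent remote port can never collide: `ssh := nextRatholePort(used)` followed by `agent := nextRatholePort(used, ssh)`.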
Both pages had refetchInterval set to false, so changes driven by the
health service (60s poll loop) and monitor (30s TCP probes) only became
visible after a manual refresh. Network Map already polled at 30s; now
the rest of the dashboard matches.

Machines page keeps its 3s burst-refresh during bootstrap-waiting so the
"machine registered!" success state still flips fast — only the
steady-state behavior changes from "static" to "30s".
Bumped steady-state refresh on Machines + Tunnels pages from 30s to 15s
per request, plus a 5s middle tier on the Machines page while any
machine is fresh (created < 5 min ago) or still in "pending" status.
The post-bootstrap window is exactly when status flips happen fastest
— rathole connecting, agent installing inline, first health poll
landing — so 5s polling there means the operator sees the machine
go pending → connected → agent-installed without manual refresh.

refetchInterval is computed via the function form of react-query so
the cadence self-adjusts: once every machine has settled and aged past
5 minutes, polling drops back to 15s automatically. No timers, no
state, just a derived rate from the current data.

github-actions Bot commented May 5, 2026

Unit tests run: 275
Unit tests passed: 275
Test coverage: 25.6%


gitguardian Bot commented May 5, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  GitGuardian id: 32478099
  Status:         Triggered
  Secret type:    Generic Password
  Commit:         89b8243
  Filename:       internal/api/handlers/templates/bootstrap.sh

@smalex-z smalex-z merged commit 56c839b into main May 5, 2026
5 checks passed
@smalex-z smalex-z deleted the fix/show-management-tunnels branch May 5, 2026 21:36
