
Fix/show management tunnels #86

Merged
smalex-z merged 7 commits into main from fix/show-management-tunnels
May 5, 2026

Conversation

@smalex-z (Owner) commented May 5, 2026

Description

Resolves #__
Supports #84

Changes Made

  • Completely reworks the client-side architecture; a client Go binary is now required.

Testing

  • Tested locally
  • Added/updated unit tests
  • All tests passing

Screenshots (if applicable)

smalex-z added 7 commits May 5, 2026 06:07
The agent control plane runs as a private rathole service entry per
machine — same shape as the existing machine-ssh tunnel but bound to
127.0.0.1 on both sides. Until now it was hidden from the dashboard,
making it confusing when operators noticed an extra port on the VPS.

Synthesizes a Tunnel row with Kind: "machine-agent" alongside the
existing machine-ssh entry whenever the machine has agent fields
populated. Status is derived from AgentLastSeen freshness: active if
the agent has been seen within two missed health polls, otherwise
offline, and pending before the first poll lands.

Tunnels page sort now pins management entries (SSH, then agent) to
the top of each machine's group so user tunnels stay together below
them. Managed=true is already in place so the existing protected-
tunnel guard prevents accidental deletion.
…up + notifies server

Three things I missed in the agent rollout:

1. Bootstrap.sh was still installing the agent as User=$SSH_USER with no
   gopher system user and no /etc/sudoers.d/gopher entry. Migrate.sh on
   existing machines created the gopher user with NOPASSWD: ALL, but
   fresh bootstraps ended up in a different shape — inconsistent.
   Bootstrap now mirrors migrate.sh exactly: creates the gopher user,
   writes the sudoers rule, chowns /etc/rathole/client.toml + the agent
   config to gopher, and installs the systemd unit with User=gopher.

2. gopher-uninstall.sh had leftover "user-mode" cleanup (kill ~/bin/gopher-
   agent, strip user crontab) from an abandoned design path, AND it never
   removed the gopher system user. Stripped the dead code; added userdel
   gopher AFTER the agent service is stopped; sudoers cleanup runs last
   so prior steps could still use sudo.

3. When an operator runs gopher-uninstall locally on a client, the
   dashboard's machine list went stale — there was no callback to delete
   the server-side record. Now the script POSTs to /api/machines/self-
   delete with its agent token before tearing down. The endpoint resolves
   the token to a machine via db.GetMachineByAgentToken and calls
   MachineService.DeleteFromClient — a new variant of Delete that skips
   the remote-uninstall step (we're already running it locally) but
   still does server-side cleanup (tunnels, Caddy, rathole reconcile,
   machine row).

The notification is best-effort: if it fails (no curl, expired DNS,
server unreachable), local cleanup proceeds normally and the operator
can still delete the machine from the dashboard manually.
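The self-delete path in point 3 could look roughly like this, with the HTTP plumbing stripped out. Only GetMachineByAgentToken and DeleteFromClient are named in the PR; the function-parameter shape and status-code returns are an assumption:

```go
package main

import "errors"

// Illustrative machine record; only the token-to-machine lookup is from the PR.
type Machine struct{ ID int64 }

// selfDelete resolves the caller's agent token to a machine and runs the
// client-initiated delete variant, which skips the remote-uninstall step
// (the client is already running it locally) but keeps server-side cleanup.
// It returns the HTTP status the handler should respond with.
func selfDelete(
	token string,
	getMachineByAgentToken func(string) (*Machine, error),
	deleteFromClient func(int64) error,
) (int, error) {
	if token == "" {
		return 401, errors.New("missing agent token")
	}
	m, err := getMachineByAgentToken(token)
	if err != nil || m == nil {
		return 401, errors.New("unknown agent token")
	}
	if err := deleteFromClient(m.ID); err != nil {
		return 500, err
	}
	return 204, nil
}
```

Returning 401 for an unknown token (rather than 404) avoids confirming to an unauthenticated caller whether a machine record exists.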

The dashboard kept reporting machines as "connected" even after
rathole-client was removed, as long as the gopher binary was alive.

The agent's status endpoint already reports rathole.Active separately,
but checkViaAgent's only response to "agent up, rathole down" was to log
a failed health check and try recovery — it never updated machine.Status.
Combined with monitor.go skipping agent-installed machines, nothing flipped
the status off "connected" from the last good poll.

Now when the agent answers but reports rathole inactive, we explicitly
flip the machine to offline via a new SetMachineAgentDegraded helper.
AgentLastSeen still updates (the back-channel works), but Status reflects
the actual tunnel-serving capability. Synthesized SSH tunnel rows derive
from machine.Status and will correctly show offline once this lands.

The tunnel-status path (monitor.go's checkTunnels) already TCP-probes
rathole bind ports independently, so per-tunnel statuses were correct —
this fix is specifically about machine.Status.
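The decision above can be sketched as follows; only SetMachineAgentDegraded is named in the commit, and the AgentStatus and Store shapes are hypothetical:

```go
package main

// Illustrative agent health-poll response; only rathole.Active is from the PR.
type AgentStatus struct {
	RatholeActive bool
}

// Store uses function fields so the sketch is testable without a DB.
type Store struct {
	TouchAgentLastSeen      func(machineID int64) error
	SetMachineAgentDegraded func(machineID int64) error // flips Status to offline
}

// handleAgentPoll runs after the agent answers a health poll. The
// back-channel worked, so AgentLastSeen always updates; but if the agent
// reports rathole inactive, the machine is explicitly flipped offline so
// Status reflects actual tunnel-serving capability.
func handleAgentPoll(s Store, machineID int64, st AgentStatus) error {
	if err := s.TouchAgentLastSeen(machineID); err != nil {
		return err
	}
	if !st.RatholeActive {
		return s.SetMachineAgentDegraded(machineID)
	}
	return nil
}
```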
…ootstrap grace window

Three real bugs from the same agent rollout:

1. The agent's systemd unit defaulted to KillMode=control-group, which
   means systemctl-stopping gopher-agent kills its entire cgroup —
   including the detached gopher-uninstall worker spawned from
   POST /uninstall. The script gets murdered partway through cleanup.
   Both bootstrap.sh and migrate.sh now set KillMode=process so only
   the main agent dies and children continue.

   This also explains why "delete machine" on the dashboard appeared
   to do nothing on the client side: the cleanup STARTED but got
   killed before it could finish (or before it got to the self-rm
   line, which is why gopher-uninstall didn't delete itself either).

2. gopher-uninstall.sh's self-destruct line used plain `rm -f` instead
   of `$SUDO rm -f`. When invoked as root via `sudo gopher-uninstall`
   that worked, but if the script ever ran without sudo elevation
   (or got partially killed before reaching it) the binary survived.
   Added $SUDO and moved it BEFORE the sudoers cleanup so the
   privilege is still in scope.

3. The migration banner showed "agent isn't set up" for machines that
   were freshly bootstrapped — agent_installed=false until the first
   successful health poll (~60s after bootstrap). MachinesWithoutAgent
   now excludes machines under 10 minutes old. Bootstrap inline-installs
   the agent + the health service polls every 60s, so any machine still
   missing the agent flag after 10 minutes is a real problem; before
   that, it's just installation latency.
NextSSHTunnelPort() and NextRatholePort() were line-for-line identical
(both walked allUsedPorts() from 1024 looking for the first gap), and
bootstrap.go called them back-to-back with no DB write in between —
so they returned the same port. The Machine row ended up with
TunnelPort == AgentRemotePort, rathole-server tried to bind two
services to the same address, and the back-channel was permanently
broken on every freshly bootstrapped machine.

NextRatholePort now takes a variadic excluding list. Bootstrap passes
the SSH tunnel port to the second call so it can't be reused for the
agent. NextSSHTunnelPort is removed — it was a duplicate name for the
same function, and consolidating prevents this footgun from coming
back later. Added a regression test that fails on the old behavior.
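A minimal sketch of the consolidated allocator, with the used-port set injected so the logic is testable (the PR names only the variadic exclude list; the map parameter and port ceiling are assumptions):

```go
package main

// nextRatholePort returns the first free port at or above 1024 that is
// neither already used nor in the caller's exclude list. In the real code
// the used set would come from allUsedPorts(); it is injected here.
func nextRatholePort(used map[int]bool, exclude ...int) int {
	ex := make(map[int]bool, len(exclude))
	for _, p := range exclude {
		ex[p] = true
	}
	for p := 1024; p <= 65535; p++ {
		if !used[p] && !ex[p] {
			return p
		}
	}
	return 0 // port space exhausted
}
```

Bootstrap would then call it twice with no DB write in between, passing the first result as an exclusion so the SSH tunnel port and agent remote port can never collide: `ssh := nextRatholePort(used)` followed by `agent := nextRatholePort(used, ssh)`.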
Both pages had refetchInterval set to false, so changes driven by the
health service (60s poll loop) and monitor (30s TCP probes) only became
visible after a manual refresh. Network Map already polled at 30s; now
the rest of the dashboard matches.

Machines page keeps its 3s burst-refresh during bootstrap-waiting so the
"machine registered!" success state still flips fast — only the
steady-state behavior changes from "static" to "30s".
Bumped steady-state refresh on Machines + Tunnels pages from 30s to 15s
per request, plus a 5s middle tier on the Machines page while any
machine is fresh (created < 5 min ago) or still in "pending" status.
The post-bootstrap window is exactly when status flips happen fastest
— rathole connecting, agent installing inline, first health poll
landing — so 5s polling there means the operator sees the machine
go pending → connected → agent-installed without manual refresh.

refetchInterval is computed via the function form of react-query so
the cadence self-adjusts: once every machine has settled and aged past
5 minutes, polling drops back to 15s automatically. No timers, no
state, just a derived rate from the current data.

github-actions Bot commented May 5, 2026

Unit tests run: 275
Unit tests passed: 275
Test coverage: 25.6%


gitguardian Bot commented May 5, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  GitGuardian id: 32478099
  Status:         Triggered
  Secret type:    Generic Password
  Commit:         89b8243
  Filename:       internal/api/handlers/templates/bootstrap.sh

@smalex-z smalex-z merged commit 56c839b into main May 5, 2026
5 checks passed
@smalex-z smalex-z deleted the fix/show-management-tunnels branch May 5, 2026 21:36
