@luislavena
Contributor

Hello!

This is a follow-up to #131 and #152 that completes the usage of the ssh CLI tool across all possible connections.

Why does this guy keep insisting on shelling out to SSH? 🤔

Well, bear with me as I rant over the beauty of dealing with legacy and complicated network setups and security scenarios 😅

Sometimes machines that you need to SSH into might not have a direct connection to you: they might be within private networks or require different tricks to access them.

Non-publicly exposed machines

Certain nodes on a network might only have private IP addresses, but you can connect to them using SSH's built-in ProxyJump support, allowing you to use a machine that is exposed to the outside world and can talk to the internal network (also known as jump boxes or bastion servers).

That machine itself might use a different SSH key than the one you need to SSH into your node, because security! 🤡

You could say all this can be implemented with Go's built-in library, but I don't think it is Uncloud's main priority to duplicate that, document its usage and, not to mention, maintain it afterwards!

The simplest solution: leverage existing tooling that already does that, which happens to be the ssh tool.

Authenticating with too many keys, public keys or anything other than private keys

In some security scenarios, your SSH agent might be hardened so you don't have access to private SSH keys locally, and you rely instead on tooling like the 1Password SSH agent, YubiKey SSH keys or similar.

In the case of 1Password (and I guess other password manager SSH agents are similar), all your keys are exposed to the agent, which can cause the common "too many authentication failures" error when you connect to a server without indicating which key to use.

When using regular SSH, you can narrow down that list of SSH keys by using the IdentitiesOnly and IdentityFile directives in your SSH configuration.

But that is contradictory: you said you don't have private keys locally, so how could you be using IdentityFile?

Well my friend, you can use the public part of an SSH key so it narrows down the list of private keys it will use.

From man ssh_config about IdentityFile:

You can also specify a public key file to use the corresponding private key that is loaded in ssh-agent(1) when the private key file is not present locally.

I know... magic ✨

And while working on open source, I prefer the magic to be managed by someone other than me, so why not leverage the years of experience of the SSH developers and use it?

What this PR brings

The changes are very naive and might be repetitive: they duplicate the current SSH functionality and allow machine init and machine add to shell out and use ssh to establish a connection to the indicated node.

This of course records that into the configuration file using the ssh+cli:// URI scheme discussed in #152, so we can distinguish between the regular, built-in Go SSH implementation (ssh:// or scheme-less usage by default) and this new implementation.
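To make the scheme distinction concrete, here is a minimal sketch of how a target string could be routed to one implementation or the other. The names connKind and classifyTarget are hypothetical and are not taken from the Uncloud code base; only the ssh+cli://, ssh:// and scheme-less forms come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// connKind is a hypothetical label for the SSH implementation that handles a target.
type connKind string

const (
	kindBuiltinSSH connKind = "builtin-ssh" // built-in Go SSH implementation
	kindSSHCLI     connKind = "ssh-cli"     // shells out to the ssh binary
)

// classifyTarget splits a machine target into its connection kind and the
// remaining "user@host" destination. Illustrative only, not Uncloud's parser.
func classifyTarget(target string) (connKind, string) {
	switch {
	case strings.HasPrefix(target, "ssh+cli://"):
		return kindSSHCLI, strings.TrimPrefix(target, "ssh+cli://")
	case strings.HasPrefix(target, "ssh://"):
		return kindBuiltinSSH, strings.TrimPrefix(target, "ssh://")
	default:
		// Scheme-less targets keep the existing default behaviour.
		return kindBuiltinSSH, target
	}
}

func main() {
	for _, t := range []string{
		"ssh+cli://root@192.168.122.156",
		"ssh://root@192.168.122.156",
		"root@192.168.122.156",
	} {
		kind, dest := classifyTarget(t)
		fmt.Printf("%-32s -> %s (%s)\n", t, kind, dest)
	}
}
```

Keeping the scheme in the stored configuration means the choice is made once at init or add time and then reused for later connections.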

When I say naive implementation, it's because I've duplicated buildSSHArgs from the connector just for the purpose of building the Executor. That by itself is worth some refactoring.

I'm not proud of the branching paths in provisionOrConnectRemoteMachine or that each time sshexec.NewSSHCLIRemote is invoked, a new SSH client is spawned (instead of using a persistent one).

The performance impact was negligible when provisioning 3-5 machines, but it adds up from there.

Given that these commands aren't invoked all the time, I would say it is a small price to pay when initializing a cluster or adding machines to it.

But as I learnt over the years: first make it right (aka: work), then make it fast 😊

My last nitpick is the lack of e2e testing for this: I couldn't get the setup to work locally, so I couldn't fully test it.

I will try to report on that once I have time.

Thank you again for creating Uncloud and making it available to others!
❤️ ❤️ ❤️

@spiffytech

Using this PR, do I need to do anything special to enable my ssh config's ProxyJump with uncloud?

$ ~/bin/uc machine init -c homelab-spfy ssh+cli://root@192.168.122.156 -i ~/.ssh/id_spiffytech
Error: SSH login to remote machine ssh+cli://root@192.168.122.156: connect using private key "/home/spiffytech/.ssh/id_spiffytech": dial tcp 192.168.122.156:22: i/o timeout

~/.ssh/config

Host 192.168.122.156
  ProxyJump 192.168.2.149
  Port 22

@spiffytech

When I use --connect I get a SIGSEGV:

$ ~/bin/uc machine init -c homelab-spfy --connect ssh+cli://root@192.168.122.156 root@192.168.122.156
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1ca05a0]

goroutine 1 [running]:
github.com/psviderski/uncloud/internal/cli.(*CLI).newContextName(0xc000a8d2c0, {0x7ffcf924e716?, 0x0?})
        /workspace/internal/cli/cli.go:249 +0x40
github.com/psviderski/uncloud/internal/cli.(*CLI).initRemoteMachine(0xc000a8d2c0, {0x2a7b938, 0xc0002d12f0}, {{0x7ffcf924e716, 0xc}, {0x0, 0x0}, {{{0x0, 0xffff0ad20000}, {0xc0005a1128}}, ...}, ...})
        /workspace/internal/cli/cli.go:164 +0x6b
github.com/psviderski/uncloud/internal/cli.(*CLI).InitCluster(0x7fc1b36d5bc0?, {0x2a7b938?, 0xc0002d12f0?}, {{0x7ffcf924e716, 0xc}, {0x0, 0x0}, {{{0x0, 0xffff0ad20000}, {0xc0005a1128}}, ...}, ...})
        /workspace/internal/cli/cli.go:157 +0x65
github.com/psviderski/uncloud/cmd/uncloud/machine.initCluster({0x2a7b938, 0xc0002d12f0}, 0xc000a8d2c0, 0xc000389b40, {{0x245e714, 0x1a}, {0x0, 0x0}, {0xc000321900, 0xd}, ...})
        /workspace/cmd/uncloud/machine/init.go:137 +0x350
github.com/psviderski/uncloud/cmd/uncloud/machine.NewInitCommand.func1(0xc0006d2c08, {0xc000888eb0, 0x1, 0x241ddab?})
        /workspace/cmd/uncloud/machine/init.go:70 +0x20c
github.com/spf13/cobra.(*Command).execute(0xc0006d2c08, {0xc000888e60, 0x5, 0x5})
        /go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1015 +0xb02
github.com/spf13/cobra.(*Command).ExecuteC(0xc000672c08)
        /go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1148 +0x465
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1071
main.main()
        /workspace/cmd/uncloud/main.go:97 +0x485

@luislavena
Contributor Author

@spiffytech why are you using --connect with machine init? That SEGV is present in main too.

As for the other error, it seems you're getting a timeout, which shouldn't be the case (you should be getting the direct output of ssh -o ... instead). Can you provide a more complete example of your ssh configuration and validate that normal ssh is working?

Thank you.

@spiffytech

My ssh config has some unrelated host entries, but if I strip them out this is the whole file:

IdentityFile ~/.ssh/id_spiffytech

Host 192.168.2.149
  Port 2223

Host 192.168.122.156
  ProxyJump 192.168.2.149
  Port 22

I can run ssh root@192.168.122.156 and connect to the target machine without issue.

Here's the output of `ssh -v` if that helps you.

$ ssh -v root@192.168.122.156
OpenSSH_9.9p1, OpenSSL 3.2.6 30 Sep 2025
debug1: Reading configuration data /home/spiffytech/.ssh/config
debug1: /home/spiffytech/.ssh/config line 30: Applying options for 192.168.122.156
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/20-systemd-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/30-libvirt-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /home/spiffytech/.ssh/config
debug1: /home/spiffytech/.ssh/config line 30: Applying options for 192.168.122.156
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/20-systemd-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/30-libvirt-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Setting implicit ProxyCommand from ProxyJump: ssh -v -W '[%h]:%p' 192.168.2.149
debug1: Executing proxy command: exec ssh -v -W '[192.168.122.156]:22' 192.168.2.149
debug1: identity file /home/spiffytech/.ssh/id_spiffytech type 0
debug1: identity file /home/spiffytech/.ssh/id_spiffytech-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_9.9
OpenSSH_9.9p1, OpenSSL 3.2.6 30 Sep 2025
debug1: Reading configuration data /home/spiffytech/.ssh/config
debug1: /home/spiffytech/.ssh/config line 27: Applying options for 192.168.2.149
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/20-systemd-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/30-libvirt-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /home/spiffytech/.ssh/config
debug1: /home/spiffytech/.ssh/config line 27: Applying options for 192.168.2.149
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/20-systemd-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/30-libvirt-ssh-proxy.conf
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 192.168.2.149 [192.168.2.149] port 2223.
debug1: Connection established.
debug1: identity file /home/spiffytech/.ssh/id_spiffytech type 0
debug1: identity file /home/spiffytech/.ssh/id_spiffytech-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_9.9
debug1: Remote protocol version 2.0, remote software version OpenSSH_10.0p2 Debian-7
debug1: compat_banner: match: OpenSSH_10.0p2 Debian-7 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.2.149:2223 as 'spiffytech'
debug1: load_hostkeys: fopen /home/spiffytech/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:5tHe8QvqioUIArXr4+PDFpyuukCl/M4wDutkyr1CQJI
debug1: load_hostkeys: fopen /home/spiffytech/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: checking without port identifier
debug1: load_hostkeys: fopen /home/spiffytech/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: Host '192.168.2.149' is known and matches the ED25519 host key.
debug1: Found key in /home/spiffytech/.ssh/known_hosts:229
debug1: found matching key w/out port
debug1: check_host_key: hostkey not known or explicitly trusted: disabling UpdateHostkeys
debug1: ssh_packet_send2_wrapped: resetting send seqnr 3
debug1: rekey out after 4294967296 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: Sending SSH2_MSG_EXT_INFO
debug1: expecting SSH2_MSG_NEWKEYS
debug1: ssh_packet_read_poll2: resetting read seqnr 3
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 4294967296 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_ext_info_client_parse: server-sig-algs=<ssh-ed25519,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256>
debug1: kex_ext_info_check_ver: publickey-hostbound@openssh.com=<0>
debug1: kex_ext_info_check_ver: ping@openssh.com=<0>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_ext_info_client_parse: server-sig-algs=<ssh-ed25519,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256>
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: get_agent_identities: bound agent to hostkey
debug1: get_agent_identities: agent returned 5 keys
debug1: Will attempt key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:NSL38uIpWA628vEbrSmMvaLwgEGGBuRuOQuHGl40p7c agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:vtIszG6JXvpUow6PuV7EEE0cM87tFat0zusbRW6OOY0 agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:VjsG3Ty/hVpJ84Srab+vudQBqAwy57GEfNNhSmN608A agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:Q5DZsEi47vxgpMYwXI+llDFbpUjb/FMARGG/bZgNHC8 agent
debug1: Offering public key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
debug1: Server accepts key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
Authenticated to 192.168.2.149 ([192.168.2.149]:2223) using "publickey".
debug1: pkcs11_del_provider: called, provider_id = (null)
debug1: channel_connect_stdio_fwd: 192.168.122.156:22
debug1: channel 0: new stdio-forward [stdio-forward] (inactive timeout: 0)
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: pledge: fork
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Remote: /home/spiffytech/.ssh/authorized_keys:1: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: Remote: /home/spiffytech/.ssh/authorized_keys:1: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: Remote protocol version 2.0, remote software version OpenSSH_10.0p2 Debian-7
debug1: compat_banner: match: OpenSSH_10.0p2 Debian-7 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.122.156:22 as 'root'
debug1: load_hostkeys: fopen /home/spiffytech/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug1: kex: curve25519-sha256 need=32 dh_need=32
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:8SXXQOI/cx5SXKGfaeEK4X2Fuv/yG2zFDx9RdsX9bLw
debug1: load_hostkeys: fopen /home/spiffytech/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: Host '192.168.122.156' is known and matches the ED25519 host key.
debug1: Found key in /home/spiffytech/.ssh/known_hosts:238
debug1: ssh_packet_send2_wrapped: resetting send seqnr 3
debug1: rekey out after 4294967296 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: Sending SSH2_MSG_EXT_INFO
debug1: expecting SSH2_MSG_NEWKEYS
debug1: ssh_packet_read_poll2: resetting read seqnr 3
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 4294967296 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_ext_info_client_parse: server-sig-algs=<ssh-ed25519,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256>
debug1: kex_ext_info_check_ver: publickey-hostbound@openssh.com=<0>
debug1: kex_ext_info_check_ver: ping@openssh.com=<0>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_ext_info_client_parse: server-sig-algs=<ssh-ed25519,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256>
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: get_agent_identities: bound agent to hostkey
debug1: get_agent_identities: agent returned 5 keys
debug1: Will attempt key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:NSL38uIpWA628vEbrSmMvaLwgEGGBuRuOQuHGl40p7c agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:vtIszG6JXvpUow6PuV7EEE0cM87tFat0zusbRW6OOY0 agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:VjsG3Ty/hVpJ84Srab+vudQBqAwy57GEfNNhSmN608A agent
debug1: Will attempt key: spiffytech@spiffytop ECDSA SHA256:Q5DZsEi47vxgpMYwXI+llDFbpUjb/FMARGG/bZgNHC8 agent
debug1: Offering public key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
debug1: Server accepts key: /home/spiffytech/.ssh/id_spiffytech RSA SHA256:B8hLj7cVhtdAGY5ZQeIt119QHZNeqQ0275dsctsr4VI explicit agent
Authenticated to 192.168.122.156 (via proxy) using "publickey".
debug1: pkcs11_del_provider: called, provider_id = (null)
debug1: channel 0: new session [client-session] (inactive timeout: 0)
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: filesystem
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: client_input_hostkeys: searching /home/spiffytech/.ssh/known_hosts for 192.168.122.156 / (none)
debug1: client_input_hostkeys: searching /home/spiffytech/.ssh/known_hosts2 for 192.168.122.156 / (none)
debug1: client_input_hostkeys: hostkeys file /home/spiffytech/.ssh/known_hosts2 does not exist
debug1: client_input_hostkeys: host key found matching a different name/address, skipping UserKnownHostsFile update
debug1: Remote: /root/.ssh/authorized_keys:1: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: Remote: /root/.ssh/authorized_keys:1: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: pledge: fork
Linux homelab-spfy-1 6.12.48+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.48-1 (2025-09-20) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun Nov  9 19:58:53 2025 from 192.168.122.1

Owner

@psviderski left a comment

Looks great! 👍

Just one tiny comment about the deleted examples for the machine init command:

# Initialise without Caddy (no reverse proxy) and without an automatically managed domain name (xxxxxx.cluster.uncloud.run).
# You can deploy Caddy with 'uc caddy deploy' and reserve a domain with 'uc dns reserve' later.
uc machine init root@<your-server-ip> --no-caddy --no-dns`,
Owner

Was removing these examples intentional? It would be nice to keep them I think.

Contributor Author

Nope, copy & pasta mistake: I copied the Long reference from add and removed this one by mistake.

I believe I need to block git commit and git push after a certain hour of the day 😞

Restoring it in the next commit, apologies! 😅

@psviderski
Owner

@spiffytech

$ ~/bin/uc machine init -c homelab-spfy ssh+cli://root@192.168.122.156 -i ~/.ssh/id_spiffytech
Error: SSH login to remote machine ssh+cli://root@192.168.122.156: connect using private key "/home/spiffytech/.ssh/id_spiffytech": dial tcp 192.168.122.156:22: i/o timeout

The error Error: SSH login to remote machine: ... suggests that you're likely not running the code in this PR. Can you please double-check you checked out this PR before building the uncloud/uc binary?

@spiffytech

D'oh! I checked out the repo but forgot to switch to the branch.

Yeah, now I can connect to my server 🎉

I do get this error at the end of 'machine init', not sure if it's related to anything going on here.

Error: inspect machine: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read |0: file already closed"
Full output
$ ~/bin/uc machine init -c homelab-spfy ssh+cli://root@192.168.122.156
Downloading Uncloud install script: https://raw.githubusercontent.com/psviderski/uncloud/refs/heads/main/scripts/install.sh
⏳ Running Uncloud install script...
⏳ Installing Docker...
# Executing docker install script, commit: e3bd92d5b36b59b39661e4e6d05c786db9bb3ad7
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/debian/gpg" -o /etc/apt/keyrings/docker.asc
+ sh -c chmod a+r /etc/apt/keyrings/docker.asc
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian trixie stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get -qq update >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get -y -qq install docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-plugin docker-model-plugin >/dev/null
Client: Docker Engine - Community
 Version:           28.5.2
 API version:       1.51
 Go version:        go1.25.3
 Git commit:        ecc6942
 Built:             Wed Nov  5 14:43:33 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.5.2
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.25.3
  Git commit:       89c5e8f
  Built:            Wed Nov  5 14:43:33 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.29
  GitCommit:        442cb34bda9a6a0fed82a2ca7cade05c5c749582
 runc:
  Version:          1.3.3
  GitCommit:        v1.3.3-0-gd842d771
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

================================================================================

To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.


To run the Docker daemon as a fully privileged service, but granting non-root
users access, refer to https://docs.docker.com/go/daemon-access/

WARNING: Access to the remote API on a privileged Docker daemon is equivalent
         to root access on the host. Refer to the 'Docker daemon attack surface'
         documentation for details: https://docs.docker.com/go/attack-surface/

================================================================================

⏳ Configuring Docker daemon (/etc/docker/daemon.json) to optimise it for Uncloud...
+ sh -c docker version
✓ Docker installed and configured successfully.
✓ Linux user and group 'uncloud' created.
⏳ Installing Uncloud binaries...
⏳ Downloading uncloudd binary: https://github.com/psviderski/uncloud/releases/latest/download/uncloudd_linux_amd64.tar.gz
✓ uncloudd binary installed: /usr/local/bin/uncloudd
⏳ Downloading uninstall script: https://raw.githubusercontent.com/psviderski/uncloud/refs/heads/main/scripts/uninstall.sh
✓ uncloud-uninstall script installed: /usr/local/bin/uncloud-uninstall
✓ Systemd unit file created: /etc/systemd/system/uncloud.service
Created symlink '/etc/systemd/system/multi-user.target.wants/uncloud.service' → '/etc/systemd/system/uncloud.service'.
⏳ Downloading uncloud-corrosion binary: https://github.com/psviderski/corrosion/releases/download/v0.2.2/corrosion-x86_64-unknown-linux-gnu.tar.gz
✓ uncloud-corrosion binary installed: /usr/local/bin/uncloud-corrosion
✓ Systemd unit file created: /etc/systemd/system/uncloud-corrosion.service
⏳ Starting Uncloud machine daemon (uncloud.service)...
✓ Uncloud machine daemon started.
✓ Uncloud installed on the machine successfully! 🎉
Error: inspect machine: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read |0: file already closed"

@psviderski
Owner

Thank you for providing details. I haven't seen this error before, so it's likely related to the new ssh+cli connection type.
Did you get that error right after "✓ Uncloud installed on the machine successfully! 🎉" or did you notice any delay before the command failed with an error?
Can you try running the same init again (you can confirm the reset once it prompts) to see if this error persists?

@spiffytech

Yep, if I init the same machine again, I continue getting the error. There is an ~18 second delay in between "Uncloud installed" and the error.

(I was not prompted for anything when I reran init).

@psviderski
Owner

This looks like the SSH connection terminates before the RPC call over it completes. Maybe we need to configure something to keep the connection alive, tweak timeouts, etc. But it would be good to first reproduce this consistently and clearly understand why it happens.

@psviderski
Owner

Ah, I was able to reproduce this locally. The ssh+cli targets also require a change in the uncloudd daemon on the machines; that change is in main but hasn't been released yet.
You can build the daemon part as well:

go build -o uncloudd ./cmd/uncloudd
# set the required arch if needed:
GOOS=linux GOARCH=amd64 go build -o uncloudd ./cmd/uncloudd

and then scp it to /usr/local/bin/uncloudd on your server and restart the service with systemctl restart uncloud. Then run machine init again, which will use the upgraded daemon (and not try to override it).

Or you can just wait ~a day. I'm working on the 0.14 release right now.

@spiffytech

Confirmed, that gets the init to succeed.

I did still hit the same error when performing a reset. Once the command timed out, I inited again and it was fine.

Logs
$ ~/bin/uc machine init ssh+cli://root@192.168.122.156 --no-dns --public-ip none
Downloading Uncloud install script: https://raw.githubusercontent.com/psviderski/uncloud/refs/heads/main/scripts/install.sh
⏳ Running Uncloud install script...
✓ Docker is already installed.
Client: Docker Engine - Community
 Version:           28.5.2
 API version:       1.51
 Go version:        go1.25.3
 Git commit:        ecc6942
 Built:             Wed Nov  5 14:43:33 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.5.2
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.25.3
  Git commit:       89c5e8f
  Built:            Wed Nov  5 14:43:33 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.29
  GitCommit:        442cb34bda9a6a0fed82a2ca7cade05c5c749582
 runc:
  Version:          1.3.3
  GitCommit:        v1.3.3-0-gd842d771
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
✓ Linux user 'uncloud' already exists.
✓ uncloudd binary is already installed.
✓ Systemd unit file created: /etc/systemd/system/uncloud.service
✓ uncloud-corrosion binary is already installed.
✓ Systemd unit file created: /etc/systemd/system/uncloud-corrosion.service
⏳ Starting Uncloud machine daemon (uncloud.service)...
✓ Uncloud machine daemon started.
✓ Uncloud installed on the machine successfully! 🎉
The remote machine is already initialised as a cluster member. Do you want to reset it first?
This will:
- Remove all service containers from the machine
- Reset the machine to the uninitialised state

Choose [y/N]: y
Chose: Yes!

Resetting the remote machine...
Error: wait for machine to be ready after reset: inspect machine: rpc error: code = Unavailable desc = connection error: desc = "transport: failed to write client preface: write |1: file already closed"

Fixes copy & pasta mistake when documenting the new ssh+cli connection
methods.
@luislavena
Contributor Author

I did still hit the same error when performing a reset. Once the command timed out, I inited again and it was fine.

Taking a look at this now 👀

@luislavena
Contributor Author

I did still hit the same error when performing a reset.

Found the issue, and I might need assistance from @psviderski on how to proceed (or on whether I missed something).

When doing Reset() (promptResetMachine), we already have a reference to the existing machineClient, but the reset stops the daemon.

Then we wait for the machine in waitMachineReady using the same machineClient, but the connection to the daemon is already severed, because the dial-stdio connection is gone.

At this point we might need to re-establish the connection.

We cannot remove the wait because we depend on systemd to restart the service for us, and that does not happen instantly.

Right now, I'm trying the following change:

diff --git a/internal/cli/cli.go b/internal/cli/cli.go
index 9ef6bd7..8d83a01 100644
--- a/internal/cli/cli.go
+++ b/internal/cli/cli.go
@@ -186,6 +186,14 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
 		if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
 			return nil, err
 		}
+		// After a reset, the daemon restarts and closes dial-stdio connection,
+		// so we need to reconnect.
+		machineClient.Close()
+		fmt.Println("Reconnecting to machine after reset...")
+		machineClient, err = provisionOrConnectRemoteMachine(ctx, opts.RemoteMachine, true, opts.Version)
+		if err != nil {
+			return nil, fmt.Errorf("reconnect to machine after reset: %w", err)
+		}
 	}
 
 	// Check machine meets all necessary system requirements before proceeding.
diff --git a/internal/cli/machine.go b/internal/cli/machine.go
index 8c57408..3dc9fe4 100644
--- a/internal/cli/machine.go
+++ b/internal/cli/machine.go
@@ -114,10 +114,6 @@ func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) err
 		return fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
 			"on the remote machine to fully uninstall Uncloud from it", err)
 	}
-	fmt.Println("Resetting the remote machine...")
-	if err := waitMachineReady(ctx, machineClient, 1*time.Minute); err != nil {
-		return fmt.Errorf("wait for machine to be ready after reset: %w", err)
-	}
 
 	return nil
 }

But it will fail quickly:

$ ./uncloud --uncloud-config ./blatta.yaml machine init ssh+cli://provision@blatta11 --public-ip none --no-dns --name blatta11 --no-install
The remote machine is already initialised as a cluster member. Do you want to reset it first?
This will:
- Remove all service containers from the machine
- Reset the machine to the uninitialised state

Choose [y/N]: y
Chose: Yes!

Reconnecting to machine after reset...
Error: check machine prerequisites: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: command [ssh -o ConnectTimeout=5 provision@blatta11 uncloudd dial-stdio] has exited with exit status 1, make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Error: connect to socket \"/run/uncloud/uncloud.sock\": dial unix /run/uncloud/uncloud.sock: connect: no such file or directory\n"

Running the same command 5 seconds later works:

$ ./uncloud --uncloud-config ./blatta.yaml machine init ssh+cli://provision@blatta11 --public-ip none --no-dns --name blatta11 --no-install
Cluster initialised with machine 'blatta11' and saved as context 'default' in your local config (./blatta.yaml)
Current cluster context is now 'default'.
Waiting for the machine to be ready...

[+] Deploying service caddy 1/1
 ✔ Container caddy-jh4l on blatta11  Started                                                                                                                             1.2s

Skipping DNS records update as no cluster domain is reserved (see 'uc dns').

So perhaps waitMachineReady needs to be updated to retry the SSH dial-stdio connection itself and handle its errors?

@luislavena
Contributor Author

luislavena commented Nov 17, 2025

Here is a quick and dirty patch that tries to get machine init and machine add working after the uncloud service is restarted on the remote machine:

(Warning: half generated by Claude, half modified by me while remembering Go, so any bugs are most likely on me.)

diff --git a/internal/cli/cli.go b/internal/cli/cli.go
index 9ef6bd7..989a6d6 100644
--- a/internal/cli/cli.go
+++ b/internal/cli/cli.go
@@ -172,7 +172,7 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
 	}
 	// Ensure machineClient is closed on error.
 	defer func() {
-		if err != nil {
+		if err != nil && machineClient != nil {
 			machineClient.Close()
 		}
 	}()
@@ -183,7 +183,8 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
 		return nil, fmt.Errorf("inspect machine: %w", err)
 	}
 	if minfo.Id != "" {
-		if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
+		machineClient, err = promptResetMachine(ctx, machineClient, opts.RemoteMachine)
+		if err != nil {
 			return nil, err
 		}
 	}
@@ -301,7 +302,7 @@ func (cli *CLI) AddMachine(ctx context.Context, opts AddMachineOptions) (*client
 		return nil, nil, err
 	}
 	defer func() {
-		if err != nil {
+		if err != nil && machineClient != nil {
 			machineClient.Close()
 		}
 	}()
@@ -323,7 +324,8 @@ func (cli *CLI) AddMachine(ctx context.Context, opts AddMachineOptions) (*client
 			return nil, nil, fmt.Errorf("machine is already a member of this cluster (%s)", minfo.Name)
 		}
 
-		if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
+		machineClient, err = promptResetMachine(ctx, machineClient, opts.RemoteMachine)
+		if err != nil {
 			return nil, nil, err
 		}
 	}
diff --git a/internal/cli/machine.go b/internal/cli/machine.go
index 8c57408..4a3945b 100644
--- a/internal/cli/machine.go
+++ b/internal/cli/machine.go
@@ -11,6 +11,7 @@ import (
 	"github.com/charmbracelet/huh"
 	"github.com/psviderski/uncloud/internal/machine/api/pb"
 	"github.com/psviderski/uncloud/internal/sshexec"
+	"github.com/psviderski/uncloud/pkg/client"
 	"google.golang.org/protobuf/types/known/emptypb"
 )
 
@@ -86,7 +87,9 @@ func provisionMachine(ctx context.Context, exec sshexec.Executor, version string
 	return nil
 }
 
-func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) error {
+func promptResetMachine(
+	ctx context.Context, oldClient *client.Client, remoteMachine *RemoteMachine,
+) (*client.Client, error) {
 	var confirm bool
 	form := huh.NewForm(
 		huh.NewGroup(
@@ -103,23 +106,57 @@ func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) err
 		),
 	).WithAccessible(true)
 	if err := form.Run(); err != nil {
-		return fmt.Errorf("prompt user to confirm: %w", err)
+		return nil, fmt.Errorf("prompt user to confirm: %w", err)
 	}
 
 	if !confirm {
-		return fmt.Errorf("remote machine is already initialised as a cluster member")
+		return nil, fmt.Errorf("remote machine is already initialised as a cluster member")
 	}
 
-	if _, err := machineClient.Reset(ctx, &pb.ResetRequest{}); err != nil {
-		return fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
+	if _, err := oldClient.Reset(ctx, &pb.ResetRequest{}); err != nil {
+		return nil, fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
 			"on the remote machine to fully uninstall Uncloud from it", err)
 	}
+
+	// Close the old connection since the reset will stop uncloudd, making the connection invalid
+	// (especially for SSH CLI connections where the dial-stdio pipe closes).
+	oldClient.Close()
+
 	fmt.Println("Resetting the remote machine...")
-	if err := waitMachineReady(ctx, machineClient, 1*time.Minute); err != nil {
-		return fmt.Errorf("wait for machine to be ready after reset: %w", err)
+
+	// Reconnect to the remote machine after reset with retry logic. We need to retry the
+	// reconnection itself (not just the Inspect call) because uncloudd might still be shutting down.
+	// If we connect during shutdown, we get a connection to a dying process that fails immediately.
+	boff := backoff.WithContext(backoff.NewExponentialBackOff(
+		backoff.WithMaxInterval(1*time.Second),
+		backoff.WithMaxElapsedTime(1*time.Minute),
+	), ctx)
+
+	var newClient *client.Client
+	reconnect := func() error {
+		// Attempt to reconnect. The skipInstall parameter is true because the machine is already provisioned.
+		var err error
+		newClient, err = provisionOrConnectRemoteMachine(ctx, remoteMachine, true, "")
+		if err != nil {
+			return fmt.Errorf("connect to remote machine: %w", err)
+		}
+
+		// Verify the connection works by attempting an Inspect call.
+		// This ensures uncloudd has fully restarted and is serving requests.
+		if _, err := newClient.Inspect(ctx, &emptypb.Empty{}); err != nil {
+			newClient.Close()
+			newClient = nil
+			return fmt.Errorf("verify connection: %w", err)
+		}
+
+		return nil
 	}
 
-	return nil
+	if err := backoff.Retry(reconnect, boff); err != nil {
+		return nil, fmt.Errorf("reconnect to remote machine after reset: %w", err)
+	}
+
+	return newClient, nil
 }
 
 // waitMachineReady waits for the machine to be ready to serve requests.

But this renders waitMachineReady obsolete: the new code retries the connection itself, not just the Inspect call.

I saw waitMachineReady being used by ucind and when waiting for cluster state, so I'm not sure if we should keep both 🤔

@psviderski
Owner

Let's merge this PR and discuss the fix for machine init/add in a new one. It's quite hard to review the patches above without being able to explore the code around them.

Regarding retries and reuse of the machineClient, it seems to be working with the SSHConnector, right? I think this is because we configure the grpc client with a dialer function that actually tries to create a new connection using DialContext every time it's called:

grpc.WithContextDialer(
	func(ctx context.Context, addr string) (net.Conn, error) {
		addr = strings.TrimPrefix(addr, "unix://")
		conn, dErr := c.client.DialContext(ctx, "unix", addr)
		if dErr != nil {
			return nil, fmt.Errorf(
				"connect to machine API socket '%s' through SSH tunnel (is uncloud.service running "+
					"on the remote machine and does the SSH user '%s' have permissions to access the socket?):"+
					" %w",
				addr, c.client.User(), dErr,
			)
		}
		return conn, nil
	},
),

While SSHCLIConnector tries to connect only once and then returns the same connection:

grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
	return c.conn, nil
}),
I think we can update the grpc.WithContextDialer function to also try to create a new connection with ssh ... uncloudd dial-stdio so that the existing code will work the same way it works with SSHConnector if I'm not missing anything.
