-
-
Notifications
You must be signed in to change notification settings - Fork 116
Support SSH CLI on machine init and add commands #173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support SSH CLI on machine init and add commands #173
Conversation
Implement's Executor interface (Run, Stream, Close)
Allow us to differentiate regular ssh from ssh+cli.
|
Using this PR, do I need do do anything special enable my ssh config's ~/.ssh/config |
|
When I use |
|
@spiffytech why you're using For the other error, seems you're getting a timeout, which shouldn't be the case (instead you should be getting the direct output of Thank you. |
|
My ssh config has some unrelated host entries, but if I strip them out this is the whole file: I can run Here's the output of `ssh -v` if that helps you. |
psviderski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! 👍
Just one tiny comment about deleted examples for machine init command
| # Initialise without Caddy (no reverse proxy) and without an automatically managed domain name (xxxxxx.cluster.uncloud.run). | ||
| # You can deploy Caddy with 'uc caddy deploy' and reserve a domain with 'uc dns reserve' later. | ||
| uc machine init root@<your-server-ip> --no-caddy --no-dns`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was removing these examples intentional? It would be nice to keep them I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, copy & pasta mistake, I copied from add the Long reference and removed this one by mistake.
I believe I need to put some blocking to git commit and git push after certain hour of the day 😞
Restoring it in the next commit, apologies! 😅
The error |
|
D'oh! I checked out the repo but forgot to switch to the branch. Yeah, now I can connect to my server 🎉 I do get this error at the end of 'machine init', not sure if it's related to anything going on here. Full output |
|
Thank you for providing details. I haven't seen this error before so likely related to the new ssh+cli connection type. |
|
Yep, if I init the same machine again, I continue getting the error. There is an ~18 second delay in between "Uncloud installed" and the error. (I was not prompted for anything when I reran init). |
|
This looks like the SSH connection terminates before the RPC call over it completes. Maybe we need to configure something to keep the connection alive, tweak timeouts, etc. But would be good to first try to consistently reproduce this and clearly understand why this happens. |
|
Ah, I was able to reproduce this locally. The and then scp to your server to Or you can just wait ~a day. I'm working on the 0.14 release right now |
|
Confirmed, that gets the init to succeed. I did still hit the same error when performing a reset. Once the command timed out, I inited again and it was fine. Logs |
Fixes copy & pasta mistake when documenting the new ssh+cli connection methods.
Taking a look to this now 👀 |
Found the issue, and I might need assistance @psviderski on how to proceed (or if I missed something) When doing Then we wait for the machine at At this point we might need to re-establish the connection. We cannot remove the waiting because we depend on systemd to restart the service for us, and that does not happen instantly, so we need some wait. Right now, trying the following change: diff --git a/internal/cli/cli.go b/internal/cli/cli.go
index 9ef6bd7..8d83a01 100644
--- a/internal/cli/cli.go
+++ b/internal/cli/cli.go
@@ -186,6 +186,14 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
return nil, err
}
+ // After a reset, the daemon restarts and closes dial-stdio connection,
+ // so we need to reconnect.
+ machineClient.Close()
+ fmt.Println("Reconnecting to machine after reset...")
+ machineClient, err = provisionOrConnectRemoteMachine(ctx, opts.RemoteMachine, true, opts.Version)
+ if err != nil {
+ return nil, fmt.Errorf("reconnect to machine after reset: %w", err)
+ }
}
// Check machine meets all necessary system requirements before proceeding.
diff --git a/internal/cli/machine.go b/internal/cli/machine.go
index 8c57408..3dc9fe4 100644
--- a/internal/cli/machine.go
+++ b/internal/cli/machine.go
@@ -114,10 +114,6 @@ func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) err
return fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
"on the remote machine to fully uninstall Uncloud from it", err)
}
- fmt.Println("Resetting the remote machine...")
- if err := waitMachineReady(ctx, machineClient, 1*time.Minute); err != nil {
- return fmt.Errorf("wait for machine to be ready after reset: %w", err)
- }
return nil
}But it will fail quickly: And 5 seconds later it works: So perhaps |
|
Quick and dirty patch just trying to get (warning: half Claude generated, half modified by me remembering Go, so any bug of this is most likely on me). diff --git a/internal/cli/cli.go b/internal/cli/cli.go
index 9ef6bd7..989a6d6 100644
--- a/internal/cli/cli.go
+++ b/internal/cli/cli.go
@@ -172,7 +172,7 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
}
// Ensure machineClient is closed on error.
defer func() {
- if err != nil {
+ if err != nil && machineClient != nil {
machineClient.Close()
}
}()
@@ -183,7 +183,8 @@ func (cli *CLI) initRemoteMachine(ctx context.Context, opts InitClusterOptions)
return nil, fmt.Errorf("inspect machine: %w", err)
}
if minfo.Id != "" {
- if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
+ machineClient, err = promptResetMachine(ctx, machineClient, opts.RemoteMachine)
+ if err != nil {
return nil, err
}
}
@@ -301,7 +302,7 @@ func (cli *CLI) AddMachine(ctx context.Context, opts AddMachineOptions) (*client
return nil, nil, err
}
defer func() {
- if err != nil {
+ if err != nil && machineClient != nil {
machineClient.Close()
}
}()
@@ -323,7 +324,8 @@ func (cli *CLI) AddMachine(ctx context.Context, opts AddMachineOptions) (*client
return nil, nil, fmt.Errorf("machine is already a member of this cluster (%s)", minfo.Name)
}
- if err = promptResetMachine(ctx, machineClient.MachineClient); err != nil {
+ machineClient, err = promptResetMachine(ctx, machineClient, opts.RemoteMachine)
+ if err != nil {
return nil, nil, err
}
}
diff --git a/internal/cli/machine.go b/internal/cli/machine.go
index 8c57408..4a3945b 100644
--- a/internal/cli/machine.go
+++ b/internal/cli/machine.go
@@ -11,6 +11,7 @@ import (
"github.com/charmbracelet/huh"
"github.com/psviderski/uncloud/internal/machine/api/pb"
"github.com/psviderski/uncloud/internal/sshexec"
+ "github.com/psviderski/uncloud/pkg/client"
"google.golang.org/protobuf/types/known/emptypb"
)
@@ -86,7 +87,9 @@ func provisionMachine(ctx context.Context, exec sshexec.Executor, version string
return nil
}
-func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) error {
+func promptResetMachine(
+ ctx context.Context, oldClient *client.Client, remoteMachine *RemoteMachine,
+) (*client.Client, error) {
var confirm bool
form := huh.NewForm(
huh.NewGroup(
@@ -103,23 +106,57 @@ func promptResetMachine(ctx context.Context, machineClient pb.MachineClient) err
),
).WithAccessible(true)
if err := form.Run(); err != nil {
- return fmt.Errorf("prompt user to confirm: %w", err)
+ return nil, fmt.Errorf("prompt user to confirm: %w", err)
}
if !confirm {
- return fmt.Errorf("remote machine is already initialised as a cluster member")
+ return nil, fmt.Errorf("remote machine is already initialised as a cluster member")
}
- if _, err := machineClient.Reset(ctx, &pb.ResetRequest{}); err != nil {
- return fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
+ if _, err := oldClient.Reset(ctx, &pb.ResetRequest{}); err != nil {
+ return nil, fmt.Errorf("reset remote machine: %w. You can also manually run 'uncloud-uninstall' "+
"on the remote machine to fully uninstall Uncloud from it", err)
}
+
+ // Close the old connection since the reset will stop uncloudd, making the connection invalid
+ // (especially for SSH CLI connections where the dial-stdio pipe closes).
+ oldClient.Close()
+
fmt.Println("Resetting the remote machine...")
- if err := waitMachineReady(ctx, machineClient, 1*time.Minute); err != nil {
- return fmt.Errorf("wait for machine to be ready after reset: %w", err)
+
+ // Reconnect to the remote machine after reset with retry logic. We need to retry the
+ // reconnection itself (not just the Inspect call) because uncloudd might still be shutting down.
+ // If we connect during shutdown, we get a connection to a dying process that fails immediately.
+ boff := backoff.WithContext(backoff.NewExponentialBackOff(
+ backoff.WithMaxInterval(1*time.Second),
+ backoff.WithMaxElapsedTime(1*time.Minute),
+ ), ctx)
+
+ var newClient *client.Client
+ reconnect := func() error {
+ // Attempt to reconnect. The skipInstall parameter is true because the machine is already provisioned.
+ var err error
+ newClient, err = provisionOrConnectRemoteMachine(ctx, remoteMachine, true, "")
+ if err != nil {
+ return fmt.Errorf("connect to remote machine: %w", err)
+ }
+
+ // Verify the connection works by attempting an Inspect call.
+ // This ensures uncloudd has fully restarted and is serving requests.
+ if _, err := newClient.Inspect(ctx, &emptypb.Empty{}); err != nil {
+ newClient.Close()
+ newClient = nil
+ return fmt.Errorf("verify connection: %w", err)
+ }
+
+ return nil
}
- return nil
+ if err := backoff.Retry(reconnect, boff); err != nil {
+ return nil, fmt.Errorf("reconnect to remote machine after reset: %w", err)
+ }
+
+ return newClient, nil
}
// waitMachineReady waits for the machine to be ready to serve requests.but this renders I saw |
|
Let's merge this PR and discuss the fix for Regarding retries and reuse of the uncloud/pkg/client/connector/ssh.go Lines 62 to 76 in 2969b40
While uncloud/pkg/client/connector/sshcli.go Lines 93 to 95 in 48d2239
I think we can update the grpc.WithContextDialer function to also try to create a new connection with ssh ... uncloudd dial-stdio so that the existing code will work the same way it works with SSHConnector if I'm not missing anything.
|
Hello!
This is a follow up to #131 and #152 that completes the usage of
sshCLI tool across all the possible connections.Well, bear with me as I rant over the beauty of dealing with legacy and complicated network setups and security scenarios 😅
Sometimes machines that you need to SSH into might not have a direct connection to you: they might be within private networks or require different tricks to access them.
Non-publicly exposed machines
Certain nodes on a network might only have private IP addresses, but you can connect to them using SSH's built-in
ProxyJumpsupport, allowing you to use a machine that is exposed to the outside world and can talk to the internal network (also known as jump boxes or bastion servers).That machine itself might use a different SSH key than the one you need to SSH into your node, because security! 🤡
You can say you could implement all this with Go's built-in library, but I don't think is Uncloud's main priority to duplicate that, document its usage and not to mention maintain it afterwards!
The simplest solution: leverage on existing tooling that does that, which happens to be
sshtool.Authenticating with too many keys, public keys or anything other than private keys
In some security scenarios, your SSH agent might be hardened so you don't have access to private SSH keys locally, but instead leverage on tooling like 1Password SSH agent, Yubikey SSH keys or similar.
In the case of 1Password (and I guess similar to other password manager SSH agents), all your keys are exposed to the agent, which can cause the common too many authentication failures when trying to connect to a server and not indicating which key to use.
When using regular SSH, you can narrow down that list of SSH keys by using
IdentitiesOnlyandIdentityFileinstructions in your SSH configuration.Well my friend, you can use the public part of an SSH key so it narrows down the list of private keys it will use.
From
man ssh_configaboutIdentityFile:I know... magic ✨
And while working on open-source, I prefer the magic be managed by someone else than me, so, why not leverage on years of experience from the SSH developers and use it?
What this PR brings
The changes are very naive and might be repetitive, just duplicating current SSH functionality and allow
machine initandmachine addto shell out and usesshto establish connection to the indicated node.This of course records that into the configuration file using
ssh+cli://URI scheme discussed in #152 so we can make a distinction between regular, built-in Go SSH implementation (ssh://or scheme-less usage by default) and this new implementation.When I say naive implementation is because I've duplicated
buildSSHArgsbetween the connector usage just for the purposes of building the Executor. That by itself is worth some refactoring.I'm not proud of the branching paths in
provisionOrConnectRemoteMachineor that each timesshexec.NewSSHCLIRemoteis invoked, a new SSH client is spawned (instead of using a persistent one).The performance impact was negligible when provisioning 3-5 machines, but it adds up from there.
Given that these commands aren't invoked all the time, I would say it will be an small price to pay when initializing a cluster or adding machines to it.
But as I learnt over the years: first make it right (aka: work), then make it fast 😊
My last nitpick was lack of e2e testing for this, but since I couldn't get the setup to work locally, I couldn't fully test it.
I will try to report on that once I have time.
Thank you again for creating Uncloud and making it available to others!
❤️ ❤️ ❤️