Description
I have a cluster, currently with two public machines:
- thebe (IP: 93.xxx)
- amalthea (IP: 65.xxx)
I've just set up a server at home, behind a NAT, and want to add it:
- himalia (external IP: 174.xxx, internal IP: 168.xxx)
(Note: I worked through some of these issues last night, and I'm trying to document my steps here as best as I can from notes, so some details might be slightly off from what I actually did.)
I added himalia with:
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy
In case it's relevant, I had manually installed docker (apt install docker.io) on himalia before adding it to uncloud.
That ran without issues. I noticed in uc machine ls that the state was listed as Down and that it had a mix of internal and external (NAT) IPs, both v4 and v6, in its WireGuard endpoints. It looked roughly like this:
himalia Down 10.210.2.1/24 174.xxx 192.xxx:51820, [fdc3:xxx]:51820, [2600:xxx]:51820, [2600:xxx:yyy]:51820, 174.xxx:51820 4665xxx
I then tried doing something with it (uc volume create, iirc), which failed. uc connected through one of the public servers and returned an error about not being able to connect to himalia, I think.
I then started debugging on the instances, looking at the data under /var/lib/uncloud on the various machines. In corrosion/store.db, I believe thebe/amalthea had an entry for himalia in their machines table, but himalia's table only had itself.
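For reference, this is roughly how I compared them on each machine (I just dumped the whole machines table rather than guessing at its columns; the table name is what I remember seeing):
sqlite3 /var/lib/uncloud/corrosion/store.db 'SELECT * FROM machines;'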
I then started looking at the wireguard network, and it was similar:
thebe:
interface: uncloud
public key: 1Wgi...
private key: (hidden)
listening port: 51820
peer: 9dKc...
endpoint: 65.xxx:51820
allowed ips: fdcc:...:c8b1/128, 10.210.1.0/24
latest handshake: 1 minute, 7 seconds ago
transfer: 964.68 MiB received, 948.41 MiB sent
persistent keepalive: every 25 seconds
peer: FvIQ...
endpoint: [2600:...:3eb4]:51820
allowed ips: fdcc:...:452d/128, 10.210.2.0/24
latest handshake: 2 hours, 46 minutes, 9 seconds ago
transfer: 33.36 KiB received, 209.48 KiB sent
persistent keepalive: every 25 seconds
amalthea:
interface: uncloud
public key: 9dKc...
private key: (hidden)
listening port: 51820
peer: 1Wgi...
endpoint: 93.xxx:51820
allowed ips: fdcc:...:9275/128, 10.210.0.0/24
latest handshake: 1 minute, 29 seconds ago
transfer: 948.63 MiB received, 966.63 MiB sent
persistent keepalive: every 25 seconds
peer: FvIQ...
endpoint: [2600:...:fef9]:51820
allowed ips: fdcc:...:452d/128, 10.210.2.0/24
latest handshake: 2 hours, 48 minutes, 33 seconds ago
transfer: 21.02 KiB received, 220.53 KiB sent
persistent keepalive: every 25 seconds
himalia:
interface: uncloud
public key: FvIQ...
private key: (hidden)
listening port: 51820
So thebe/amalthea knew about himalia, but himalia didn't know about them, or something like that? I also saw that the himalia endpoint reported by thebe/amalthea in WireGuard was cycling through the various endpoints I saw in uc machine ls.
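This is roughly how I watched the endpoint cycling, run on thebe/amalthea (interface name taken from the output above):
watch -n 2 wg show uncloud endpoints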
I looked at my router and realized it did not have port triggering enabled by default, so I added a fixed UDP port forward for 51820 to himalia's (assigned) internal 192.xxx IP.
It didn't recover on its own, so I did a uc machine rm and re-added it. This second time, I also added a --public-ip none option:
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy --public-ip none
But everything looked pretty much the same after that. I also tried again, this time deleting the /var/lib/uncloud directory on himalia after removing the machine and before re-adding it, but kept getting the same result.
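Reconstructed from notes, the remove/re-add cycle looked roughly like this (the exact uc machine rm syntax is from memory, and the wipe was run on himalia over ssh):
uc machine rm himalia
ssh admin@192.xxx 'sudo systemctl stop uncloud && sudo rm -rf /var/lib/uncloud'
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy --public-ip none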
Then I started digging into the networking more. To confirm the port forwarding, I disabled himalia's wireguard interface and listened on 51820 directly:
nc -luv -p 51820
Bound on 0.0.0.0 51820
Connection received on 65.xxx 51820
��r��;�ı��"P"��,���K���X��N��N;+����u�V��%�u�K{��7�...
That was amalthea. I ran it a couple of times and also saw connections and data from thebe (93.xxx), so the port forward was passing traffic to himalia correctly.
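If anyone wants to reproduce that check without relying on the daemons' own handshake attempts, sending a UDP packet from one of the public machines should show up in the listener above (174.xxx being himalia's external IP):
echo ping | nc -u -w1 174.xxx 51820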
I then started trying to modify himalia's /var/lib/uncloud/machine.json directly. It initially looked like this:
{
"ID": "4665...",
"Name": "himalia",
"Network": {
"Subnet": "10.210.2.0/24",
"ManagementIP": "fdcc:...:a5f",
"PrivateKey": "...",
"PublicKey": "..."
}
}
While thebe/amalthea had a "Peers" section listing the other public machine plus himalia, like this:
"Peers": [
{
"Subnet": "10.210.1.0/24",
"ManagementIP": "fdcc:...:c8b1",
"Endpoint": "65....:51820",
"AllEndpoints": [
"65....:51820"
],
"PublicKey": "..."
},
{
"Subnet": "10.210.2.0/24",
"ManagementIP": "fdcc:...:a5f",
"Endpoint": "[fdc3:...:3eb4]:51820",
"AllEndpoints": [
"192.xxx:51820",
"[fdc3:...:b94e:...:3eb4]:51820",
"[2600:...::fef9]:51820",
"[2600:...:429c:...:3eb4]:51820",
"174.xxx:51820"
],
"PublicKey": "..."
}
]
So I shut down the uncloud daemon (systemctl stop uncloud) and then edited himalia's machine.json to add a "Peers" section with the entries for thebe/amalthea, copied from those servers.
Then I restarted the daemon, and it immediately overwrote that file back to the original (peer-less) version.
Thinking it might actually be reading that from corrosion/store.db rather than the JSON file, I tried the same thing again but also copied store.db from thebe (using sqlite3_rsync). That had no visible effect either: store.db kept the copied entries, but machine.json was wiped again and the WireGuard interface still had no peers.
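For reference, the copy was roughly this, run on himalia while re-doing the machine.json edit in the same window (the remote user and paths are from memory):
systemctl stop uncloud
sqlite3_rsync admin@thebe:/var/lib/uncloud/corrosion/store.db /var/lib/uncloud/corrosion/store.db
systemctl start uncloud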
This was a snippet from the uncloud.service logs around that time:
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Configured WireGuard interface. name=uncloud
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Updated addresses of the WireGuard interface. name=uncloud add>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Updated routes to peers via the WireGuard interface. name=uncl>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Subscribed to container changes in the cluster to keep DNS rec>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Subscribed to container changes in the cluster to generate Cad>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG DNS records updated. component=dns-resolver services=4 contain>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Caddy is not running on this machine, skipping configuration l>
The "Removed route to peer" bit made me think it might be using the wireguard interface itself as the canonical source at that point, so I then removed the uncloud interface and created a new one manually with the proper peer configuration.
[Interface]
PrivateKey = <hidden>
ListenPort = 51820
[Peer]
# thebe
PublicKey = 1Wgi...
AllowedIPs = 10.210.0.0/24, fdcc:...:9275/128
Endpoint = 93.xxx:51820
PersistentKeepalive = 25
[Peer]
# amalthea
PublicKey = 9dKc...
AllowedIPs = 10.210.1.0/24, fdcc:...:c8b1/128
Endpoint = 65.xxx:51820
PersistentKeepalive = 25
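I applied that by hand with plain ip/wg commands, roughly as follows (the config file path is just where I saved the snippet above; the fdcc address is himalia's ManagementIP from machine.json, and I'm not certain of the prefix length uncloud normally assigns to it):
ip link del uncloud
ip link add uncloud type wireguard
wg setconf uncloud ./uncloud-manual.conf
ip addr add 10.210.2.1/24 dev uncloud
ip addr add fdcc:...:a5f/128 dev uncloud
ip link set uncloud up
I didn't add routes to the peer subnets manually; the daemon appears to manage those itself (per the "Updated routes to peers" log line), and things worked after I restarted it below.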
After that, the wireguard network started working properly both ways:
himalia:
interface: uncloud
public key: LnHx...
private key: (hidden)
listening port: 51820
peer: 1Wgi...
endpoint: 93.xxx:51820
allowed ips: 10.210.0.0/24, fdcc:...:9275/128
latest handshake: 1 minute, 26 seconds ago
transfer: 156 B received, 276 B sent
persistent keepalive: every 25 seconds
peer: 9dKc...
endpoint: 65.xxx:51820
allowed ips: 10.210.1.0/24, fdcc:...:c8b1/128
latest handshake: 1 minute, 26 seconds ago
transfer: 124 B received, 276 B sent
persistent keepalive: every 25 seconds
I restarted the uncloud service, and I was then able to create the volume I'd initially started with. I was also able to pussh (unregistry) an image to himalia (via a thebe connection from the CLI).
However, then I started seeing other issues:
- uc machine ls still shows it as "Down"
- a "global deploy" service did not "see" himalia
I was able to deploy the global service after adding an x-machines setting with all three machines listed, though, and it then deployed to himalia as well.
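Roughly what that workaround looked like in the compose file; the service name and image are placeholders, I'm going from memory on exactly where x-machines sits, and I've left out the global-mode declaration itself:
services:
  myservice:
    image: example/image:latest
    x-machines:
      - thebe
      - amalthea
      - himalia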
I then used docker exec ... sh on the thebe instance of the service and ran nslookup, and it only returned results for itself and amalthea. I knew the service IP for the himalia instance, so I pinged it, and that worked; I was also able to connect to a daemon running on it. So the network was working, it's just that the thebe (and amalthea) internal DNS servers didn't know about it?
I then docker exec'd into the himalia container and was able to connect to the thebe/amalthea instances, and its DNS listed all three IPs like I would expect.
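The check on each side was just something like this (the container ID, service name, and instance IP are placeholders; himalia's instance sits in its 10.210.2.0/24 subnet):
docker exec -it <container-id> sh
nslookup myservice     # on thebe this only returned the thebe and amalthea IPs
ping 10.210.2.x        # the himalia instance's IP, which did respond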
So I think the problem I have now is some internal state in uncloud that is out of sync: it thinks himalia is down, doesn't include it when iterating machines for global deploys or in the event listeners that feed the internal DNS, etc.