Description
I have a cluster, currently with two public machines:
- thebe (IP: 93.xxx)
- amalthea (IP: 65.xxx)
I've just set up a server at home, behind a NAT, and want to add it:
- himalia (external IP: 174.xxx, internal IP: 168.xxx)
(Note: I worked through some of these issues last night, and I'm trying to document my steps here as best as I can from notes, so some details might be slightly off from what I actually did.)
I added himalia with:
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy
In case it's relevant, I had manually installed docker (apt install docker.io) on himalia before adding it to uncloud.
That ran without issues. I noticed in uc machine ls that the state was listed as Down and that it had a mix of internal and external (NAT) IPs, both v4 and v6, in its WireGuard endpoints. It looked roughly like this:
himalia Down 10.210.2.1/24 174.xxx 192.xxx:51820, [fdc3:xxx]:51820, [2600:xxx]:51820, [2600:xxx:yyy]:51820, 174.xxx:51820 4665xxx
I then tried doing something with it (uc volume create, iirc), which failed. uc connected through one of the public servers and returned an error about not being able to connect to himalia, I think.
I then started debugging on the instances, looking at the data under /var/lib/uncloud on the various machines. In corrosion/store.db, I believe thebe/amalthea had an entry for himalia in their machines table, but himalia's table only had itself.
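For reference, this is roughly how I compared them on each machine (I just dumped the whole machines table rather than guessing at its columns; the table name is what I remember seeing):
sqlite3 /var/lib/uncloud/corrosion/store.db 'SELECT * FROM machines;'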
I then started looking at the wireguard network, and it was similar:
thebe:
interface: uncloud
public key: 1Wgi...
private key: (hidden)
listening port: 51820
peer: 9dKc...
endpoint: 65.xxx:51820
allowed ips: fdcc:...:c8b1/128, 10.210.1.0/24
latest handshake: 1 minute, 7 seconds ago
transfer: 964.68 MiB received, 948.41 MiB sent
persistent keepalive: every 25 seconds
peer: FvIQ...
endpoint: [2600:...:3eb4]:51820
allowed ips: fdcc:...:452d/128, 10.210.2.0/24
latest handshake: 2 hours, 46 minutes, 9 seconds ago
transfer: 33.36 KiB received, 209.48 KiB sent
persistent keepalive: every 25 seconds
amalthea:
interface: uncloud
public key: 9dKc...
private key: (hidden)
listening port: 51820
peer: 1Wgi...
endpoint: 93.xxx:51820
allowed ips: fdcc:...:9275/128, 10.210.0.0/24
latest handshake: 1 minute, 29 seconds ago
transfer: 948.63 MiB received, 966.63 MiB sent
persistent keepalive: every 25 seconds
peer: FvIQ...
endpoint: [2600:...:fef9]:51820
allowed ips: fdcc:...:452d/128, 10.210.2.0/24
latest handshake: 2 hours, 48 minutes, 33 seconds ago
transfer: 21.02 KiB received, 220.53 KiB sent
persistent keepalive: every 25 seconds
himalia:
interface: uncloud
public key: FvIQ...
private key: (hidden)
listening port: 51820
So thebe/amalthea knew about himalia, but himalia didn't know about them, or something like that? I also saw that the himalia endpoint reported by thebe/amalthea in WireGuard was cycling through the various endpoints I saw in uc machine ls.
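This is roughly how I watched the endpoint cycling, run on thebe/amalthea (interface name taken from the output above):
watch -n 2 wg show uncloud endpoints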
I looked at my router and realized it did not have port triggering enabled by default, so I added a fixed UDP port forward for 51820 to himalia's (assigned) internal 192.xxx IP.
It didn't recover on its own, so I did a uc machine rm and re-added it. This second time, I also added a --public-ip none option:
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy --public-ip none
But everything looked pretty much the same after that. I also tried again, this time deleting the /var/lib/uncloud directory on himalia after removing the machine and before re-adding it, but kept getting the same result.
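Reconstructed from notes, the remove/re-add cycle looked roughly like this (the exact uc machine rm syntax is from memory, and the wipe was run on himalia over ssh):
uc machine rm himalia
ssh admin@192.xxx 'sudo systemctl stop uncloud && sudo rm -rf /var/lib/uncloud'
uc machine add -n himalia admin@192.xxx -i ~/.ssh/key --no-caddy --public-ip none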
Then I started digging into the networking more. To confirm the port forwarding, I disabled himalia's wireguard interface and listened on 51820 directly:
nc -luv -p 51820
Bound on 0.0.0.0 51820
Connection received on 65.xxx 51820
��r��;�ı��"P"��,���K���X��N��N;+����u�V��%�u�K{��7�...
That was amalthea. I ran it a couple of times and also saw connections and data from thebe (93.xxx), so the port forward was passing traffic to himalia correctly.
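If anyone wants to reproduce that check without relying on the daemons' own handshake attempts, sending a UDP packet from one of the public machines should show up in the listener above (174.xxx being himalia's external IP):
echo ping | nc -u -w1 174.xxx 51820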
I then started trying to modify himalia's /var/lib/uncloud/machine.json directly. It initially looked like this:
{
"ID": "4665...",
"Name": "himalia",
"Network": {
"Subnet": "10.210.2.0/24",
"ManagementIP": "fdcc:...:a5f",
"PrivateKey": "...",
"PublicKey": "..."
}
}
While thebe/amalthea had a "Peers" section listing the other public machine plus himalia, like this:
"Peers": [
{
"Subnet": "10.210.1.0/24",
"ManagementIP": "fdcc:...:c8b1",
"Endpoint": "65....:51820",
"AllEndpoints": [
"65....:51820"
],
"PublicKey": "..."
},
{
"Subnet": "10.210.2.0/24",
"ManagementIP": "fdcc:...:a5f",
"Endpoint": "[fdc3:...:3eb4]:51820",
"AllEndpoints": [
"192.xxx:51820",
"[fdc3:...:b94e:...:3eb4]:51820",
"[2600:...::fef9]:51820",
"[2600:...:429c:...:3eb4]:51820",
"174.xxx:51820"
],
"PublicKey": "..."
}
]
So I shut down the uncloud daemon (systemctl stop uncloud) and then edited himalia's machine.json to add a "Peers" section with the entries for thebe/amalthea, copied from those servers.
Then I restarted the daemon, and it immediately overwrote that file back to the original (peer-less) version.
Thinking it might actually be reading that from corrosion/store.db rather than the JSON file, I tried the same thing again but also copied store.db from thebe (using sqlite3_rsync). That had no visible effect either: store.db kept the copied entries, but machine.json was wiped again and the WireGuard interface still had no peers.
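For reference, the copy was roughly this, run on himalia while re-doing the machine.json edit in the same window (the remote user and paths are from memory):
systemctl stop uncloud
sqlite3_rsync admin@thebe:/var/lib/uncloud/corrosion/store.db /var/lib/uncloud/corrosion/store.db
systemctl start uncloud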
This was a snippet from the uncloud.service logs around that time:
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Configured WireGuard interface. name=uncloud
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Updated addresses of the WireGuard interface. name=uncloud add>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Updated routes to peers via the WireGuard interface. name=uncl>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Subscribed to container changes in the cluster to keep DNS rec>
Oct 25 06:28:55 himalia uncloudd[7738]: INFO Subscribed to container changes in the cluster to generate Cad>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG DNS records updated. component=dns-resolver services=4 contain>
Oct 25 06:28:55 himalia uncloudd[7738]: DEBUG Caddy is not running on this machine, skipping configuration l>
The "Removed route to peer" bit made me think it might be using the wireguard interface itself as the canonical source at that point, so I then removed the uncloud interface and created a new one manually with the proper peer configuration.
[Interface]
PrivateKey = <hidden>
ListenPort = 51820
[Peer]
# thebe
PublicKey = 1Wgi...
AllowedIPs = 10.210.0.0/24, fdcc:...:9275/128
Endpoint = 93.xxx:51820
PersistentKeepalive = 25
[Peer]
# amalthea
PublicKey = 9dKc...
AllowedIPs = 10.210.1.0/24, fdcc:...:c8b1/128
Endpoint = 65.xxx:51820
PersistentKeepalive = 25
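I applied that by hand with plain ip/wg commands, roughly as follows (the config file path is just where I saved the snippet above; the fdcc address is himalia's ManagementIP from machine.json, and I'm not certain of the prefix length uncloud normally assigns to it):
ip link del uncloud
ip link add uncloud type wireguard
wg setconf uncloud ./uncloud-manual.conf
ip addr add 10.210.2.1/24 dev uncloud
ip addr add fdcc:...:a5f/128 dev uncloud
ip link set uncloud up
I didn't add routes to the peer subnets manually; the daemon appears to manage those itself (per the "Updated routes to peers" log line), and things worked after I restarted it below.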
After that, the wireguard network started working properly both ways:
himalia:
interface: uncloud
public key: LnHx...
private key: (hidden)
listening port: 51820
peer: 1Wgi...
endpoint: 93.xxx:51820
allowed ips: 10.210.0.0/24, fdcc:...:9275/128
latest handshake: 1 minute, 26 seconds ago
transfer: 156 B received, 276 B sent
persistent keepalive: every 25 seconds
peer: 9dKc...
endpoint: 65.xxx:51820
allowed ips: 10.210.1.0/24, fdcc:...:c8b1/128
latest handshake: 1 minute, 26 seconds ago
transfer: 124 B received, 276 B sent
persistent keepalive: every 25 seconds
I restarted the uncloud service, and I was then able to create the volume I'd initially started with. I was also able to pussh (unregistry) an image to himalia (via a thebe connection from the CLI).
However, then I started seeing other issues:
- uc machine ls still shows it as "Down"
- a "global deploy" service did not "see" himalia
I was able to deploy the global service after adding an x-machines setting with all three machines listed, though, and it then deployed to himalia as well.
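Roughly what that workaround looked like in the compose file; the service name and image are placeholders, I'm going from memory on exactly where x-machines sits, and I've left out the global-mode declaration itself:
services:
  myservice:
    image: example/image:latest
    x-machines:
      - thebe
      - amalthea
      - himalia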
I then used docker exec ... sh on the thebe instance of the service and ran nslookup, and it only returned results for itself and amalthea. I knew the service IP for the himalia instance, so I pinged it, and that worked; I was also able to connect to a daemon running on it. So the network was working, it's just that the thebe (and amalthea) internal DNS servers didn't know about it?
I then docker exec'd into the himalia container and was able to connect to the thebe/amalthea instances, and its DNS listed all three IPs like I would expect.
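The check on each side was just something like this (the container ID, service name, and instance IP are placeholders; himalia's instance sits in its 10.210.2.0/24 subnet):
docker exec -it <container-id> sh
nslookup myservice     # on thebe this only returned the thebe and amalthea IPs
ping 10.210.2.x        # the himalia instance's IP, which did respond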
So I think the problem I have now is some internal state in uncloud that is out of sync: it thinks himalia is down, doesn't include it when iterating machines for global deploys or in the event listeners that feed the internal DNS, etc.