The value of "datacenter" for TLS ClientHello in a WAN setup is not dynamic? #5357
Description
Overview of the Issue
We're finishing up deployment of a two-DC setup, 5 servers in each DC, and we're at the final stage where we're bootstrapping TLS. We have managed to get a healthy status for each DC (LAN), but when it comes to WAN, the two clusters can't talk to each other: gossip works, but RPC doesn't.
We see the following in one of the clusters:
ubuntu@server03:~$ ./consul members -wan
Node Address Status Type Build Protocol DC Segment
server01.my-dc1 10.125.81.9:8302 alive server 1.2.3 2 my-dc1 <all>
server01.my-dc2 10.125.25.133:8302 alive server 1.2.3 2 my-dc2 <all>
server02.my-dc2 10.125.25.137:8302 alive server 1.2.3 2 my-dc2 <all>
server03.my-dc1 10.125.81.7:8302 alive server 1.2.3 2 my-dc1 <all>
server03.my-dc2 10.125.25.135:8302 alive server 1.2.3 2 my-dc2 <all>
server04.my-dc1 10.125.81.8:8302 alive server 1.2.3 2 my-dc1 <all>
server04.my-dc2 10.125.25.134:8302 alive server 1.2.3 2 my-dc2 <all>
server05.my-dc1 10.125.81.6:8302 alive server 1.2.3 2 my-dc1 <all>
server05.my-dc2 10.125.25.136:8302 alive server 1.2.3 2 my-dc2 <all>
ubuntu@server03:~$ ./consul members
Node Address Status Type Build Protocol DC Segment
server01 10.125.25.133:8301 alive server 1.2.3 2 my-dc2 <all>
server02 10.125.25.137:8301 alive server 1.2.3 2 my-dc2 <all>
server03 10.125.25.135:8301 alive server 1.2.3 2 my-dc2 <all>
server04 10.125.25.134:8301 alive server 1.2.3 2 my-dc2 <all>
server05 10.125.25.136:8301 alive server 1.2.3 2 my-dc2 <all>
But:
ubuntu@server03:~$ ./consul catalog nodes
Node ID Address DC
server01 b71357ef 10.125.25.133 my-dc2
server02 3975db7d 10.125.25.137 my-dc2
server03 714b613f 10.125.25.135 my-dc2
server04 6d9c580a 10.125.25.134 my-dc2
server05 3fdbda37 10.125.25.136 my-dc2
ubuntu@server03:~$ ./consul catalog nodes -datacenter=my-dc1
Error listing nodes: Unexpected response code: 500 (No path to datacenter)
my-dc1 / my-dc2 - we use a string in the format [a-z]{1,2}\-[a-z]{1,7}[0-9]+.
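For reference, the naming pattern quoted above can be sanity-checked with a quick grep (the names are the ones used in this report):

```shell
# Validate our datacenter names against the naming pattern
# [a-z]{1,2}-[a-z]{1,7}[0-9]+ described above.
for dc in my-dc1 my-dc2; do
  if echo "$dc" | grep -Eq '^[a-z]{1,2}-[a-z]{1,7}[0-9]+$'; then
    echo "$dc matches"
  else
    echo "$dc does NOT match"
  fi
done
```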
Reproduction Steps
- Set consul_domain to sd.example.com
- Set up 2 clusters with 5 nodes each
- Set the datacenter config variable to my-dc1 and my-dc2 on the 2 clusters
- Set server_name to server.my-dc1.sd.example.com on each node in my-dc1
- Set server_name to server.my-dc2.sd.example.com on each node in my-dc2
- Create TLS certificates using the consul tls commands from v1.4.1 (or openssl with a similar CN/SAN configuration):
  - consul tls ca create -domain=sd.example.com
  - consul tls cert create -dc=my-dc1 -domain=sd.example.com -server (repeat 5 times)
  - consul tls cert create -dc=my-dc2 -domain=sd.example.com -server (repeat 5 times)
- Set the ca_file, cert_file, and key_file variables on each node so that all nodes share the common CA file and each node uses its own cert and key files
- Start the nodes in the clusters
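Put together, the TLS-related agent settings from the steps above would look roughly like this on a my-dc1 node. This is a sketch: the file paths are assumptions, but datacenter, domain, ca_file, cert_file, key_file, server_name, verify_incoming, verify_outgoing, and verify_server_hostname are standard Consul agent configuration options.

```json
{
  "datacenter": "my-dc1",
  "domain": "sd.example.com",
  "server": true,
  "ca_file": "/etc/consul.d/tls/consul-agent-ca.pem",
  "cert_file": "/etc/consul.d/tls/my-dc1-server-consul-0.pem",
  "key_file": "/etc/consul.d/tls/my-dc1-server-consul-0-key.pem",
  "verify_incoming": true,
  "verify_outgoing": true,
  "verify_server_hostname": true,
  "server_name": "server.my-dc1.sd.example.com"
}
```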
NOTE: the certificate creation steps were based on the manual.
Consul info for both Client and Server
We see the following error messages on my-dc1 (the IP addresses in 10.125.25.0/24 belong to my-dc2):
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.134:35610
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.136:51210
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.137:48500
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.133:25610
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.135:31410
We dug a bit deeper (a.k.a. tcpdump) to try to decipher what "bad certificate" really means, and we noticed that when nodes from my-dc1 try to contact nodes of my-dc2, they send a ClientHello message with the server_name value server.my-dc1.<our_domain>. That obviously fails, because the nodes of my-dc2:
- don't have a CN containing server.my-dc1.<our_domain>, nor
- do they have a subjectAltName containing server.my-dc1.<our_domain>.
And thus the TLS handshake fails.
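One way to see the mismatch directly is to inspect the SANs of a my-dc2 server certificate with openssl. The sketch below generates a stand-in certificate shaped like the ones consul tls cert create -dc=my-dc2 produces (the paths and the exact SAN list are assumptions) and prints the names it is valid for:

```shell
# Create a stand-in server certificate for my-dc2, shaped like the output
# of `consul tls cert create -dc=my-dc2 -domain=sd.example.com -server`
# (requires OpenSSL 1.1.1+ for -addext; paths are illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/dc2-key.pem -out /tmp/dc2-cert.pem \
  -subj "/CN=server.my-dc2.sd.example.com" \
  -addext "subjectAltName=DNS:server.my-dc2.sd.example.com,DNS:localhost,IP:127.0.0.1"

# Print the names the certificate is valid for. Note that
# server.my-dc1.sd.example.com is absent, so a ClientHello carrying that
# server_name is rejected with "bad certificate".
openssl x509 -noout -ext subjectAltName -in /tmp/dc2-cert.pem
```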
To solve this problem we had to reissue all the certificates with every DC's name in the subjectAltName. This is a problem because for every new cluster added in a new DC, we need to reissue all certificates to include that new DC in the subjectAltName.
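The workaround we applied corresponds to issuing certificates whose SAN lists every datacenter's server name. With plain openssl that looks roughly like the sketch below (paths are illustrative, and the SAN list is an assumption); check consul tls cert create -h, since recent Consul releases may offer a flag for adding extra DNS names directly.

```shell
# Reissue a server certificate that is valid for BOTH datacenters by
# listing each DC's server name in the SAN (OpenSSL 1.1.1+; paths are
# illustrative). Every new DC would require regenerating these.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout /tmp/multi-dc-key.pem -out /tmp/multi-dc-cert.pem \
  -subj "/CN=server.my-dc2.sd.example.com" \
  -addext "subjectAltName=DNS:server.my-dc1.sd.example.com,DNS:server.my-dc2.sd.example.com,DNS:localhost,IP:127.0.0.1"

# Confirm both DC names are now present.
openssl x509 -noout -ext subjectAltName -in /tmp/multi-dc-cert.pem
```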
I suppose this is not the desired behavior. IMHO, when contacting an IP address in another DC, Consul, acting as a TLS client, should dynamically change the server_name value to match server.<dc_name>.<consul_domain> for the target DC.
Operating system and Environment details
Consul 1.2.3, Ubuntu 16.04