
The value of "datacenter" for TLS ClientHello in a WAN setup is not dynamic? #5357

@splashx

Description


Overview of the Issue

We're finishing up deployment of a two-DC setup, 5 servers in each DC, and we're at the final stage: bootstrapping TLS. Each DC reports a healthy status on its own (LAN), but over the WAN the two clusters can't talk to each other: gossip works, but RPC doesn't.

We see the following in one of the clusters:

ubuntu@server03:~$ ./consul members -wan
Node                 Address             Status  Type    Build  Protocol  DC          Segment
server01.my-dc1  10.125.81.9:8302    alive   server  1.2.3  2         my-dc1  <all>
server01.my-dc2  10.125.25.133:8302  alive   server  1.2.3  2         my-dc2  <all>
server02.my-dc2  10.125.25.137:8302  alive   server  1.2.3  2         my-dc2  <all>
server03.my-dc1  10.125.81.7:8302    alive   server  1.2.3  2         my-dc1  <all>
server03.my-dc2  10.125.25.135:8302  alive   server  1.2.3  2         my-dc2  <all>
server04.my-dc1  10.125.81.8:8302    alive   server  1.2.3  2         my-dc1  <all>
server04.my-dc2  10.125.25.134:8302  alive   server  1.2.3  2         my-dc2  <all>
server05.my-dc1  10.125.81.6:8302    alive   server  1.2.3  2         my-dc1  <all>
server05.my-dc2  10.125.25.136:8302  alive   server  1.2.3  2         my-dc2  <all>
ubuntu@server03:~$ ./consul members 
Node      Address             Status  Type    Build  Protocol  DC          Segment
server01  10.125.25.133:8301  alive   server  1.2.3  2         my-dc2  <all>
server02  10.125.25.137:8301  alive   server  1.2.3  2         my-dc2  <all>
server03  10.125.25.135:8301  alive   server  1.2.3  2         my-dc2  <all>
server04  10.125.25.134:8301  alive   server  1.2.3  2         my-dc2  <all>
server05  10.125.25.136:8301  alive   server  1.2.3  2         my-dc2  <all>

But:

ubuntu@server03:~$ ./consul catalog nodes 
Node      ID        Address        DC
server01  b71357ef  10.125.25.133  my-dc2
server02  3975db7d  10.125.25.137  my-dc2
server03  714b613f  10.125.25.135  my-dc2
server04  6d9c580a  10.125.25.134  my-dc2
server05  3fdbda37  10.125.25.136  my-dc2
ubuntu@server03:~$ ./consul catalog nodes -datacenter=my-dc1
Error listing nodes: Unexpected response code: 500 (No path to datacenter)

⚠️ It's important to note that we don't actually use my-dc1 / my-dc2; our real DC names are strings in the format [a-z]{1,2}\-[a-z]{1,7}[0-9]+.

Reproduction Steps

⚠️ The certificates were created using Consul 1.4.1 (we had gone through several unsuccessful rounds of cert generation earlier, but that's unrelated):

  • consul_domain: sd.example.com

  1. Set up 2 clusters with 5 nodes each.
  2. Set the datacenter config variable to my-dc1 and my-dc2 on the 2 clusters.
  3. Set server_name to server.my-dc1.sd.example.com on each node in my-dc1.
  4. Set server_name to server.my-dc2.sd.example.com on each node in my-dc2.
  5. Create TLS certificates using the consul tls commands from v1.4.1 (or openssl with a similar CN/SAN configuration):
     • consul tls ca create -domain=sd.example.com
     • consul tls cert create -dc=my-dc1 -domain=sd.example.com -server (repeat 5 times)
     • consul tls cert create -dc=my-dc2 -domain=sd.example.com -server (repeat 5 times)
  6. Set the ca_file, cert_file, and key_file variables on each node: the common CA file plus each node's own cert and key files.
  7. Start the nodes in both clusters.
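For reference, the TLS settings from the steps above end up in the agent configuration roughly like this (a sketch for one my-dc1 server; the file paths are assumptions, and the certificate file names follow what consul tls cert create emits):

```json
{
  "datacenter": "my-dc1",
  "domain": "sd.example.com",
  "verify_incoming": true,
  "verify_outgoing": true,
  "verify_server_hostname": true,
  "server_name": "server.my-dc1.sd.example.com",
  "ca_file": "/etc/consul.d/consul-agent-ca.pem",
  "cert_file": "/etc/consul.d/my-dc1-server-consul-0.pem",
  "key_file": "/etc/consul.d/my-dc1-server-consul-0-key.pem"
}
```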

NOTE: the certificate creation steps were based on the manual.

Consul info for both Client and Server

We see the following error messages on my-dc1 (the IP addresses in 10.125.25.0/24 belong to my-dc2):

consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.134:35610
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.136:51210
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.137:48500
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.133:25610
consul.rpc: failed to read byte: remote error: tls: bad certificate from=10.125.25.135:31410

We dug a bit deeper (a.k.a. tcpdump) to decipher what "bad certificate" really means, and noticed that when nodes from my-dc1 try to contact nodes in my-dc2, they send a ClientHello with a server_name value of server.my-dc1.<our_domain>. That will obviously fail, because the my-dc2 nodes:

  • don't have a CN containing server.my-dc1.<our_domain>, nor
  • a subjectAltName containing server.my-dc1.<our_domain>.

And thus the TLS handshake fails.

To solve this problem we had to reissue all the certificates with every DC listed in the subjectAltName. This is a problem because every time a new cluster is added in a new DC, all certificates must be reissued to include that new DC in the subjectAltName.
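The failure and the workaround can be reproduced outside Consul with Go's own hostname verification, which is what rejects the handshake. A minimal sketch (buildCert is a hypothetical helper that self-signs a cert with the given SAN entries; it is not part of Consul):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// buildCert self-signs a throwaway certificate whose SAN (DNSNames)
// contains the given names. Illustration only.
func buildCert(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsNames[0]},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// A my-dc2 server cert listing only its own DC in the SAN, as
	// produced by "consul tls cert create -dc=my-dc2 ...".
	cert, _ := buildCert([]string{"server.my-dc2.sd.example.com"})
	// A my-dc1 client verifies the my-dc1 name: rejected, i.e. the
	// "bad certificate" alert from the logs above.
	fmt.Println(cert.VerifyHostname("server.my-dc1.sd.example.com") == nil) // false
	// The workaround: reissue with every DC's name in the SAN.
	fixed, _ := buildCert([]string{
		"server.my-dc2.sd.example.com",
		"server.my-dc1.sd.example.com",
	})
	fmt.Println(fixed.VerifyHostname("server.my-dc1.sd.example.com") == nil) // true
}
```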

I suppose this is not the desired behavior. IMHO, when contacting an IP address in another DC, Consul, acting as a TLS client, should dynamically set the server_name value to server.<dc_name>.<consul_domain> for the target DC.

Operating system and Environment details

Consul 1.2.3, Ubuntu 16.04

Metadata


    Labels

    theme/tls: Using TLS (Transport Layer Security) or mTLS (mutual TLS) to secure communication
    type/bug: Feature does not function as expected
