locality-advertise-addr is not working #42741
Description
Running a 9-node cluster on GCP with CockroachDB 19.2.0.
./cockroach start \
  --cache=25% \
  --max-sql-memory=35% \
  --background \
  --locality=cloud=gcp,region=us-east1,datacenter=us-east1-c \
  --store=path=/mnt/d1,attrs=ssd,size=90% \
  --log-dir=log \
  --certs-dir=certs \
  --max-disk-temp-storage=100GB \
  --locality-advertise-addr=cloud=gcp@{Private IP},region=us-east1@{Private IP},datacenter=us-east1-c@{Private IP} \
  --join={N1 Private IP},{N2 Private IP},{Nx Private IP} \
  --advertise-addr={Public IP}
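To make the intent of the flags above concrete, here is a minimal Python sketch of how a `--locality-advertise-addr` value breaks down into `tier=value@address` entries, and which address a peer would be expected to dial. It is not CockroachDB's code. The function names are hypothetical, the IP addresses are documentation placeholders, and the first-matching-tier rule is an assumption based on the documented behavior of the flag.

```python
def parse_locality(value):
    """Turn 'cloud=gcp,region=us-east1' into {'cloud': 'gcp', 'region': 'us-east1'}."""
    return dict(item.split("=", 1) for item in value.split(","))


def parse_locality_advertise_addr(value):
    """Turn 'key=val@addr,key=val@addr' into a list of ((key, val), addr) entries."""
    entries = []
    for item in value.split(","):
        tier, _, addr = item.partition("@")
        key, _, val = tier.partition("=")
        entries.append(((key, val), addr))
    return entries


def address_for_peer(default_addr, locality_addrs, peer_locality):
    """Pick the address a peer should dial.

    Assumption: the first entry whose tier matches the peer's own locality
    wins; otherwise the peer falls back to the --advertise-addr value.
    """
    for (key, val), addr in locality_addrs:
        if peer_locality.get(key) == val:
            return addr
    return default_addr


# Placeholder addresses standing in for {Private IP} and {Public IP}.
private_ip, public_ip = "10.0.0.4", "203.0.113.4"
addrs = parse_locality_advertise_addr(
    f"cloud=gcp@{private_ip},region=us-east1@{private_ip},datacenter=us-east1-c@{private_ip}"
)
same_region_peer = parse_locality("cloud=gcp,region=us-east1,datacenter=us-east1-c")
other_cloud_peer = parse_locality("cloud=aws,region=eu-west1")

print(address_for_peer(public_ip, addrs, same_region_peer))
print(address_for_peer(public_ip, addrs, other_cloud_peer))
```

Under that assumption, every node in this cluster shares the same locality, so each peer would be expected to dial the private address rather than the public one.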
After starting all nodes, they all appear healthy.
However, on the network diagnostics page
I confirmed that all nodes are in the same region.
If I shut down the cluster and restart it, the network diagnostics page becomes
On the problematic node, the log is spammed with entries like these:
W191125 17:58:47.883009 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N3}:26257: i/o timeout". Reconnecting...
I191125 17:58:48.663426 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:48.663437 20622 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N9}:26257 event: BreakerTripped
W191125 17:58:48.883192 19657 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I191125 17:58:52.045207 187 server/status/runtime.go:498 [n4] runtime stats: 5.0 GiB RSS, 363 goroutines, 174 MiB/60 MiB/271 MiB GO alloc/idle/total, 4.1 GiB/4.8 GiB CGO alloc/total, 91.6 CGO/sec, 14.8/0.8 %(u/s)time, 0.0 %gc (1x), 606 KiB/456 KiB (r/w)net
W191125 17:58:52.057512 182 server/node.go:745 [n4] [n4,s4]: unable to compute metrics: [n4,s4]: system config not yet available
W191125 17:58:52.217886 161 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
  /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
  /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
  /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:57.217893 162 storage/replica_range_lease.go:554 can't determine lease status due to node liveness error: node not in the liveness table
  github.com/cockroachdb/cockroach/pkg/storage.init.ializers
  /go/src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go:44
  runtime.main
  /usr/local/go/src/runtime/proc.go:188
  runtime.goexit
  /usr/local/go/src/runtime/asm_amd64.s:1337
W191125 17:58:58.008692 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N2}:26257: i/o timeout". Reconnecting...
I191125 17:58:58.361063 19445 storage/store_snapshot.go:978 [n4,raftsnapshot,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] sending LEARNER snapshot fcabe123 at applied index 2404159
I191125 17:58:58.517305 155 storage/store_remove_replica.go:129 [n4,s4,r262/3:/Table/60/2/"5{9aaca…-b073e…}] removing replica r262/3
W191125 17:58:59.008852 20241 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191125 17:58:59.008859 20391 vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {{Public IP N8}:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp {Public IP N8}:26257: i/o timeout". Reconnecting...
I191125 17:58:59.010597 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 tripped: initial connection heartbeat failed: operation "rpc heartbeat" timed out after 6s: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I191125 17:58:59.010610 21293 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n4] circuitbreaker: gossip [::]:26257->{Public IP N3}:26257 event: BreakerTripped
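As a rough aid for triaging a log like the one above, the following Python sketch tallies which peer addresses the node is failing to dial. It only assumes the gRPC `failed to connect to {<host>:<port> 0 <nil>}` wording seen in these entries. The function name and the regular expression are hypothetical helpers, not part of any CockroachDB tooling.

```python
import re
from collections import Counter

# Matches the target inside 'failed to connect to {<host>:<port> 0 <nil>}'.
DIAL_FAILURE = re.compile(r"failed to connect to \{(.+?):(\d+) 0 <nil>\}")


def count_dial_failures(lines):
    """Count failed gRPC dials per 'host:port' target across log lines."""
    counts = Counter()
    for line in lines:
        # findall also copes with several entries run together on one line.
        for host, port in DIAL_FAILURE.findall(line):
            counts[f"{host}:{port}"] += 1
    return counts


# Shortened sample entries using the placeholders from this report.
sample = [
    'W191125 17:58:47.883009 19657 vendor/google.golang.org/grpc/clientconn.go:1206 '
    'grpc: addrConn.createTransport failed to connect to {{Public IP N3}:26257 0 <nil>}. '
    'Err :connection error: desc = "transport: Error while dialing dial tcp '
    '{Public IP N3}:26257: i/o timeout". Reconnecting...',
    'W191125 17:58:58.008692 20241 vendor/google.golang.org/grpc/clientconn.go:1206 '
    'grpc: addrConn.createTransport failed to connect to {{Public IP N2}:26257 0 <nil>}. '
    'Err :connection error: desc = "transport: Error while dialing dial tcp '
    '{Public IP N2}:26257: i/o timeout". Reconnecting...',
]

for target, n in count_dial_failures(sample).most_common():
    print(target, n)
```

If every target in the tally is a public address, the node is dialing peers through `--advertise-addr` rather than the locality-specific private addresses.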
If I start all nodes with --advertise-addr={Private IP}, everything goes back to normal.
Jira issue: CRDB-5327