kvserver: circuit-break requests to unavailable ranges #33007
Closed
Labels
A-kv-recovery
A-kv-replication: Relating to Raft, consensus, and coordination.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
N-followup: Needs followup.
O-community: Originated from the community
O-postmortem: Originated from a Postmortem action item.
O-sre: For issues SRE opened or otherwise cares about tracking.
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
T-kv: KV Team
Description
Describe the problem
I'm testing a multi-region deployment. Following the performance tuning docs, I installed a multi-datacenter cluster. The datacenters are named DCPRI and DCSEC, with 3 virtual machines in each.
When I shut down the systems in DCSEC, the cluster in DCPRI becomes unresponsive: no SQL statements complete, they just wait. The admin GUI also times out.
Is this caused by my configuration, is it a bug, or is it expected behaviour?
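One way to reason about this question is through Raft quorum arithmetic: a range can only make progress while a majority of its replicas are reachable. The snippet below is a minimal sketch of that rule, not a diagnosis of this cluster. It assumes 3 replicas per range, and the replica placements are hypothetical; the helper names `quorum` and `range_available` are made up for illustration and are not CockroachDB APIs.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a Raft group with the given number of replicas."""
    return replicas // 2 + 1


def range_available(live: int, replicas: int) -> bool:
    """A range can make progress only while a majority of its replicas are live."""
    return live >= quorum(replicas)


# Hypothetical placements of a 3-replica range across the two datacenters,
# written as (replicas in DCPRI, replicas in DCSEC). These are assumptions,
# not values read from the cluster described in this issue.
placements = [(2, 1), (1, 2)]

for in_dcpri, in_dcsec in placements:
    total = in_dcpri + in_dcsec
    # After DCSEC goes down, only the DCPRI replicas remain live.
    ok = range_available(live=in_dcpri, replicas=total)
    print(f"DCPRI={in_dcpri} DCSEC={in_dcsec} -> available after DCSEC outage: {ok}")
```

Under these assumptions, any range that keeps two of its three replicas in DCSEC loses its majority when DCSEC goes down, and requests that touch such a range would wait rather than fail quickly.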
Here are the start parameters of the cockroach instances:
# DCPRI/Node 1
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.101 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 4
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.102 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# DCPRI/Node 3
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcpri --listen-addr=10.35.14.103 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 4 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.104 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 5 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.105 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
# Node 6 DCSEC
ExecStart=/usr/local/bin/cockroach start --store=path=/data/db/database,attrs=hdd --certs-dir=/data/db/certs --log-dir=/logs/CRDB --port=26257 --http-port=8088 --locality=region=ist,datacenter=dcsec --listen-addr=10.35.14.106 --join=10.10.14.101,10.10.14.102,10.10.14.103,10.10.14.104,10.10.14.105,10.10.14.106 --cache=.35 --max-sql-memory=.25
Logs from the nodes in DCPRI
DCPRI/Node1:
W181210 15:30:45.781750 136672 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:30:45.781790 139 storage/store.go:3910 [n1,s1] handle raft ready: 1.3s [processed=0]
I181210 15:30:46.795946 136580 server/authentication.go:374 Web session error: http: named cookie not present
I181210 15:30:46.799351 136228 server/authentication.go:374 Web session error: http: named cookie not present
W181210 15:30:46.909608 136729 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.104:26257: connect: no route to host". Reconnecting...
W181210 15:30:46.909690 136729 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.104:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:30:46.909704 136729 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.104:26257: grpc: the connection is closing; please retry.
W181210 15:30:46.909709 142 storage/store.go:3910 [n1,s1] handle raft ready: 1.8s [processed=0]
DCPRI/Node4:
W181210 15:31:22.968999 71750 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
I181210 15:31:22.969066 136 rpc/nodedialer/nodedialer.go:91 [n4] unable to connect to n6: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host"
W181210 15:31:22.969068 71750 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:22.969084 136 storage/store.go:3910 [n4,s4] handle raft ready: 1.1s [processed=0]
W181210 15:31:22.969095 71750 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.105:26257: context canceled; please retry.
W181210 15:31:23.772921 71770 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:23.772980 71770 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:23.772988 71770 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:23.773029 112 storage/store.go:3910 [n4,s4] handle raft ready: 1.3s [processed=0]
DCPRI/Node3:
I181210 15:31:50.649944 139 server/status/runtime.go:465 [n3] runtime stats: 253 MiB RSS, 224 goroutines, 140 MiB/18 MiB/175 MiB GO alloc/idle/total, 56 MiB/72 MiB CGO alloc/total, 24.1 CGO/sec, 0.7/0.3 %(u/s)time, 0.0 %gc (0x), 108 KiB/80 KiB (r/w)net
W181210 15:31:50.941579 69897 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.106:26257: connect: no route to host". Reconnecting...
W181210 15:31:50.941670 69897 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.106:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:50.941683 69897 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.106:26257: grpc: the connection is closing; please retry.
W181210 15:31:51.217542 69854 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.10.14.105:26257: connect: no route to host". Reconnecting...
W181210 15:31:51.217615 31 storage/store.go:3910 [n3,s3] handle raft ready: 1.2s [processed=0]
W181210 15:31:51.217622 69854 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {10.10.14.105:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W181210 15:31:51.217648 69854 vendor/google.golang.org/grpc/clientconn.go:942 Failed to dial 10.10.14.105:26257: context canceled; please retry.
Epic: CRDB-2553
Jira issue: CRDB-6349