stability: slow gossip convergence after restart #7668
Closed
Labels
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Description
While updating the beta cluster to feb9240, the cluster was unavailable for about 20 minutes with what looked like gossip problems. The logs were full of sequences like this:
ubuntu@104.196.5.28: W160706 21:37:37.263220 gossip/gossip.go:902 first range unavailable or cluster not initialized
ubuntu@104.196.5.28: I160706 21:37:37.692083 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})
ubuntu@104.196.5.28: I160706 21:37:38.262145 gossip/gossip.go:926 starting client to 104.196.97.139:26257
ubuntu@104.196.5.28: I160706 21:37:38.263148 gossip/client.go:89 closing client to 104.196.97.139:26257: rpc error: code = 2 desc = duplicate connection from node at {tcp cockroach-beta-5:26257}
ubuntu@104.196.5.28: W160706 21:37:38.263176 gossip/gossip.go:902 first range unavailable or cluster not initialized
ubuntu@104.196.5.28: I160706 21:37:39.262417 gossip/gossip.go:926 starting client to 104.196.24.126:26257
ubuntu@104.196.5.28: I160706 21:37:44.730284 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 1 ({tcp cockroach-beta-1:26257})
ubuntu@104.196.5.28: I160706 21:37:51.749279 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})
ubuntu@104.196.5.28: I160706 21:37:57.758278 gossip/server.go:188 refusing gossip from node 2 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})
ubuntu@104.196.5.28: I160706 21:37:58.771544 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})
ubuntu@104.196.5.28: W160706 21:38:00.770932 gossip/gossip.go:902 first range unavailable or cluster not initialized
ubuntu@104.196.5.28: I160706 21:38:00.771034 gossip/gossip.go:926 starting client to 104.196.0.165:26257
ubuntu@104.196.5.28: I160706 21:38:00.772917 gossip/client.go:87 closing client to node 3 (104.196.0.165:26257): received forward from node 3 to 6 (cockroach-beta-2:26257); already have active connection, skipping
ubuntu@104.196.5.28: W160706 21:38:00.772946 gossip/gossip.go:902 first range unavailable or cluster not initialized
ubuntu@104.196.5.28: I160706 21:38:01.771390 gossip/gossip.go:926 starting client to 104.196.5.28:26257
ubuntu@104.196.5.28: I160706 21:38:01.772421 gossip/client.go:87 closing client to node 4 (104.196.5.28:26257): stopping outgoing client to node 4 (104.196.5.28:26257); loopback connection
ubuntu@104.196.5.28: W160706 21:38:01.772443 gossip/gossip.go:902 first range unavailable or cluster not initialized
ubuntu@104.196.5.28: I160706 21:38:02.771596 gossip/gossip.go:926 starting client to 104.196.40.130:26257
ubuntu@104.196.5.28: I160706 21:38:02.773028 gossip/client.go:89 closing client to 104.196.40.130:26257: rpc error: code = 2 desc = duplicate connection from node at {tcp cockroach-beta-5:26257}
This indicates that the gossip network was not converging, yet the nodes believed they had too many connections to accept new ones. The problem eventually corrected itself. I haven't done a thorough investigation of the logs to see if there are any other clues. The problem lasted from around 21:26 to 21:38.
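To get a rough picture of which nodes were being refused and where they were being forwarded, the refusal lines could be tallied with a small script. This is only a sketch: the regular expression is inferred from the sample lines quoted above (not from the CockroachDB source), and `tally_refusals` and `sample` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Pattern for the "refusing gossip" lines quoted in this issue. The field
# layout is an assumption based on those sample lines only.
REFUSAL = re.compile(
    r"refusing gossip from node (?P<src>\d+) \(max (?P<max>\d+) conns\); "
    r"forwarding to (?P<dst>\d+)"
)


def tally_refusals(lines):
    """Count (refused node, forward target) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REFUSAL.search(line)
        if match:
            counts[(int(match["src"]), int(match["dst"]))] += 1
    return counts


# A few lines copied from the log excerpt above.
sample = [
    "W160706 21:37:37.263220 gossip/gossip.go:902 first range unavailable or cluster not initialized",
    "I160706 21:37:37.692083 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})",
    "I160706 21:37:44.730284 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 1 ({tcp cockroach-beta-1:26257})",
    "I160706 21:37:51.749279 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})",
    "I160706 21:37:57.758278 gossip/server.go:188 refusing gossip from node 2 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})",
    "I160706 21:37:58.771544 gossip/server.go:188 refusing gossip from node 6 (max 3 conns); forwarding to 3 ({tcp cockroach-beta-3:26257})",
]

if __name__ == "__main__":
    for (src, dst), n in tally_refusals(sample).most_common():
        print(f"node {src} refused and forwarded to node {dst}: {n} time(s)")
```

Run over the full logs from each node, a tally like this might show whether the same few nodes were being bounced between the same forward targets for the whole outage.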
cc @arjunravinarayan