Description:
The topology is:
A seed node connected to several validator nodes (exactly two in this setup).
The two validator nodes are configured as persistent peers of the seed node.
After running for several minutes, the seed node loses all connections to the validators and tries to reconnect, but every attempt fails, and the seed node makes no consensus progress (no new blocks) any more.
Steps To Reproduce:
- Deploy a validator network.
- Deploy a seed node (seed_mode set to true) connected to the core network.
- Wait for some time, or use iptables to induce network jitter.
Supporting Material/References:
We checked the validator logs:
I[2019-02-01|13:32:37.737] Starting Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:37.737] Starting MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl=MConn{10.201.41.209:2494}
I[2019-02-01|13:32:37.739] Added peer module=p2p peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
E[2019-02-01|13:32:38.125] Stopping peer for error module=p2p peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer (77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e) sent next PEX request too soon. lastReceived: 2019-02-01 13:32:38.016135243 +0000 UTC m=+41.575722596, now: 2019-02-01 13:32:38.125079205 +0000 UTC m=+41.684666557, minInterval: 10s. Disconnecting"
I[2019-02-01|13:32:38.125] Stopping Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.125] Stopping MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl=MConn{10.201.41.209:2494}
I[2019-02-01|13:32:38.133] Stopping gossipDataRoutine for peer module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.134] Stopping gossipVotesRoutine for peer module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.930] Starting Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.139:27146 impl="Peer{MConn{10.201.43.54:2695} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.930] Starting MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.139:27146 impl=MConn{10.201.43.54:2695}
I[2019-02-01|13:32:39.157] Added peer module=p2p peer="Peer{MConn{10.201.43.54:2695} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:39.159] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:32:39.737] Stopping queryMaj23Routine for peer module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:06.788] Dialing peer module=p2p address=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@52.196.133.127:27146
E[2019-02-01|13:33:06.789] Dialing failed module=pex addr=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@52.196.133.127:27146 err="auth failure: secrect conn failed: EOF" attempts=0
I[2019-02-01|13:33:09.127] Starting Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.127] Starting MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
I[2019-02-01|13:33:09.227] Stopping MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
E[2019-02-01|13:33:09.227] Stopping peer for error module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in} has no state"
I[2019-02-01|13:33:09.227] Stopping Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.402] Added peer module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping queryMaj23Routine for peer module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping gossipVotesRoutine for peer module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping gossipDataRoutine for peer module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.570] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:09.836] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.103] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.367] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.639] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.178] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.448] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.719] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.982] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.245] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.288] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.509] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.784] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:13.046] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
We can see that the incoming connection is disconnected by the validator because it reports a duplicate ID.
We can also see that the sequence is wrong: "Added peer" appears after "Stopping Peer":
I[2019-02-01|13:33:09.127] Starting MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
I[2019-02-01|13:33:09.227] Stopping MConnection module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
E[2019-02-01|13:33:09.227] Stopping peer for error module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in} has no state"
I[2019-02-01|13:33:09.227] Stopping Peer module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.402] Added peer module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
Referring to the code, sw.peers.Add(p) happens after sw.startInitPeer(p):
func (sw *Switch) addPeer(p Peer) error {
	if err := sw.filterPeer(p); err != nil {
		return err
	}

	p.SetLogger(sw.Logger.With("peer", p.NodeInfo().NetAddress()))

	// Handle the shut down case where the switch has stopped but we're
	// concurrently trying to add a peer.
	if sw.IsRunning() {
		// All good. Start peer
		if err := sw.startInitPeer(p); err != nil {
			return err
		}
	} else {
		sw.Logger.Error("Won't start a peer - switch is not running", "peer", p)
	}

	// Add the peer to .peers.
	// We start it first so that a peer in the list is safe to Stop.
	// It should not err since we already checked peers.Has().
	if err := sw.peers.Add(p); err != nil {
		return err
	}

	sw.Logger.Info("Added peer", "peer", p)
	sw.metrics.Peers.Add(float64(1))

	return nil
}
We know sw.startInitPeer(p) enables every reactor to send/receive messages with the peer.
In the log, our peer received a bad message and disconnected the peer by calling StopPeerForError:
func (sw *Switch) StopPeerForError(peer Peer, reason interface{}) {
	sw.Logger.Error("Stopping peer for error", "peer", peer, "err", reason, "IsPersistent", peer.IsPersistent())
	sw.stopAndRemovePeer(peer, reason)

	if peer.IsPersistent() {
		// FIXME: persistent peers can't be inbound right now.
		// self-reported address for inbound persistent peers
		addr := peer.NodeInfo().NetAddress()
		go sw.reconnectToPeer(addr)
	}
}
which in turn calls PeerSet.Remove:
func (ps *PeerSet) Remove(peer Peer) bool {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	// fmt.Printf("=== debug peerset rm id %v\n", peer.ID())
	item := ps.lookup[peer.ID()]
	if item == nil {
		return false
	}
But the peer has not been added to the PeerSet yet, so Remove removes nothing.
Then func (sw *Switch) addPeer(p Peer) goes on to add the already-stopped peer to the PeerSet.
The next time the peer tries to connect to our node, it is rejected. That is why we see the duplicate ID error.
In the log we also see the "no state" error, and the MConnection goroutine panics.
The reason we get the "no state" error is as follows.
The state is set in ConsensusReactor.AddPeer:
func (conR *ConsensusReactor) AddPeer(peer p2p.Peer) {
	if !conR.IsRunning() {
		return
	}

	// Create peerState for peer
	peerState := NewPeerState(peer).SetLogger(conR.Logger)
	peer.Set(types.PeerStateKey, peerState)

	// Begin routines for this peer.
	go conR.gossipDataRoutine(peer, peerState)
	go conR.gossipVotesRoutine(peer, peerState)
	go conR.queryMaj23Routine(peer, peerState)

	// Send our state to peer.
	// If we're fast_syncing, broadcast a RoundStepMessage later upon SwitchToConsensus().
	if !conR.FastSync() {
		conR.sendNewRoundStepMessage(peer)
	}
}
but the MConnection is started before the consensus reactor's AddPeer runs:
func (sw *Switch) startInitPeer(p Peer) error {
	err := p.Start() // spawn send/recv routines
	if err != nil {
		// Should never happen
		sw.Logger.Error(
			"Error starting peer",
			"err", err,
			"peer", p,
		)
		return err
	}

	for _, reactor := range sw.reactors {
		reactor.AddPeer(p)
	}

	return nil
}
Once the MConnection receives a consensus message before AddPeer has completed, the error occurs.