Skip to content

PeerSet contains dirty peer data cause reconnection failed #3304

@unclezoro

Description

@unclezoro

Description:
The topology is:
A seed node connected to serval validator nodes(Exactly two nodes).
Two validator nodes is persistent node of seed node.
After running for serval minutes, seed nodes lose all connection to validator, and try to reconnect,
but all failed, and seed node consensus no blocks any more.

Steps To Reproduce:

  1. Deploy a validator network.
  2. Deploy a seed node(seed_mode is true), connected to core network.
  3. Wait for some time or just use iptable to produce network joggle.

Supporting Material/References:

We check the validator logs:

I[2019-02-01|13:32:37.737] Starting Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:37.737] Starting MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl=MConn{10.201.41.209:2494}
I[2019-02-01|13:32:37.739] Added peer                                   module=p2p peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
E[2019-02-01|13:32:38.125] Stopping peer for error                      module=p2p peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer (77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e) sent next PEX request too soon. lastReceived: 2019-02-01 13:32:38.016135243 +0000 UTC m=+41.575722596, now: 2019-02-01 13:32:38.125079205 +0000 UTC m=+41.684666557, minInterval: 10s. Disconnecting"
I[2019-02-01|13:32:38.125] Stopping Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.125] Stopping MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@104.193.88.123:27146 impl=MConn{10.201.41.209:2494}
I[2019-02-01|13:32:38.133] Stopping gossipDataRoutine for peer          module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.134] Stopping gossipVotesRoutine for peer         module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.930] Starting Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.139:27146 impl="Peer{MConn{10.201.43.54:2695} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:38.930] Starting MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.139:27146 impl=MConn{10.201.43.54:2695}
I[2019-02-01|13:32:39.157] Added peer                                   module=p2p peer="Peer{MConn{10.201.43.54:2695} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:32:39.159] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:32:39.737] Stopping queryMaj23Routine for peer          module=consensus peer="Peer{MConn{10.201.41.209:2494} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:06.788] Dialing peer                                 module=p2p address=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@52.196.133.127:27146
E[2019-02-01|13:33:06.789] Dialing failed                               module=pex addr=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@52.196.133.127:27146 err="auth failure: secrect conn failed: EOF" attempts=0
I[2019-02-01|13:33:09.127] Starting Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.127] Starting MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
I[2019-02-01|13:33:09.227] Stopping MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
E[2019-02-01|13:33:09.227] Stopping peer for error                      module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in} has no state"
I[2019-02-01|13:33:09.227] Stopping Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.402] Added peer                                   module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping queryMaj23Routine for peer          module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping gossipVotesRoutine for peer         module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.403] Stopping gossipDataRoutine for peer          module=consensus peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.570] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:09.836] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.103] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.367] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:10.639] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.178] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.448] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.719] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:11.982] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.245] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.288] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.509] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:12.784] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e
I[2019-02-01|13:33:13.046] Ignoring inbound connection: error while adding peer module=p2p err="duplicate ID<77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e>" id=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e

We can see the incoming connection is disconnected by validator since it report duplicate ID.
We can find that the sequence is not correct, Added peer is after Stopping Peer:

I[2019-02-01|13:33:09.127] Starting MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
I[2019-02-01|13:33:09.227] Stopping MConnection                         module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl=MConn{10.201.42.180:2682}
E[2019-02-01|13:33:09.227] Stopping peer for error                      module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}" err="Peer Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in} has no state"
I[2019-02-01|13:33:09.227] Stopping Peer                                module=p2p peer=77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e@119.63.197.151:27146 impl="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"
I[2019-02-01|13:33:09.402] Added peer                                   module=p2p peer="Peer{MConn{10.201.42.180:2682} 77d3f2ee9b5edfd32ffc017e7e8f9e4a1dc1457e in}"

Refer to the code, sw.peers.Add(p) happens after sw.startInitPeer(p):

func (sw *Switch) addPeer(p Peer) error {
	if err := sw.filterPeer(p); err != nil {
		return err
	}

	p.SetLogger(sw.Logger.With("peer", p.NodeInfo().NetAddress()))

	// Handle the shut down case where the switch has stopped but we're
	// concurrently trying to add a peer.
	if sw.IsRunning() {
		// All good. Start peer
		if err := sw.startInitPeer(p); err != nil {
			return err
		}
	} else {
		sw.Logger.Error("Won't start a peer - switch is not running", "peer", p)
	}

	// Add the peer to .peers.
	// We start it first so that a peer in the Alist is safe to Stop.
	// It should not err since we already checked peers.Has().
	if err := sw.peers.Add(p); err != nil {
		return err
	}

	sw.Logger.Info("Added peer", "peer", p)
	sw.metrics.Peers.Add(float64(1))
	return nil

We know sw.startInitPeer(p) will enable every reactor to send/receive message with peer.
In the log our peer get wrong message and disconnect the peer by callling StopPeerForError:

func(sw*Switch)StopPeerForError(peerPeer,reasoninterface{}){
sw.Logger.Error("Stoppingpeerforerror","peer",peer,"err",reason,"IsPersistent",peer.IsPersistent())
sw.stopAndRemovePeer(peer,reason)

ifpeer.IsPersistent(){
//FIXME:persistentpeerscan'tbeinboundrightnow.
//self-reportedaddressforinboundpersistentpeers
addr:=peer.NodeInfo().NetAddress()
gosw.reconnectToPeer(addr)
}
}

Which will

func(ps*PeerSet)Remove(peerPeer)bool{
ps.mtx.Lock()
deferps.mtx.Unlock()
//fmt.Printf("===debugpeersetrmid%v\n",peer.ID())
item:=ps.lookup[peer.ID()]
ifitem==nil{
returnfalse
}

But the peer is still not added to PeerSet, so it remove nothing.

Then in the func (sw *Switch) addPeer(p Peer) will add the peer to peerset.
Nexttime the peer try to connect to our peer, it will be rejected. That is why we see duplicate ID error

In the log, we see no state error, and the mconnection goroutine panic.
The reason why get "no state" error:
The state is set:

func (conR *ConsensusReactor) AddPeer(peer p2p.Peer) {
	if !conR.IsRunning() {
		return
	}

	// Create peerState for peer
	peerState := NewPeerState(peer).SetLogger(conR.Logger)
	peer.Set(types.PeerStateKey, peerState)

	// Begin routines for this peer.
	go conR.gossipDataRoutine(peer, peerState)
	go conR.gossipVotesRoutine(peer, peerState)
	go conR.queryMaj23Routine(peer, peerState)

	// Send our state to peer.
	// If we're fast_syncing, broadcast a RoundStepMessage later upon SwitchToConsensus().
	if !conR.FastSync() {
		conR.sendNewRoundStepMessage(peer)
	}
}

but mconnection is starting before consensus reactor staring:

func (sw *Switch) startInitPeer(p Peer) error {
	err := p.Start() // spawn send/recv routines
	if err != nil {
		// Should never happen
		sw.Logger.Error(
			"Error starting peer",
			"err", err,
			"peer", p,
		)
		return err
	}

	for _, reactor := range sw.reactors {
		reactor.AddPeer(p)
	}

	return nil
}

Onec mconnection get consensus message and AddPeer is not completed, error happened

Metadata

Metadata

Assignees

No one assigned

    Labels

    C:p2pComponent: P2P pkgT:bugType Bug (Confirmed)

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions