Skip to content

p2p: "peer did not send us anything" in fast_sync mode when under high pressure  #3457

@unclezoro

Description

@unclezoro

Reproduce step:

  1. To make huge pressure to rpc api, first change tendermint code, just increase the lock time so that a small scale of query request can make ApplyBlock a great delay:
    image

  2. Write a shell script, and run it on local node:

while true; do
time  curl 127.0.0.1:26657/{abci-query-for something} &
sleep 0.2
done
  1. Restart node and wait for about 2 or 3 minutes, we will see errors:
    image

To find the root cause, we modified the source code:

  1. Give BlockRequest new field Time
type BlockRequest struct {
	Time   time.Time //add to test
	Height int64
	PeerID p2p.ID
}

and init it with time.Now() when send it to pool.requestsCh

func (pool *BlockPool) sendRequest(height int64, peerID p2p.ID) {
	if !pool.IsRunning() {
		return
	}
	pool.requestsCh <- BlockRequest{time.Now(),height, peerID}
}
  1. In poolRoutine() of BlockchainReactor , we compare the create time and current time.
    image

  2. Then we check the the peer that seems to be sent us nothing, find the block heigh that it is expected to receive but get nothing, we find that the request is still not send out !!! Generate at
    19min42second, but pop from channel at 20min10second.
    image

  3. Just check the code of poolRoutine. we find that trySyncTicker is trigger in every 10ms which will send didProcessCh an object. When process a block takes more that 10ms, didProcessCh alway get an object so that get event from requestsCh will delay.

I did a simple test:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Event struct {
	Time time.Time
}

func main() {
	eventChannel := make(chan Event,600)
	didProcessCh := make(chan struct{}, 1)
	trySyncTicker := time.NewTicker(10 * time.Millisecond)
	go func() {
		for{
			t:=rand.Intn(1000)
			time.Sleep(time.Duration(t)*time.Millisecond)
			eventChannel<-Event{time.Now()}
		}
	}()

	for {
		select {
		case e:=<-eventChannel:
			end:=time.Now()
			d:=end.Sub(e.Time)
			fmt.Println(d)
		case <-trySyncTicker.C: // chan time
			select {
			case didProcessCh <- struct{}{}:
			default:
			}
		case <-didProcessCh:
				// We need both to sync the first block.
				// Try again quickly next loop.
				time.Sleep(2*time.Second)
				didProcessCh <- struct{}{}
		}
	}
}

The result is

1.929595741s
3.046203077s
2.199312363s
8.143827024s
12.067358858s
11.749071705s
11.321399465s
10.779326276s
12.323167204s
14.024024162s
13.327849443s
12.816276574s
16.655774867s
16.563461207s
15.832692446s
15.554935313s
21.34267714s
20.893256979s
26.66588s
28.562135556s
34.070507764s
33.602116951s
35.070530314s
34.812306947s
36.765648747s

Metadata

Metadata

Assignees

No one assigned

    Labels

    C:p2pComponent: P2P pkgT:bugType Bug (Confirmed)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions