p2p: intervals when re-dialing persistent peers are not as expected #3519

@cason

The method used for dialing addresses configured as persistent peers is the following:

cometbft/p2p/switch.go

Lines 391 to 438 in b47d18e

func (sw *Switch) reconnectToPeer(addr *NetAddress) {
	if sw.reconnecting.Has(string(addr.ID)) {
		return
	}
	sw.reconnecting.Set(string(addr.ID), addr)
	defer sw.reconnecting.Delete(string(addr.ID))

	start := time.Now()
	sw.Logger.Info("Reconnecting to peer", "addr", addr)
	for i := 0; i < reconnectAttempts; i++ {
		if !sw.IsRunning() {
			return
		}

		err := sw.DialPeerWithAddress(addr)
		if err == nil {
			return // success
		} else if _, ok := err.(ErrCurrentlyDialingOrExistingAddress); ok {
			return
		}

		sw.Logger.Info("Error reconnecting to peer. Trying again", "tries", i, "err", err, "addr", addr)
		// sleep a set amount
		sw.randomSleep(reconnectInterval)
		continue
	}

	sw.Logger.Error("Failed to reconnect to peer. Beginning exponential backoff",
		"addr", addr, "elapsed", time.Since(start))
	for i := 0; i < reconnectBackOffAttempts; i++ {
		if !sw.IsRunning() {
			return
		}

		// sleep an exponentially increasing amount
		sleepIntervalSeconds := math.Pow(reconnectBackOffBaseSeconds, float64(i))
		sw.randomSleep(time.Duration(sleepIntervalSeconds) * time.Second)

		err := sw.DialPeerWithAddress(addr)
		if err == nil {
			return // success
		} else if _, ok := err.(ErrCurrentlyDialingOrExistingAddress); ok {
			return
		}
		sw.Logger.Info("Error reconnecting to peer. Trying again", "tries", i, "err", err, "addr", addr)
	}
	sw.Logger.Error("Failed to reconnect to peer. Giving up", "addr", addr, "elapsed", time.Since(start))
}

According to the comments in the code, and some of our documentation, the total duration of the reconnection attempts should be around 1 day:

cometbft/p2p/switch.go

Lines 23 to 31 in b47d18e

// repeatedly try to reconnect for a few minutes
// ie. 5 * 20 = 100s.
reconnectAttempts = 20
reconnectInterval = 5 * time.Second
// then move into exponential backoff mode for ~1day
// ie. 3**10 = 16hrs.
reconnectBackOffAttempts = 10
reconnectBackOffBaseSeconds = 3

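However, this snippet, derived from the code above, shows that the 30 configured attempts actually span only about 8h14m: https://go.dev/play/p/JWfU6lerps5. The backoff loop sleeps for 3^0 + 3^1 + ... + 3^9 seconds, which adds up to roughly 8h12m, not the 3^10 seconds (about 16h) mentioned in the comment.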

In other words, the implemented behavior does not match the expected behavior.
