exponential backoff for addrs in the address book#1292
Conversation
6c09c69 to
21e2c41
Compare
Codecov Report
@@ Coverage Diff @@
## develop #1292 +/- ##
===========================================
- Coverage 60.58% 60.49% -0.09%
===========================================
Files 112 112
Lines 10863 10884 +21
===========================================
+ Hits 6581 6584 +3
- Misses 3644 3657 +13
- Partials 638 643 +5
|
p2p/pex/pex_reactor.go
Outdated
| // Dial picked addresses | ||
| for _, item := range toDial { | ||
| for _, addr := range toDial { | ||
| go func(picked *p2p.NetAddress) { |
There was a problem hiding this comment.
With ensurePeers growing further maybe we can break this function out and make it a private method on PEXReactor as it only seemed to depend on the reactor itself and the *p2p.NetAddress.
p2p/pex/pex_reactor.go
Outdated
| var attempt int | ||
| if lAttempt, notFirstAttempt := r.attemptsToDial.Load(picked.DialString()); notFirstAttempt { | ||
| attempt = lAttempt.(int) | ||
| jitterSeconds := time.Duration(rand.Float64() * billionNs) |
There was a problem hiding this comment.
Isn't billionNs equivalent to float64(time.Second) here?
p2p/pex/pex_reactor.go
Outdated
| } | ||
| // record attempt | ||
| r.attemptsToDial.Store(picked.DialString(), attempt+1) | ||
| } else { |
There was a problem hiding this comment.
how so? there is no return or break
There was a problem hiding this comment.
You are right. I assumed you break out in the if condition.
p2p/pex/pex_reactor.go
Outdated
| go func(picked *p2p.NetAddress) { | ||
| // exponential backoff if it's not our first attempt to dial picked address | ||
| var attempt int | ||
| if lAttempt, notFirstAttempt := r.attemptsToDial.Load(picked.DialString()); notFirstAttempt { |
There was a problem hiding this comment.
Some minor naming nits. Would call attempt, attempts here as we store the amount of times we attempted to connect. And notFirstAttempt -> attempted for more concise and inverse of the meaning.
p2p/pex/pex_reactor.go
Outdated
| } | ||
| }(item) | ||
| r.dialPeer(picked) | ||
| }(addr) |
There was a problem hiding this comment.
Can't we call go r.dialPeer(addr) directly?
There was a problem hiding this comment.
There was a problem hiding this comment.
Here addr is passed to dialPeer and evaluate at each iteration and placed on the stack of the routine, similar to how you pass it to the closure at the moment.
51f71a6 to
1941b5c
Compare
p2p/pex/pex_reactor.go
Outdated
| // exponential backoff if it's not our first attempt to dial given address | ||
| var attempts int | ||
| if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted { | ||
| attempts = lAttempts.(int) |
There was a problem hiding this comment.
Can we test this behaviour? I think this branch isn't covered with our current tests. It might also surface that we want to make the jitter duration configurable in some form.
p2p/pex/pex_reactor.go
Outdated
| var attempts int | ||
| if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted { | ||
| attempts = lAttempts.(int) | ||
| jitterSeconds := time.Duration(rand.Float64() * billionNs) |
There was a problem hiding this comment.
billionNs is the same as float64(time.Second), don't think we need that extra variable, but we might benefit from having it configurable.
ebuchman
left a comment
There was a problem hiding this comment.
Great little PR. Thanks @melekes (and especially for the test cleanup).
A few comments that should be addressed, but otherwise I think it's fine.
One thing to note though is that if we have a lot of peers in the AddrBook, it could be a long time before we retry the same peer, in which case the exponentional backoff may not make sense. So we might want to track a more complex structure that includes the last time we dialed the peer so we can check if enough time has lapsed. Doesn't necessarily need to be in this PR but let's at least open an issue to track that if you're going to merge this.
|
|
||
| // exponential backoff if it's not our first attempt to dial given address | ||
| var attempts int | ||
| if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted { |
There was a problem hiding this comment.
can we just use r.AttemptsToDial() here ?
p2p/pex/pex_reactor.go
Outdated
| attempts = lAttempts.(int) | ||
| jitterSeconds := time.Duration(rand.Float64() * billionNs) | ||
| backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second) | ||
| r.Logger.Debug(fmt.Sprintf("Dialing %v", addr), "attempts", attempts, "backoff_duration", backoffDuration) |
There was a problem hiding this comment.
I think this is a bit misleading. Let's just say Sleeping before dialing and including the addr in the structured log too. Then right before we dial we can log "Dialing" with the addr and attempt again.
p2p/pex/pex_reactor.go
Outdated
|
|
||
| err := r.Switch.DialPeerWithAddress(addr, false) | ||
| if err != nil { | ||
| r.Logger.Error("Dialing failed", "err", err) |
There was a problem hiding this comment.
include addr and attempt in log
p2p/pex/pex_reactor.go
Outdated
| requestsSent *cmn.CMap // ID->struct{}: unanswered send requests | ||
| lastReceivedRequests *cmn.CMap // ID->time.Time: last time peer requested from us | ||
|
|
||
| attemptsToDial sync.Map // dial addr -> number of attempts to dial (for exponential backoff) |
There was a problem hiding this comment.
can we be more specific with types in this comment?
eg. string->int: number of dial attempts per address; for exponential backoff
p2p/pex/pex_reactor.go
Outdated
| jitterSeconds := time.Duration(rand.Float64() * billionNs) | ||
| backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second) | ||
| r.Logger.Debug(fmt.Sprintf("Dialing %v", addr), "attempts", attempts, "backoff_duration", backoffDuration) | ||
| time.Sleep(backoffDuration) |
There was a problem hiding this comment.
Do we have logic that limits the number of attempts per address?
If this backoffDuration gets big, then these go-routines will just hang around for a while. So maybe we should be using a select block here with a ticker so we can break out on Quit ?
ebuchman
left a comment
There was a problem hiding this comment.
Great thanks!
We should update the docs.
I think I'll just merge this and open a new issue to mark a few things :)
| lastDialed = lAttempts.(_attemptsToDial).lastDialed | ||
| } | ||
|
|
||
| if attempts > maxAttemptsToDial { |
There was a problem hiding this comment.
Looks like we do have a maxAttempts mechanism already in place: https://github.com/tendermint/tendermint/blob/develop/p2p/pex/known_address.go#L130
Maybe we can consolidate to make sure they're working right together ? feel free to leave this an just open an issue to note the redundancy
| backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second) | ||
| sinceLastDialed := time.Since(lastDialed) | ||
| if sinceLastDialed < backoffDuration { | ||
| r.Logger.Debug("Too early to dial", "addr", addr, "backoff_duration", backoffDuration, "last_dialed", lastDialed, "time_since", sinceLastDialed) |
There was a problem hiding this comment.
this is cool - it solves the "hanging go-routine" issue but it does mean we might not end up dialing this peer for much longer than the backoff duration. I suppose that's fine
|
See #1304 |
Refs #1125