Skip to content

exponential backoff for addrs in the address book#1292

Merged
ebuchman merged 7 commits intodevelopfrom
1125-exp-backoff-for-ensure-peers
Mar 11, 2018
Merged

exponential backoff for addrs in the address book#1292
ebuchman merged 7 commits intodevelopfrom
1125-exp-backoff-for-ensure-peers

Conversation

@melekes
Copy link
Contributor

@melekes melekes commented Mar 8, 2018

Refs #1125

@melekes melekes requested a review from ebuchman as a code owner March 8, 2018 10:03
@melekes melekes force-pushed the 1125-exp-backoff-for-ensure-peers branch from 6c09c69 to 21e2c41 Compare March 8, 2018 10:04
@codecov-io
Copy link

codecov-io commented Mar 8, 2018

Codecov Report

Merging #1292 into develop will decrease coverage by 0.08%.
The diff coverage is 78.12%.

@@             Coverage Diff             @@
##           develop    #1292      +/-   ##
===========================================
- Coverage    60.58%   60.49%   -0.09%     
===========================================
  Files          112      112              
  Lines        10863    10884      +21     
===========================================
+ Hits          6581     6584       +3     
- Misses        3644     3657      +13     
- Partials       638      643       +5
Impacted Files Coverage Δ
rpc/lib/client/ws_client.go 56.95% <100%> (-0.19%) ⬇️
p2p/pex/addrbook.go 63.82% <100%> (+0.83%) ⬆️
p2p/pex/pex_reactor.go 67.28% <75.86%> (+1.02%) ⬆️
consensus/state.go 76.72% <0%> (-1.63%) ⬇️
blockchain/pool.go 69.64% <0%> (-0.72%) ⬇️
consensus/reactor.go 72.26% <0%> (-0.57%) ⬇️
p2p/pex/file.go 78.04% <0%> (+4.87%) ⬆️

Copy link
Contributor

@xla xla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 :octocat: :shipit: 🍡

// Dial picked addresses
for _, item := range toDial {
for _, addr := range toDial {
go func(picked *p2p.NetAddress) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With ensurePeers growing further maybe we can break this function out and make it a private method on PEXReactor as it only seemed to depend on the reactor itself and the *p2p.NetAddress.

var attempt int
if lAttempt, notFirstAttempt := r.attemptsToDial.Load(picked.DialString()); notFirstAttempt {
attempt = lAttempt.(int)
jitterSeconds := time.Duration(rand.Float64() * billionNs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't billionNs equivalent to float64(time.Second) here?

}
// record attempt
r.attemptsToDial.Store(picked.DialString(), attempt+1)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This else is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how so? there is no return or break

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I assumed you break out in the if condition.

go func(picked *p2p.NetAddress) {
// exponential backoff if it's not our first attempt to dial picked address
var attempt int
if lAttempt, notFirstAttempt := r.attemptsToDial.Load(picked.DialString()); notFirstAttempt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor naming nits. Would call attempt, attempts here as we store the amount of times we attempted to connect. And notFirstAttempt -> attempted for more concise and inverse of the meaning.

}
}(item)
r.dialPeer(picked)
}(addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we call go r.dialPeer(addr) directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@xla xla Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here addr is passed to dialPeer and evaluate at each iteration and placed on the stack of the routine, similar to how you pass it to the closure at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay.. thanks

@melekes melekes force-pushed the 1125-exp-backoff-for-ensure-peers branch from 51f71a6 to 1941b5c Compare March 8, 2018 12:31
// exponential backoff if it's not our first attempt to dial given address
var attempts int
if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted {
attempts = lAttempts.(int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test this behaviour? I think this branch isn't covered with our current tests. It might also surface that we want to make the jitter duration configurable in some form.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about now? f85c889

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful and simple!

@xla xla mentioned this pull request Mar 8, 2018
2 tasks
var attempts int
if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted {
attempts = lAttempts.(int)
jitterSeconds := time.Duration(rand.Float64() * billionNs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

billionNs is the same as float64(time.Second), don't think we need that extra variable, but we might benefit from having it configurable.

Copy link
Contributor

@xla xla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 :octocat: :shipit: 🍡

Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great little PR. Thanks @melekes (and especially for the test cleanup).

A few comments that should be addressed, but otherwise I think it's fine.

One thing to note though is that if we have a lot of peers in the AddrBook, it could be a long time before we retry the same peer, in which case the exponentional backoff may not make sense. So we might want to track a more complex structure that includes the last time we dialed the peer so we can check if enough time has lapsed. Doesn't necessarily need to be in this PR but let's at least open an issue to track that if you're going to merge this.


// exponential backoff if it's not our first attempt to dial given address
var attempts int
if lAttempts, attempted := r.attemptsToDial.Load(addr.DialString()); attempted {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just use r.AttemptsToDial() here ?

attempts = lAttempts.(int)
jitterSeconds := time.Duration(rand.Float64() * billionNs)
backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second)
r.Logger.Debug(fmt.Sprintf("Dialing %v", addr), "attempts", attempts, "backoff_duration", backoffDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit misleading. Let's just say Sleeping before dialing and including the addr in the structured log too. Then right before we dial we can log "Dialing" with the addr and attempt again.


err := r.Switch.DialPeerWithAddress(addr, false)
if err != nil {
r.Logger.Error("Dialing failed", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include addr and attempt in log

requestsSent *cmn.CMap // ID->struct{}: unanswered send requests
lastReceivedRequests *cmn.CMap // ID->time.Time: last time peer requested from us

attemptsToDial sync.Map // dial addr -> number of attempts to dial (for exponential backoff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more specific with types in this comment?

eg. string->int: number of dial attempts per address; for exponential backoff

jitterSeconds := time.Duration(rand.Float64() * billionNs)
backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second)
r.Logger.Debug(fmt.Sprintf("Dialing %v", addr), "attempts", attempts, "backoff_duration", backoffDuration)
time.Sleep(backoffDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have logic that limits the number of attempts per address?

If this backoffDuration gets big, then these go-routines will just hang around for a while. So maybe we should be using a select block here with a ticker so we can break out on Quit ?

@melekes
Copy link
Contributor Author

melekes commented Mar 11, 2018

@ebuchman thanks for suggestion. implemented in 264bce4. PTAL

Copy link
Contributor

@ebuchman ebuchman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks!

We should update the docs.

I think I'll just merge this and open a new issue to mark a few things :)

lastDialed = lAttempts.(_attemptsToDial).lastDialed
}

if attempts > maxAttemptsToDial {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we do have a maxAttempts mechanism already in place: https://github.com/tendermint/tendermint/blob/develop/p2p/pex/known_address.go#L130

Maybe we can consolidate to make sure they're working right together ? feel free to leave this an just open an issue to note the redundancy

backoffDuration := jitterSeconds + ((1 << uint(attempts)) * time.Second)
sinceLastDialed := time.Since(lastDialed)
if sinceLastDialed < backoffDuration {
r.Logger.Debug("Too early to dial", "addr", addr, "backoff_duration", backoffDuration, "last_dialed", lastDialed, "time_since", sinceLastDialed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cool - it solves the "hanging go-routine" issue but it does mean we might not end up dialing this peer for much longer than the backoff duration. I suppose that's fine

@ebuchman ebuchman merged commit f3000d0 into develop Mar 11, 2018
@ebuchman ebuchman mentioned this pull request Mar 11, 2018
3 tasks
@ebuchman
Copy link
Contributor

See #1304

@ebuchman ebuchman deleted the 1125-exp-backoff-for-ensure-peers branch March 11, 2018 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants