p2p refactor: state sync reactor #5671

Merged
alexanderbez merged 102 commits into master from bez/p2p-refactor-state-sync-reactor
Dec 9, 2020

Conversation

@alexanderbez
Contributor

@alexanderbez alexanderbez commented Nov 16, 2020

Description

Prepares the state-sync reactor for the newly designed p2p changes per ADR 062.

  • Introduce new p2p types and interfaces.
  • Rework the legacy Service implementation for state-sync: it still fulfills the Service interface, but now uses the p2p Channel semantics.
  • Clean up and refactor tests.
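The p2p Channel semantics mentioned above can be sketched roughly as below. This is a minimal illustration loosely following ADR 062; the type and field names are assumptions for the sketch, not the reactor's actual API.

```go
package main

import "fmt"

// PeerID identifies a peer; in the real stack this is a dedicated type.
type PeerID string

// Envelope pairs a message with its sender or recipient. Field names here
// are illustrative only.
type Envelope struct {
	From    PeerID // sender, set on inbound messages
	To      PeerID // recipient, set on outbound messages
	Message string // stands in for a proto.Message payload
}

// Channel gives a reactor In/Out envelope queues instead of direct Peer
// references, so the reactor never touches the underlying connections.
type Channel struct {
	ID  byte
	In  chan Envelope // inbound messages routed to this reactor
	Out chan Envelope // outbound messages routed out to peers
}

func main() {
	ch := Channel{ID: 0x60, In: make(chan Envelope, 1), Out: make(chan Envelope, 1)}
	ch.In <- Envelope{From: "peer1", Message: "snapshots/request"}
	e := <-ch.In
	fmt.Println(e.From, e.Message) // peer1 snapshots/request
}
```

The point of the indirection is that the reactor only ever sees envelopes, which is what allows the legacy Switch/Peer wiring to be hidden behind a shim.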

ref: #5670

@codecov

codecov bot commented Nov 16, 2020

Codecov Report

Merging #5671 (8ee0d8c) into master (89e908e) will increase coverage by 0.06%.
The diff coverage is 59.34%.

@@            Coverage Diff             @@
##           master    #5671      +/-   ##
==========================================
+ Coverage   60.50%   60.57%   +0.06%     
==========================================
  Files         259      261       +2     
  Lines       23345    23618     +273     
==========================================
+ Hits        14126    14306     +180     
- Misses       7743     7829      +86     
- Partials     1476     1483       +7     
Impacted Files Coverage Δ
p2p/shim.go 49.29% <49.29%> (ø)
statesync/chunks.go 86.36% <50.00%> (-0.18%) ⬇️
statesync/reactor.go 54.58% <50.00%> (+14.45%) ⬆️
proto/tendermint/statesync/message.go 72.34% <72.34%> (ø)
node/node.go 58.02% <76.47%> (+0.15%) ⬆️
p2p/peer.go 59.02% <80.00%> (+2.43%) ⬆️
statesync/snapshots.go 91.59% <87.50%> (ø)
statesync/syncer.go 78.96% <88.88%> (+0.48%) ⬆️
p2p/channel.go 93.75% <93.75%> (ø)
crypto/sr25519/pubkey.go 43.47% <0.00%> (-8.70%) ⬇️
... and 16 more

Contributor

@erikgrinaker erikgrinaker left a comment

Shim looks great overall! Left a few minor comments.

We'll have to look closer at the details here later, but this is great for now. E.g. we need to look into blocking/buffering/dropping and such - the shim is now dropping messages while I think the current P2P stack assumes it will block. We may want to block for now too, to preserve the current system behavior, until we design the router. But let's do a pass later for stuff like this.
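The blocking-vs-dropping distinction above comes down to how a send into a bounded queue is written. A minimal sketch of the two behaviors, using plain Go channels (function names are made up for illustration):

```go
package main

import "fmt"

// sendDrop is non-blocking: if the buffer is full the message is silently
// dropped, which is what the shim currently does.
func sendDrop(ch chan string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false // buffer full: message dropped
	}
}

// sendBlock preserves the legacy P2P behavior: the caller blocks until the
// consumer makes room in the buffer.
func sendBlock(ch chan string, msg string) {
	ch <- msg
}

func main() {
	ch := make(chan string, 1)
	fmt.Println(sendDrop(ch, "a")) // true: buffer had room
	fmt.Println(sendDrop(ch, "b")) // false: dropped, buffer was full
	<-ch                           // consumer drains the buffer
	sendBlock(ch, "c")             // would have blocked while the buffer was full
	fmt.Println(<-ch)              // c
}
```

Blocking applies backpressure to the sender, while dropping trades message loss for liveness; which is appropriate depends on the router design mentioned above.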

Contributor

@melekes melekes left a comment

👍

@erikgrinaker
Contributor

Occurred to me that we'll need better panic handling here too. In this PR, a panic during message processing causes the entire channel processing goroutine to terminate. In the current P2P stack, a panic causes the peer to be disconnected. We should probably disconnect the peer but otherwise continue message processing, similar to how a web server doesn't terminate if an HTTP request panics.

@alexanderbez
Contributor Author

> Occurred to me that we'll need better panic handling here too. In this PR, a panic during message processing causes the entire channel processing goroutine to terminate. In the current P2P stack, a panic causes the peer to be disconnected. We should probably disconnect the peer but otherwise continue message processing, similar to how a web server doesn't terminate if an HTTP request panics.

Agree, but this is dependent on the individual reactors. Specifically, for state-sync, we do not have any explicit panic invocations. However, that doesn't mean a panic can't happen (e.g. a nil pointer dereference). Are you suggesting we add a recover block? If so, we should devise a way where each reactor doesn't have to do this itself (if possible; maybe not).

@erikgrinaker
Contributor

erikgrinaker commented Dec 8, 2020

> Agree, but this is dependent on the individual reactors. Specifically, for state-sync, we do not have any explicit panic invocations. However, that doesn't mean a panic can't happen (e.g. a nil pointer dereference). Are you suggesting we add a recover block? If so, we should devise a way where each reactor doesn't have to do this itself (if possible; maybe not).

Yeah, it follows from using channels that it's the consumer's responsibility to handle panics and other errors during processing. Since we're receiving messages from random people on the Internet here, we need to be resistant to adversarial inputs, and that includes panic recovery.

There's a bunch of different options here, for example:

  1. Move message processing into a handleMessage(from p2p.PeerID, msg proto.Message) error method and do panic recovery there or in the caller (this also reduces rightward code drift and improves error handling).

  2. Add a convenience method like Channel.Process(context.Context, func (Channel, p2p.Envelope) error) error that consumes the channel and calls the callback for each message until the context (or closer or whatever) is cancelled, handling panics and other issues.

  3. Keep running the channel processing method in a loop until the reactor is shut down.

These are just a few ideas off the top of my head, there's probably other good options as well. I think this mostly depends on what's most convenient, understandable, and flexible so it's probably a good idea to experiment with a few approaches and see what works best.

@alexanderbez
Contributor Author

I went with something that is more like (1), because I do not want to make any premature abstractions or optimizations without knowing what all the reactors will look like. This is something we can clean up in the future in, most likely, a non-breaking manner.

So now we have a simple handleMessage that performs panic recovery and then dispatches the message accordingly.
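The shape of a handleMessage with built-in panic recovery is roughly the following. This is a hedged sketch with simplified types and a made-up message dispatch, not the reactor's actual code:

```go
package main

import "fmt"

// Envelope is a simplified stand-in for the p2p envelope type.
type Envelope struct {
	From    string
	Message string
}

// handleMessage dispatches a single envelope. The deferred recover converts
// a panic during processing into an ordinary error, so the channel
// processing goroutine survives adversarial or malformed input.
func handleMessage(e Envelope) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic while processing message from %s: %v", e.From, r)
		}
	}()

	switch e.Message {
	case "snapshots/request":
		return nil // handle the request here
	default:
		// A nil-pointer dereference or explicit panic lands in the recover above.
		panic(fmt.Sprintf("unknown message type %q", e.Message))
	}
}

func main() {
	fmt.Println(handleMessage(Envelope{From: "peer1", Message: "snapshots/request"})) // <nil>
	fmt.Println(handleMessage(Envelope{From: "peer2", Message: "garbage"}))           // recovered error
}
```

Because recovery happens inside handleMessage, the caller's receive loop can log the error (and eventually disconnect the peer) and keep processing subsequent envelopes.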

Contributor

@erikgrinaker erikgrinaker left a comment

Looks great! A couple of nits, otherwise I think this is good to go. 🚀

As we've discussed, the reactor ergonomics with panic handling and such isn't optimal, let's revisit this later. Also, there are probably still potential deadlocks, halts, and impedance mismatches with the current P2P stack, but we've covered most things that I can think of -- we should be on the lookout for these sorts of issues as we do the remaining reactors.

@alexanderbez alexanderbez merged commit a879eb4 into master Dec 9, 2020
@alexanderbez alexanderbez deleted the bez/p2p-refactor-state-sync-reactor branch December 9, 2020 14:31
@alexanderbez alexanderbez mentioned this pull request Dec 9, 2020

Labels

C:p2p Component: P2P pkg

4 participants