
statesync: implement p2p state provider #6807

Merged
cmwaters merged 38 commits into master from callum/p2p-provider
Sep 2, 2021

Conversation


@cmwaters cmwaters commented Aug 9, 2021

Closes: #6491

@cmwaters
Contributor Author

Update: I'm stumped with a bug caught in the e2e tests in which the following events happen:

  1. NodeA starts and after 5 blocks NodeB starts with state sync turned on
  2. NodeB connects to NodeA and initializes the p2p state provider
  3. The P2P state provider (which wraps a light client) requests on start up a light block at the NodeB's trusted height to confirm that the NodeA's block has the same hash
  4. NodeA responds with the light block, NodeB validates it and the p2p state provider is now initialized and ready to go
  5. NodeB inits the syncer, this sends out a snapshot request to NodeA.
  6. NodeA responds with a valid snapshot
  7. NodeB receives the snapshot and goes to add it to the pool. Before doing that, it uses the stateprovider to fetch the apphash at the height of the snapshot
  8. The state provider sends a request at that height and awaits a response
  9. NodeA receives the light block request at that height and returns the light block
  10. NodeB never receives the light block. It halts and eventually times out, rejecting the snapshot and sending a new snapshot request, but it is never able to receive a light block from NodeA from which to get the app hash.

Other things of note:

  • The priority of the light block channel is higher than any other channel so messages shouldn't be dropped
  • This works using the RPC state provider. When we go to backfill the required number of blocks, this functionality still works, which tells us that the dispatcher responsible for requesting and receiving light blocks works fine.
  • I have written a unit test that does this entire sync operation using the p2p provider and it passes without blocking
  • I have added a ton of extra logs to help deduce what is happening

Suspects:

  • There is a fair bit of concurrency in the state sync reactor, as well as the use of mutexes. It's feasible then that something somewhere is deadlocking.
  • It's hard to imagine the problem arising outside of the statesync reactor, because everything was working fine (in the p2p layer) beforehand

@codecov

codecov bot commented Aug 27, 2021

Codecov Report

Merging #6807 (590d021) into master (511bd3e) will increase coverage by 0.27%.
The diff coverage is 47.64%.

@@            Coverage Diff             @@
##           master    #6807      +/-   ##
==========================================
+ Coverage   62.38%   62.66%   +0.27%     
==========================================
  Files         310      310              
  Lines       40616    40776     +160     
==========================================
+ Hits        25340    25552     +212     
+ Misses      13476    13410      -66     
- Partials     1800     1814      +14     
Impacted Files                            Coverage Δ
config/toml.go                            68.33% <ø> (ø)
internal/consensus/reactor.go             70.97% <0.00%> (+4.45%) ⬆️
test/e2e/generator/generate.go             0.00% <0.00%> (ø)
node/node.go                              49.33% <9.09%> (+0.26%) ⬆️
config/config.go                          68.57% <11.11%> (+0.39%) ⬆️
internal/statesync/stateprovider.go       29.14% <18.89%> (+29.14%) ⬆️
proto/tendermint/statesync/message.go     57.95% <23.52%> (-2.61%) ⬇️
internal/statesync/syncer.go              75.61% <24.39%> (-1.61%) ⬇️
internal/test/factory/p2p.go              30.00% <30.00%> (ø)
internal/statesync/reactor.go             72.57% <70.85%> (+13.33%) ⬆️
... and 26 more

@cmwaters cmwaters marked this pull request as ready for review August 27, 2021 13:01
@cmwaters
Contributor Author

Ok, I've managed to resolve this issue and it seems to pass the ci.toml e2e test. I have seen the occasional failure when a state-synced node restarts and then replays prior blocks; it seems to get stuck in the process. I have a feeling this is somewhat orthogonal though.

// dispatcher multiplexes concurrent requests by multiple peers for light blocks.
// Only one request per peer can be sent at a time
// NOTE: It is not the responsibility of the dispatcher to verify the light blocks.
type Dispatcher struct {
Contributor Author


Note: We can make this and BlockProvider private if we want to. The reason for making them exported is that they're in an internal folder and we plan to reuse these two structs within the light client. I'm indifferent either way, though.

Contributor


when we go to reuse these in the light client, should they just be moved to the light client instead?

Comment on lines +59 to 62
message ParamsResponse {
uint64 height = 1;
tendermint.types.ConsensusParams consensus_params = 2 [(gogoproto.nullable) = false];
}
Contributor Author


The height field here is currently unnecessary: we only ever have a single request per peer, so we already know the height of the consensus params when we receive them.

Contributor Author


I also thought about allowing consensus_params to be nullable (like I've done in the light block response) so a node can send a null consensus param to signal that they don't have it at that height.

Contributor


Is this field meant to represent the height at which those params apply? I.e., if this field is 13, these were the params at height 13? If that's the case, I think this field would be helpful for debugging, so that clients can easily tell what height they are receiving a response for.

Contributor Author


Is this field meant to represent the height at which those params apply?

Exactly, but even without this field the state provider knows which height it expects the peer's consensus params to be for, so it should still be able to log it.

Contributor

@tychoish tychoish left a comment


This looks great! I'm super excited for it.

I think we should backport the changes to syncer.go to 0.34. (and maybe,

I also think there are a lot of TODO items in the node.go file that should probably be called out in an issue as I think most of that construction/orchestration code should move into the statesync package.

I'm also wary of the ways that this relies so heavily on the e2e tests for coverage, but I think it's probably fine in the end.

func (s *stateProviderRPC) Commit(ctx context.Context, height uint64) (*types.Commit, error) {
s.Lock()
defer s.Unlock()
header, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height), time.Now())
Contributor


This code, from what I understand, is fetching a "commit". Commit contains much of the meaningful data about a block including the BlockID and set of signatures for the block:

type Commit struct {

nextLightBlock.Height, err)
}
state.ConsensusParams = result.ConsensusParams
state.LastHeightConsensusParamsChanged = currentLightBlock.Height
Contributor


I'm not sure that this actually represents the last time that the params changed for the blockchain. This line just sets the LastHeightConsensusParamsChanged to the current height regardless of how long ago the blockchain actually changed params. Do we rely on this value to tell us when the blockchain changed?

Contributor Author


This is a local value used by the node as an optimization. Since ConsensusParams hardly ever change, we shouldn't persist them at every height, only when they actually change; all other heights just point to where the last change was. This field is used for that. Since we're starting a fresh Tendermint instance, we set LastHeightConsensusParamsChanged to the height we start at.
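The pointer-back-to-last-change scheme described above can be sketched like this. The types here are hypothetical stand-ins for illustration, not the real state store:

```go
package main

import "fmt"

// paramsRecord mimics how consensus params are persisted: a height either
// stores the params that changed there, or points back to the last height
// at which they changed.
type paramsRecord struct {
	LastHeightChanged int64
	Params            string // stand-in for types.ConsensusParams
}

type paramsStore map[int64]paramsRecord

// paramsAt resolves the params for height h by following the
// LastHeightChanged pointer, so unchanged heights store no full copy.
func paramsAt(s paramsStore, h int64) string {
	rec := s[h]
	if rec.LastHeightChanged != h {
		rec = s[rec.LastHeightChanged]
	}
	return rec.Params
}

func main() {
	s := paramsStore{
		1: {LastHeightChanged: 1, Params: "v1"},
		2: {LastHeightChanged: 1},
		3: {LastHeightChanged: 1},
		4: {LastHeightChanged: 4, Params: "v2"}, // params changed at height 4
		5: {LastHeightChanged: 4},
	}
	fmt.Println(paramsAt(s, 3)) // v1
	fmt.Println(paramsAt(s, 5)) // v2
}
```

A freshly state-synced node has no earlier records to point back to, which is why it sets the marker to its starting height.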

@cmwaters
Contributor Author

Does node B know the hash in advance? Where does it get the information for what the hash at the trusted height should be?

Yes, this is part of the trust options:

type TrustOptions struct {

My main feeling is that the abstraction we use for p2p (the channels) makes it a bit difficult for us to easily separate functionality and responsibility where a peer may be involved. Notably, I'm not totally clear on why a piece of code that is responsible for fetching application state also needs to be responsible for providing application state to other peers that may want it. I'm not sure that it makes sense to update all of this right now, but we should endeavor to separate the application functionality as much as possible whenever we use these p2p channels.

I brought this up in the node initialization ADR, but in short I do agree that reactors such as statesync and blocksync, which have a request-response model (rather than a gossip-to-everyone model like consensus and mempool), could benefit from a separation between fetching and providing.

@cmwaters
Contributor Author

I just wanted to share some of my thinking about future work here, as this may be helpful to @williambanfield and @creachadair, as my reviewers, to get a sense of my intention and, if necessary, help steer me to a better design. The motivation for having this dispatcher abstraction was to find a way to link the p2p model with how the light client uses providers. At some yet-to-be-determined time in the future, we will most likely move a lot of this code into light/provider/p2p, which will be its own reactor that can be used to support the light client. The statesync reactor will then call upon this new reactor to implement StateProvider and to fetch light blocks for the Backfill method. The more I've delved into this, the more incongruences I've come across.

As William has already pointed out, the p2p channel model obliges reactors to take responsibility for both fetching and supplying the relevant data over these channels. The light provider implementation will most likely be "request only" (light clients keep very little data, so I don't think they would be valuable as suppliers of light blocks). Hence, the problem is constructing infrastructure both for users running light clients that just want to request and verify the relevant data, and for full nodes that run the light reactor to support light clients and other nodes running state sync.

At the moment, the dispatcher encapsulates this "request only" pattern, and the light channel is thus split between supporting the dispatcher's requests and being wired directly to the state and block stores, which is necessary to "supply" light blocks.
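The "one request per peer" bookkeeping that the dispatcher's doc comment describes can be sketched as a pending map keyed by peer, with one reply channel per outstanding request. This is an illustrative sketch, not the real dispatcher; lightBlock and peerID stand in for types.LightBlock and p2p peer IDs:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type lightBlock struct{ Height int64 }
type peerID string

// dispatcher multiplexes light-block requests, allowing at most one
// outstanding request per peer.
type dispatcher struct {
	mtx     sync.Mutex
	pending map[peerID]chan lightBlock
}

func newDispatcher() *dispatcher {
	return &dispatcher{pending: make(map[peerID]chan lightBlock)}
}

// request registers a pending request for peer, returning an error if
// one is already outstanding.
func (d *dispatcher) request(peer peerID) (<-chan lightBlock, error) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if _, ok := d.pending[peer]; ok {
		return nil, errors.New("request already outstanding for peer")
	}
	ch := make(chan lightBlock, 1) // buffered so respond never blocks
	d.pending[peer] = ch
	return ch, nil
}

// respond routes an incoming light block to the waiting caller, if any.
func (d *dispatcher) respond(peer peerID, lb lightBlock) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if ch, ok := d.pending[peer]; ok {
		ch <- lb
		delete(d.pending, peer)
	}
}

func main() {
	d := newDispatcher()
	ch, _ := d.request("peerA")
	if _, err := d.request("peerA"); err != nil {
		fmt.Println("second request rejected:", err)
	}
	d.respond("peerA", lightBlock{Height: 42})
	fmt.Println("got light block at height", (<-ch).Height)
}
```

Holding the mutex only while touching the map, and using a buffered reply channel, keeps `respond` from ever blocking while the lock is held, which is one way the deadlock risk mentioned earlier can be contained.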

lb, err := r.dispatcher.LightBlock(ctxWithCancel, height, peer)
subCtx, cancel := context.WithTimeout(ctxWithCancel, lightBlockResponseTimeout)
defer cancel()
lb, err := r.dispatcher.LightBlock(subCtx, height, peer)
Contributor


love this change!


I agree. My only remaining concern is that a defer inside a loop stacks up state until the enclosing function returns. We can hack that with a closure (destructors, Go style), e.g.,

subctx, cancel := context.WithTimeout(ctx, lightBlockResponseTimeout)
lb, err := func() (*types.LightBlock, error) {
  defer cancel()
  return r.dispatcher.LightBlock(subctx, height, peer)
}()

or by explicitly calling cancel after the call but before we check the errors. The latter is actually easier, but the attraction of a defer, which runs even in case of a panic, is understandable.


@creachadair creachadair left a comment


Most of my remaining comments are optional and cosmetic. I do have a few questions that I think are worth considering, but I don't see anything I think needs to block merging unless the answers are more complicated than I realized.

@@ -884,15 +884,46 @@ func (cfg *MempoolConfig) ValidateBasic() error {

// StateSyncConfig defines the configuration for the Tendermint state sync service
type StateSyncConfig struct {


Thank you, these comments are a really great improvement! 🎉

// with net.Dial, for example: "host.example.com:2125"
RPCServers []string `mapstructure:"rpc-servers"`

// The hash and height of a trusted block. Must be within the trust-period.


When you say "within the trust period" does that mean the block's commit timestamp has to be within that interval before the moment at which we're doing state sync?

Contributor Author


Exactly.

block.Time + trustPeriod has to be greater than time.Now()

// one day less than the unbonding period should suffice.
TrustPeriod time.Duration `mapstructure:"trust-period"`

// Time to spend discovering snapshots before initiating a restore.


Does "restore" in this context mean the state sync process? Or is a restore what we do if we can't state sync (e.g., because we didn't find any viable snapshots)?

Contributor Author


Restore means the state sync process, or specifically the process of transferring chunks of state from one application to another. Basically, we wait this long to receive snapshots from peers, pick the "best" snapshot, and then ask peers for the chunks of state corresponding to that snapshot (a snapshot is just metadata, i.e. height, format, hashes).


@creachadair creachadair Sep 1, 2021


Ah, I see. So the statesync cycle has two phases: Discovery followed by apply, and this duration determines how long we spend in the lobby. I thought this was meant to be a timeout, but I guess it's more of a cutoff? (E.g., we may still have snapshots available, but after this amount of time we'll stop and apply a batch)

What happens if no snapshots are found during the discovery period?
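The two-phase cycle being discussed (a discovery window, then selecting a snapshot to apply) can be sketched as follows. The snapshot shape and the highest-height-first selection rule are illustrative assumptions; the real syncer applies more criteria than this:

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot holds just the metadata peers advertise, per the comment
// above: height, format (hashes omitted here for brevity).
type snapshot struct {
	Height uint64
	Format uint32
}

// bestSnapshot sketches the selection step after the discovery window:
// gather advertised snapshots for a fixed duration, then prefer the
// highest height, breaking ties (arbitrarily here) by newest format.
// An empty result signals that discovery found nothing and the node
// would have to retry another discovery window.
func bestSnapshot(found []snapshot) (snapshot, bool) {
	if len(found) == 0 {
		return snapshot{}, false
	}
	sort.Slice(found, func(i, j int) bool {
		if found[i].Height != found[j].Height {
			return found[i].Height > found[j].Height
		}
		return found[i].Format > found[j].Format
	})
	return found[0], true
}

func main() {
	s, ok := bestSnapshot([]snapshot{
		{Height: 90, Format: 1},
		{Height: 100, Format: 1},
	})
	fmt.Println(ok, s.Height) // true 100
}
```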


@faddat
Contributor

faddat commented Sep 1, 2021

This is just as exciting as everyone's said it is. It will enable teeny tiny tendermints.



Development

Successfully merging this pull request may close these issues.

statesync: use p2p state provider as default
