statesync: implement p2p state provider#6807
Conversation
Update: I'm stumped with a bug caught in the e2e tests in which the following events happen:
Other things of Note:
Suspects:
Codecov Report
@@ Coverage Diff @@
## master #6807 +/- ##
==========================================
+ Coverage 62.38% 62.66% +0.27%
==========================================
Files 310 310
Lines 40616 40776 +160
==========================================
+ Hits 25340 25552 +212
+ Misses 13476 13410 -66
- Partials 1800 1814 +14
OK, I've managed to resolve this issue and it seems to be passing the ci.toml e2e run. I have seen the occasional failure when a state-synced node restarts and then replays prior blocks; it seems to get stuck in the process. I have a feeling this is somewhat orthogonal, though.
// dispatcher multiplexes concurrent requests by multiple peers for light blocks.
// Only one request per peer can be sent at a time.
// NOTE: It is not the responsibility of the dispatcher to verify the light blocks.
type Dispatcher struct {
Note: we can make this and BlockProvider private if we want to. The reason for making them exported is that they're in an internal folder and we plan to reuse these two structs within the light client. I'm indifferent either way, though.
when we go to reuse these in the light client, should they just be moved to the light client instead?
message ParamsResponse {
  uint64 height = 1;
  tendermint.types.ConsensusParams consensus_params = 2 [(gogoproto.nullable) = false];
}
The height field here is currently unnecessary because we only ever have a single request per peer, so we already know the height of the consensus params when we receive them.
I also thought about allowing consensus_params to be nullable (like I've done in the light block response) so a node can send a null consensus param to signal that they don't have it at that height.
Is this field meant to represent the height at which those params apply? I.e., if this field is 13, does that mean these were the params at height 13? If that's the case, I think this field would be helpful for debugging, so that clients can easily tell what height they are receiving a response for.
> Is this field meant to represent the height at which those params apply?

Exactly, but even without this field, the state provider knows which height it is expecting the consensus params from the peer to be for, so it should still be able to log it.
tychoish
left a comment
This looks great! I'm super excited for it.
I think we should backport the changes to syncer.go to 0.34. (and maybe,
I also think there are a lot of TODO items in the node.go file that should probably be called out in an issue, as I think most of that construction/orchestration code should move into the statesync package.
I'm also wary of how heavily this relies on the e2e tests for coverage, but I think it's probably fine in the end.
func (s *stateProviderRPC) Commit(ctx context.Context, height uint64) (*types.Commit, error) {
	s.Lock()
	defer s.Unlock()
	header, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height), time.Now())
This code, from what I understand, is fetching a "commit". Commit contains much of the meaningful data about a block, including the BlockID and the set of signatures for the block (line 746 at f47414f).
		nextLightBlock.Height, err)
}
state.ConsensusParams = result.ConsensusParams
state.LastHeightConsensusParamsChanged = currentLightBlock.Height
I'm not sure that this actually represents the last time that the params changed for the blockchain. This line just sets the LastHeightConsensusParamsChanged to the current height regardless of how long ago the blockchain actually changed params. Do we rely on this value to tell us when the blockchain changed?
This is a value local to the node, used as an optimization. Since ConsensusParams hardly ever change, we shouldn't persist them at every height, only when they actually change; all other heights just point to where the last change was, and this field is used for that. Since we're starting a fresh Tendermint instance, we set LastHeightConsensusParamsChanged to the height we start at.
Yes, this is part of the trusted options (tendermint/light/trust_options.go, line 21 at c4df8a3).
I brought this up in the node initialization ADR, but in short I do agree that reactors such as statesync and blocksync, which have a request-response model (and not a gossip-to-everyone model like consensus and mempool), could benefit from a separation between fetching and providing.
I just wanted to share some of my thinking about future work here, as this may help @williambanfield and @creachadair, as my reviewers, get a sense of my intention and, if necessary, steer me towards a better design.

The motivation for the dispatcher abstraction was to find a way to link the p2p model with how the light client uses providers. At some yet-to-be-determined time in the future, we will most likely move a lot of this code into the light client.

As William has already pointed out, the p2p channel model obliges reactors to take responsibility for both fetching and supplying the relevant data over these channels. The light provider implementation will most likely be "request only" (light clients keep very little data, so I don't think they would be valuable as suppliers of light blocks). Hence, the problem is constructing infrastructure both for users running light clients that just want to request and verify the relevant data, and for full nodes that run the light reactor to support light clients and other nodes running state sync. At the moment, the dispatcher encapsulates this "request only" pattern, and the light channel is thus split between supporting the dispatcher's requests and being wired directly to the state and block stores, which is necessary to "supply" light blocks.
internal/statesync/reactor.go
Outdated
- lb, err := r.dispatcher.LightBlock(ctxWithCancel, height, peer)
+ subCtx, cancel := context.WithTimeout(ctxWithCancel, lightBlockResponseTimeout)
+ defer cancel()
+ lb, err := r.dispatcher.LightBlock(subCtx, height, peer)
I agree. My only remaining concern is that a defer inside a loop stacks up state until the enclosing function returns. We can hack that with a closure (destructors, Go style), e.g.,

subctx, cancel := context.WithTimeout(ctx, lightBlockResponseTimeout)
lb, err := func() (*types.LightBlock, error) {
	defer cancel()
	return r.dispatcher.LightBlock(subctx, height, peer)
}()

or by explicitly calling it after the call but before we check the errors. The latter is actually easier, but the attraction of a defer firing even in case of a panic is understandable.
creachadair
left a comment
Most of my remaining comments are optional and cosmetic. I do have a few questions that I think are worth considering, but I don't see anything I think needs to block merging unless the answers are more complicated than I realized.
@@ -884,15 +884,46 @@ func (cfg *MempoolConfig) ValidateBasic() error {

// StateSyncConfig defines the configuration for the Tendermint state sync service
type StateSyncConfig struct {
Thank you, these comments are a really great improvement! 🎉
// with net.Dial, for example: "host.example.com:2125"
RPCServers []string `mapstructure:"rpc-servers"`

// The hash and height of a trusted block. Must be within the trust-period.
When you say "within the trust period" does that mean the block's commit timestamp has to be within that interval before the moment at which we're doing state sync?
Exactly.
block.Time + trustPeriod has to be greater than time.Now()
// one day less than the unbonding period should suffice.
TrustPeriod time.Duration `mapstructure:"trust-period"`

// Time to spend discovering snapshots before initiating a restore.
Does "restore" in this context mean the state sync process? Or is a restore what we do if we can't state sync (e.g., because we didn't find any viable snapshots)?
Restore means the state sync process, or more specifically the process of transferring chunks of state from one application to another. Basically, we wait this long to receive snapshots from peers. We pick the "best" snapshot and then ask peers for the chunks of state that correspond to it (a snapshot is just metadata, i.e. height, format, hashes).
Ah, I see. So the statesync cycle has two phases: Discovery followed by apply, and this duration determines how long we spend in the lobby. I thought this was meant to be a timeout, but I guess it's more of a cutoff? (E.g., we may still have snapshots available, but after this amount of time we'll stop and apply a batch)
What happens if no snapshots are found during the discovery period?
This is just as exciting as everyone's said it is. It will enable teeny tiny tendermints.
Closes: #6491