
statesync: implement p2p state provider #6807

Merged
cmwaters merged 38 commits into master from callum/p2p-provider
Sep 2, 2021

Conversation


@cmwaters cmwaters commented Aug 9, 2021

Closes: #6491

@cmwaters
Contributor Author

Update: I'm stumped with a bug caught in the e2e tests in which the following events happen:

  1. NodeA starts and after 5 blocks NodeB starts with state sync turned on
  2. NodeB connects to NodeA and initializes the p2p state provider
  3. The P2P state provider (which wraps a light client) requests on start up a light block at the NodeB's trusted height to confirm that the NodeA's block has the same hash
  4. NodeA responds with the light block, NodeB validates it and the p2p state provider is now initialized and ready to go
  5. NodeB inits the syncer, this sends out a snapshot request to NodeA.
  6. NodeA responds with a valid snapshot
  7. NodeB receives the snapshot and goes to add it to the pool. Before doing that, it uses the stateprovider to fetch the apphash at the height of the snapshot
  8. The state provider sends a request at that height and awaits a response
  9. NodeA receives the light block request at that height and returns the light block
  10. NodeB never receives the light block. It halts and eventually times out, rejecting the snapshot and sending a new snapshot request, but it is never able to receive a light block from NodeA from which to get the app hash.

Other things of note:

  • The priority of the light block channel is higher than any other channel so messages shouldn't be dropped
  • This works using the RPC state provider. When we go to backfill the required number of blocks, this functionality still works, which tells us that the dispatcher responsible for requesting and receiving light blocks works fine.
  • I have written a unit test that does this entire sync operation using the p2p provider and it passes without blocking
  • I have added a ton of extra logs to help deduce what is happening

Suspects:

  • There is a fair bit of concurrency in the state sync reactor, as well as the use of mutexes. It's feasible then that something somewhere is deadlocking.
  • It's hard to imagine the problem arising outside of the statesync reactor, because everything was working fine (in the p2p layer) beforehand

@codecov

codecov bot commented Aug 27, 2021

Codecov Report

Merging #6807 (590d021) into master (511bd3e) will increase coverage by 0.27%.
The diff coverage is 47.64%.

@@            Coverage Diff             @@
##           master    #6807      +/-   ##
==========================================
+ Coverage   62.38%   62.66%   +0.27%     
==========================================
  Files         310      310              
  Lines       40616    40776     +160     
==========================================
+ Hits        25340    25552     +212     
+ Misses      13476    13410      -66     
- Partials     1800     1814      +14     
Impacted Files                            Coverage Δ
config/toml.go                            68.33% <ø> (ø)
internal/consensus/reactor.go             70.97% <0.00%> (+4.45%) ⬆️
test/e2e/generator/generate.go             0.00% <0.00%> (ø)
node/node.go                              49.33% <9.09%> (+0.26%) ⬆️
config/config.go                          68.57% <11.11%> (+0.39%) ⬆️
internal/statesync/stateprovider.go       29.14% <18.89%> (+29.14%) ⬆️
proto/tendermint/statesync/message.go     57.95% <23.52%> (-2.61%) ⬇️
internal/statesync/syncer.go              75.61% <24.39%> (-1.61%) ⬇️
internal/test/factory/p2p.go              30.00% <30.00%> (ø)
internal/statesync/reactor.go             72.57% <70.85%> (+13.33%) ⬆️
... and 26 more

@cmwaters cmwaters marked this pull request as ready for review August 27, 2021 13:01
@cmwaters
Contributor Author

Ok, I've managed to resolve this issue and it seems to pass the ci.toml e2e test. I have seen the occasional failure when a state-synced node restarts and then replays prior blocks; it seems to get stuck in the process. I have a feeling this is somewhat orthogonal though.

// dispatcher multiplexes concurrent requests by multiple peers for light blocks.
// Only one request per peer can be sent at a time
// NOTE: It is not the responsibility of the dispatcher to verify the light blocks.
type Dispatcher struct {
Contributor Author


Note: We can make this and BlockProvider private if we want to. The reason for making them exported is that they're in an internal folder and we plan to reuse these two structs within the light client. I'm indifferent either way, though.

Contributor


when we go to reuse these in the light client, should they just be moved to the light client instead?

Comment on lines +59 to 62
message ParamsResponse {
uint64 height = 1;
tendermint.types.ConsensusParams consensus_params = 2 [(gogoproto.nullable) = false];
}
Contributor Author


The height field here is currently unnecessary: we only ever have a single request per peer, so we already know the height of the consensus params when we receive them.

Contributor Author


I also thought about allowing consensus_params to be nullable (like I've done in the light block response) so a node can send a null consensus param to signal that they don't have it at that height.

Contributor


Is this field meant to represent the height at which those params apply? I.e., if this field is 13, these were the params at height 13? If that's the case, I think this field would be helpful for debugging, so that clients can easily tell what height they are receiving a response for.

Contributor Author


Is this field meant to represent the height at which those params apply?

Exactly, but even without this field the state provider knows which height it expects the peer's consensus params to be for, so it should still be able to log it.

Contributor

@tychoish tychoish left a comment


This looks great! I'm super excited for it.

I think we should backport the changes to syncer.go to 0.34. (and maybe,

I also think there are a lot of TODO items in the node.go file that should probably be called out in an issue as I think most of that construction/orchestration code should move into the statesync package.

I'm also wary of the ways that this relies so heavily on the e2e tests for coverage, but I think it's probably fine in the end.

func (s *stateProviderRPC) Commit(ctx context.Context, height uint64) (*types.Commit, error) {
s.Lock()
defer s.Unlock()
header, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height), time.Now())
Contributor


This code, from what I understand, is fetching a "commit". Commit contains much of the meaningful data about a block including the BlockID and set of signatures for the block:

type Commit struct {

nextLightBlock.Height, err)
}
state.ConsensusParams = result.ConsensusParams
state.LastHeightConsensusParamsChanged = currentLightBlock.Height
Contributor


I'm not sure that this actually represents the last time that the params changed for the blockchain. This line just sets the LastHeightConsensusParamsChanged to the current height regardless of how long ago the blockchain actually changed params. Do we rely on this value to tell us when the blockchain changed?

Contributor Author


This is a local value used by the node as an optimization. Since ConsensusParams hardly ever change, we shouldn't persist them at every height, only when they actually change; all other heights just point to where the last change was. This field is used for that. Since we're starting a fresh Tendermint instance, we set LastHeightConsensusParamsChanged to the height we start at.
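The pointer-back-to-last-change scheme described above can be sketched like this. The types here are hypothetical stand-ins for illustration, not the real state store:

```go
package main

import "fmt"

// paramsRecord mimics how consensus params are persisted: a height either
// stores the params that changed there, or points back to the last height
// at which they changed.
type paramsRecord struct {
	LastHeightChanged int64
	Params            string // stand-in for types.ConsensusParams
}

type paramsStore map[int64]paramsRecord

// paramsAt resolves the params for height h by following the
// LastHeightChanged pointer, so unchanged heights store no full copy.
func paramsAt(s paramsStore, h int64) string {
	rec := s[h]
	if rec.LastHeightChanged != h {
		rec = s[rec.LastHeightChanged]
	}
	return rec.Params
}

func main() {
	s := paramsStore{
		1: {LastHeightChanged: 1, Params: "v1"},
		2: {LastHeightChanged: 1},
		3: {LastHeightChanged: 1},
		4: {LastHeightChanged: 4, Params: "v2"}, // params changed at height 4
		5: {LastHeightChanged: 4},
	}
	fmt.Println(paramsAt(s, 3)) // v1
	fmt.Println(paramsAt(s, 5)) // v2
}
```

A freshly state-synced node has no earlier records to point back to, which is why it sets the marker to its starting height.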

@cmwaters
Contributor Author

Does node B know the hash in advance? Where does it get the information for what the hash at the trusted height should be?

Yes, this is part of the trust options:

type TrustOptions struct {

My main feeling is that the abstraction we use for p2p (the channels) makes it a bit difficult for us to easily separate functionality and responsibility where a peer may be involved. Notably, I'm not totally clear on why a piece of code that is responsible for fetching application state also needs to be responsible for providing application state to other peers that may want it. I'm not sure that it makes sense to update all of this right now, but we should endeavor to separate the application functionality as much as possible whenever we use these p2p channels.

I brought this up in the node initialization ADR, but in short I do agree that reactors such as statesync and blocksync, which have a request-response model (rather than a gossip-to-everyone model like consensus and mempool), could benefit from a separation between fetching and providing.

@cmwaters
Contributor Author

I just wanted to share some of my thinking about future work here, as this may be helpful to @williambanfield and @creachadair, as my reviewers, to get a sense of my intention and, if necessary, help steer me to a better design. The motivation for having this dispatcher abstraction was to find a way to link the p2p model with how the light client uses providers. At some yet-to-be-determined time in the future, we will most likely move a lot of this code into light/provider/p2p, which will be its own reactor that can be used to support the light client. The statesync reactor will then call upon this new reactor to implement StateProvider and to fetch light blocks for the Backfill method. The more I've delved into this, the more incongruences I've come across.

As William has already pointed out, the p2p channel model obliges reactors to take responsibility for both fetching and supplying the relevant data over these channels. The light provider implementation will most likely be "request only" (light clients keep very little data, so I don't think they would be valuable as suppliers of light blocks). Hence, the problem is constructing infrastructure both for users running light clients that just want to request and verify the relevant data, and for full nodes that run the light reactor to support light clients and other nodes running state sync.

At the moment, the dispatcher encapsulates this "request only" pattern, and the light channel is thus split between supporting the dispatcher's requests and being wired directly to the state and block stores, which is necessary to "supply" light blocks.
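The "one request per peer" bookkeeping that the dispatcher's doc comment describes can be sketched as a pending map keyed by peer, with one reply channel per outstanding request. This is an illustrative sketch, not the real dispatcher; lightBlock and peerID stand in for types.LightBlock and p2p peer IDs:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type lightBlock struct{ Height int64 }
type peerID string

// dispatcher multiplexes light-block requests, allowing at most one
// outstanding request per peer.
type dispatcher struct {
	mtx     sync.Mutex
	pending map[peerID]chan lightBlock
}

func newDispatcher() *dispatcher {
	return &dispatcher{pending: make(map[peerID]chan lightBlock)}
}

// request registers a pending request for peer, returning an error if
// one is already outstanding.
func (d *dispatcher) request(peer peerID) (<-chan lightBlock, error) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if _, ok := d.pending[peer]; ok {
		return nil, errors.New("request already outstanding for peer")
	}
	ch := make(chan lightBlock, 1) // buffered so respond never blocks
	d.pending[peer] = ch
	return ch, nil
}

// respond routes an incoming light block to the waiting caller, if any.
func (d *dispatcher) respond(peer peerID, lb lightBlock) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if ch, ok := d.pending[peer]; ok {
		ch <- lb
		delete(d.pending, peer)
	}
}

func main() {
	d := newDispatcher()
	ch, _ := d.request("peerA")
	if _, err := d.request("peerA"); err != nil {
		fmt.Println("second request rejected:", err)
	}
	d.respond("peerA", lightBlock{Height: 42})
	fmt.Println("got light block at height", (<-ch).Height)
}
```

Holding the mutex only while touching the map, and using a buffered reply channel, keeps `respond` from ever blocking while the lock is held, which is one way the deadlock risk mentioned earlier can be contained.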

lb, err := r.dispatcher.LightBlock(ctxWithCancel, height, peer)
subCtx, cancel := context.WithTimeout(ctxWithCancel, lightBlockResponseTimeout)
defer cancel()
lb, err := r.dispatcher.LightBlock(subCtx, height, peer)
Contributor


love this change!


I agree. My only remaining concern is that a defer inside a loop stacks up state until the enclosing function returns. We can hack that with a closure (destructors, Go style), e.g.,

subctx, cancel := context.WithTimeout(ctx, lightBlockResponseTimeout)
lb, err := func() (*types.LightBlock, error) {
  defer cancel()
  return r.dispatcher.LightBlock(subctx, height, peer)
}()

or by explicitly calling cancel after the call but before we check the errors. The latter is actually easier, but the attraction of a defer, which runs even in case of a panic, is understandable.


@creachadair creachadair left a comment


Most of my remaining comments are optional and cosmetic. I do have a few questions that I think are worth considering, but I don't see anything I think needs to block merging unless the answers are more complicated than I realized.

@@ -884,15 +884,46 @@ func (cfg *MempoolConfig) ValidateBasic() error {

// StateSyncConfig defines the configuration for the Tendermint state sync service
type StateSyncConfig struct {


Thank you, these comments are a really great improvement! 🎉

// with net.Dial, for example: "host.example.com:2125"
RPCServers []string `mapstructure:"rpc-servers"`

// The hash and height of a trusted block. Must be within the trust-period.


When you say "within the trust period" does that mean the block's commit timestamp has to be within that interval before the moment at which we're doing state sync?

Contributor Author


Exactly.

block.Time + trustPeriod has to be greater than time.Now()

// one day less than the unbonding period should suffice.
TrustPeriod time.Duration `mapstructure:"trust-period"`

// Time to spend discovering snapshots before initiating a restore.


Does "restore" in this context mean the state sync process? Or is a restore what we do if we can't state sync (e.g., because we didn't find any viable snapshots)?

Contributor Author


Restore means the state sync process, or specifically the process of transferring chunks of state from one application to another. Basically, we wait this long to receive snapshots from peers, pick the "best" snapshot, and then ask peers for the chunks of state corresponding to that snapshot (a snapshot is just metadata, i.e. height, format, hashes).


@creachadair creachadair Sep 1, 2021


Ah, I see. So the statesync cycle has two phases: Discovery followed by apply, and this duration determines how long we spend in the lobby. I thought this was meant to be a timeout, but I guess it's more of a cutoff? (E.g., we may still have snapshots available, but after this amount of time we'll stop and apply a batch)

What happens if no snapshots are found during the discovery period?
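The two-phase cycle being discussed (a discovery window, then selecting a snapshot to apply) can be sketched as follows. The snapshot shape and the highest-height-first selection rule are illustrative assumptions; the real syncer applies more criteria than this:

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot holds just the metadata peers advertise, per the comment
// above: height, format (hashes omitted here for brevity).
type snapshot struct {
	Height uint64
	Format uint32
}

// bestSnapshot sketches the selection step after the discovery window:
// gather advertised snapshots for a fixed duration, then prefer the
// highest height, breaking ties (arbitrarily here) by newest format.
// An empty result signals that discovery found nothing and the node
// would have to retry another discovery window.
func bestSnapshot(found []snapshot) (snapshot, bool) {
	if len(found) == 0 {
		return snapshot{}, false
	}
	sort.Slice(found, func(i, j int) bool {
		if found[i].Height != found[j].Height {
			return found[i].Height > found[j].Height
		}
		return found[i].Format > found[j].Format
	})
	return found[0], true
}

func main() {
	s, ok := bestSnapshot([]snapshot{
		{Height: 90, Format: 1},
		{Height: 100, Format: 1},
	})
	fmt.Println(ok, s.Height) // true 100
}
```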


@faddat
Contributor

faddat commented Sep 1, 2021

This is just as exciting as everyone's said it is. It will enable teeny tiny tendermints.



Development

Successfully merging this pull request may close these issues.

statesync: use p2p state provider as default
