Enable local snapshot restore (backport: #733)#728
Enable local snapshot restore (backport: #733)#728yihuang wants to merge 2 commits intocometbft:v0.34.xfrom
Conversation
Sync state in Handshake via RPC CV Cleanup dead code
adizere
left a comment
There was a problem hiding this comment.
Thanks for the initial pass at this solution, very grateful for the work here!
I left some questions / suggestions. I would be glad to fix myself some of the concerns I raised directly in the dev branch, if that's ok with you @yihuang and @chillyvee; but will of course give you priority to make commits on the branch.
| if err := doHandshake(stateStore, state, blockStore, genDoc, eventBus, proxyApp, consensusLogger); err != nil { | ||
|
|
||
| // RPC recovery is required if blockStore Height is 0 | ||
| if blockStore.Height() == 0 || !stateSync { |
There was a problem hiding this comment.
Shouldn't we also check that state.LastBlockHeight == 0 in addition to blockStore.Height() == 0?
Something like:
| if blockStore.Height() == 0 || !stateSync { | |
| if (blockStore.Height() == 0 && state.LastBlockHeight == 0) || !stateSync { |
There was a problem hiding this comment.
This is worth discussing since there are gaps in knowledge that need to be filled. There are three cases where the comet height is zero.
- Actually genesis
- Statesync completely over the network as we use today
- LocalStatesync where the application.db is already restored, but comet is at 0
One of the cosmos-sdk to comet integration challenges for local state sync is communicating the height of appstate/application.db
In the initial implementation, an extra applicationd.db height parameter was passed from cosmos-sdk to comet. However a previous code review tried to eliminate that extra parameter and simply check if comet height was 0, and start RPC statesync.
However that actually causes some issues because a normal network statesync also has comet height == 0, but application.db is also 0.
We need an agreeable way to determine the height of application.db from cosmos-sdk. Since a previous code review determined a parameter was not desirable, is there a better way to query the application height? Is there an object we should look at, or make an ABCI call?
There was a problem hiding this comment.
I see, thanks for the detailed notes and history here, I wasn't aware of the subtlety here!
Is there an object we should look at, or make an ABCI call?
An ABCI call (info) would indeed provide the app height I believe. But not sure that's the best way.
There was a problem hiding this comment.
Originally it was added as a variable to configure when cosmos-sdk created the Node{}, but it was asked to be removed during code-review, probably to avoid creating a new dependency between cosmos-sdk and comet
Now that we know the removal causes an issue, it's up to us to pick a new way to compare the heights of comet vs cosmos-sdk app.
Is there general guidance on how to communicate variables between comet and cosmos-sdk?
There was a problem hiding this comment.
So, for context, a valid CometBFT state requires:
blockStore.Height() - 1 <= state.LastBlockHeight <= blockStore.Height()
So, if the commit of block at height blockStore.Height() has been completed, the state height should be the same. If it hasn't, as the node crashed in the while, the state height should be 1 unit below.
Having said so, if the state of CometBFT is valid and blockStore.Height() == 0 then state.LastBlockHeight == 0 is mandatory.
| // Optimistically build new state, so we don't discover any light client failures at the end. | ||
| state, err = h.stateProvider.State(pctx, appBlockHeight) | ||
| if err != nil { | ||
| h.logger.Error("failed to fetch and verify tendermint state", "err", err) |
There was a problem hiding this comment.
Nit: Should we use "comet" instead of "tendermint" ?
| if err == light.ErrNoWitnesses { | ||
| return sm.State{}, nil, err | ||
| } | ||
| return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
Can we be more specific in this error message? It seems to me that what happened here was that Comet's light client verified the app hash present in the local snapshot of app.db, and this app hash was incorrect. Is this accurate?
There was a problem hiding this comment.
Referencing func (s *lightClientStateProvider) State(ctx context.Context, height uint64) (sm.State, error) {
The error conditions are failure to VerifyLightBlockAtHeight , RPC Client Creation Failure (Bad URL), and failure to fetch consensus parameters for the height.
What do you think of wrapping the error to add detail?
There was a problem hiding this comment.
Yes, that's a good idea. Let's wrap the err to save the detail and return it to the caller.
There was a problem hiding this comment.
There is not snapshot involved here.
The error should say that the application state was rejected, instead.
| if err == light.ErrNoWitnesses { | ||
| return sm.State{}, nil, err | ||
| } | ||
| return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
Same comment as above: Would it be good to have more explicit error message here?
There was a problem hiding this comment.
What do you think of wrapping the error to add detail?
|
Haven't looked into detail, but find an unexpected behavior in test: when the app height is 0, statesync enabled, it still try to init with genesis.json, which don't happen with normal version. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
| if err == light.ErrNoWitnesses { | ||
| return sm.State{}, nil, err | ||
| } | ||
| return sm.State{}, nil, errors.New("snapshot was rejected") |
There was a problem hiding this comment.
There is not snapshot involved here.
The error should say that the application state was rejected, instead.
| if err := h.stateStore.Save(state); err != nil { | ||
| return sm.State{}, nil, err | ||
| } | ||
| if err := h.store.(*store.BlockStore).SaveSeenCommit(int64(appBlockHeight), commit); err != nil { |
There was a problem hiding this comment.
Following the original implementation, which is Node.startStateSync (I don't know the branch to refer, so no link, sorry), you should use state.LastBlockHeight here for compatibility. It should be equal to appBlockHeight, except for uint/int stuff, but for coherence it is better to follow the same approach.
| h.logger.Info("localSync resulting state.Version.Consensus.Block", "state", state.Version.Consensus.Block) | ||
|
|
||
| // Assume Heights Restored, Update Heights for edge cases and constraints on the storeBlockHeight and storeBlockBase. | ||
| storeBlockHeight = appBlockHeight |
There was a problem hiding this comment.
I would say that these values should be read from the stores, not blindly updated here.
| var err error | ||
| ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) | ||
| defer cancel() | ||
| stateProvider, err := statesync.NewLightClientStateProvider( |
There was a problem hiding this comment.
If we are not using this "local sync", this would error, right?
We have to still support the case in which the node is simply recovering, and will restore all state from disk.
| } | ||
|
|
||
| // Run state sync | ||
| // Statesync reqeusting all data from P2P |
There was a problem hiding this comment.
| // Statesync reqeusting all data from P2P | |
| // State sync fetching snapshots from peers |
| // Run state sync | ||
| // Statesync reqeusting all data from P2P | ||
| if n.stateSync { | ||
| // If state and blockStore.Height are both at the same height, skip the P2P Statesync and immediately enter consensus |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
Are we proceeding with this approach/solution attempt? If so, we should add the |
I think no need to pursue |
|
Closing, a different solution was chosen. |
|
Which other solution was chosen? Just had a break in schedule to work on the Juno v14 local state sync demo today. |
|
|
Thank you, will review |
|
nice code @yihuang :) |
Closes: #29
Original PR: tendermint/tendermint#9541
Sync state in Handshake via RPC
CV Cleanup dead code
PR checklist
.changelog(we use unclog to manage our changelog)docs/orspec/) and code comments