Skip to content

cometbft v1.0#760

Closed
jchappelow wants to merge 4 commits intotrufnetwork:mainfrom
chappjc:comet1.0
Closed

cometbft v1.0#760
jchappelow wants to merge 4 commits intotrufnetwork:mainfrom
chappjc:comet1.0

Conversation

@jchappelow
Copy link
Contributor

@jchappelow jchappelow commented May 21, 2024

DO NOT MERGE.

This completes the switch to cometbft v1.0 (from v0.38) started in #621

This was a quick first pass, but it amazingly seems to pass acceptance test, but I don't know about integ yet.

The motivation for doing this now is two fold:

  1. In performance evaluations by @charithabandi there are a number of unexplainable bottlenecks and issues with both under-utilization and extended block intervals. Possibly some of them are resolved in v1.x and we may not need to waste our time chasing the cause since we already know some of the issues are in cometbft code (e.g. the logging issue).
  2. We are closing in on v0.8 and updating to cometbft v1.0 on a live network appears to be a very difficult or impossible. While cometbft v1.0 is not yet released, it seems to be getting close with an alpha release almost two months ago. See their tags page.

There are a number of major changes:

  • An api module that is separate from their main module. Pretty much every variable name is changed.
  • timeout_prevote and timeout_commit are consolidate to timeout_vote. Surely this is a consensus breaker.
  • Their GenesisDoc type no longer has a ABCI field, and now has a Synchrony and Feature field
  • abciTypes.Ed25519ValidatorUpdate is gone, and we have to manually make the ed25519 crypto.PrivKey and construct a cmtAPIabci.ValidatorUpdate instace. This is actually minor.
  • cometNodes.GenesisDocProvider now returns a struct with the genesis doc and an optional hash. We might investigate using this to ensure the correct genesis file is used.
  • ValidatorSigner has a new SignBytes method and change method signatures. There are tweaks with extension signatures taken from upstream FilePV as well. We should audit their changes and ensure we have them.

So, there is a lot to investigate and research on their github or elsewhere to understand:

  • why have these things changed, what is the motivation?
  • what else may have changed?
  • how the heck can you upgrade smoothly without consensus breaking

@jchappelow
Copy link
Contributor Author

CometBFT v1.0 release plan: cometbft/cometbft#578
(looks nearly complete, down to tagging modules, but who knows)

@jchappelow
Copy link
Contributor Author

Back to draft now that CI ran once. (pass 🥳 )

@adizere
Copy link

adizere commented Jul 3, 2024

Hey,

Happy to comment on some of these. You are one of the first teams to experiment with upgrading to v1 so really appreciate the questions and thoughts here as they enable us to know what is important vs. unimportant.

timeout_prevote and timeout_commit are consolidate to timeout_vote. Surely this is a consensus breaker.

Actually this is not consensus breaking. (Ref) What is important is that validators/node operators coordinate and aware of the configuration change and maintain the configuration of timeout_vote consistent. But this change is not encoded does not affect content of blocks or hashes, so it is not consensus breaking.

Their GenesisDoc type no longer has a ABCI field, and now has a Synchrony and Feature field

Largely due to introduction of PBTS. Ref: https://informal.systems/blog/introducing-proposer-based-timestamps-pbts-in-cometbft

why have these things changed, what is the motivation?
what else may have changed?
how the heck can you upgrade smoothly without consensus breaking

The changelog should be comprehensive. I understand the feedback and feeling here is that too many things changed; is it less disruptive and more useful for you to have smaller, more frequent releases?

Actually if you do not enable PBTS I believe nothing in v1 is consensus breaking. I will double check this but your validators might not even need to coordinate in rolling out the upgrade - the proto changes might require coordination (let me confirm though). This is because the block protocol remains unchanged fundamentally in v1.

Separate topic: Here is a basic reference we have w.r.t upgrade taxonomy.

Glad to explain further or in a call if useful.

@jchappelow
Copy link
Contributor Author

@adizere Thanks! Sorry for the delay in responding -- you commented while I was on break.
I will review and get back to you ASAP. We will certainly appreciate any guidance that comes from cometbft/cometbft#3416

@charithabandi
Copy link
Contributor

charithabandi commented Aug 20, 2024

@adizere Hey, I tried rolling updates to comet1.0 build, here are my first set of observations:

  • Good news, No consensus breakage with a rolling upgrades. Tested with both scenarios with super majority of nodes in the new and old versions.
  • P2P panics:
    nodes on older versions panic as they receive messages from the nodes on newer versions, potentially due to the changes in this PR which introduces this new message (HasProposalBlockPartMessage). The panic closes the connection to the peer and reestablish it, and I observe tis behavior with every block.
  • Is panic necessary here? can the protocol be changed to ignore these messages and not cause the old nodes to panic?
  • What are the consequences of not consuming these HasProposalBlockPartMessage messages? Are there any liveness issues? I have not noticed any, but want to confirm. Also, will this force us to do a coordinated upgrades among the validators?

Node with older versions: Panicking
image

Peer disconnects with every block:
Screenshot 2024-08-20 at 12 06 10 PM

  • Logging issues:
    There are lots of noise from the mempool, which I don't think is necessary. Below are the screenshots
    Screenshot 2024-08-20 at 12 06 31 PM
    This is one other log message I used to observe a lot while doing stress test and even on smaller loads. But I haven't observed this anymore on comet1.0, looks like u guys fixed it. But I want to bring it up.
    Screenshot 2024-08-20 at 2 28 10 PM

@charithabandi
Copy link
Contributor

charithabandi commented Aug 20, 2024

Indexer works as expected with both versioned and non versioned endpoints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants