fix(blocksync)!: don't block in blocksync if our voting power is blocking the chain by sergio-mena · Pull Request #3406 · cometbft/cometbft

sergio-mena · 2024-07-03T10:16:04Z

Partially addresses #3415

If a node has no peers, blocksync gets stuck without switching to consensus, because it needs info from other peers to have an idea of the maximum height.

However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and the other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blocksync if the only condition is that we don't have peers.

Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following:

I am a node and I am starting... shall I run blocksync?
Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync
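
The reasoning above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `validator` type and the functions `blocksChain` and `shouldRunBlocksync` are hypothetical names introduced here, and the `>= 1/3` threshold follows the description above (the review below leaves the `>=1/3` vs `>2/3` question open). It also assumes voting powers are small enough that `3*local` does not overflow.

```go
package main

import "fmt"

// validator is a hypothetical, simplified stand-in for a validator set entry.
type validator struct {
	address string
	power   int64
}

// blocksChain reports whether the validator at localAddr holds at least 1/3
// of the total voting power, i.e. the chain cannot advance without it.
// Hypothetical helper; the threshold is the one from the PR description.
func blocksChain(vals []validator, localAddr string) bool {
	var total, local int64
	for _, v := range vals {
		total += v.power
		if v.address == localAddr {
			local = v.power
		}
	}
	if total == 0 {
		return false
	}
	// local/total >= 1/3 is equivalent to 3*local >= total,
	// which avoids integer-division rounding.
	return 3*local >= total
}

// shouldRunBlocksync captures the startup question: run blocksync only if it
// is enabled and the local node is not itself blocking the chain.
func shouldRunBlocksync(enabled bool, vals []validator, localAddr string) bool {
	return enabled && !blocksChain(vals, localAddr)
}

func main() {
	solo := []validator{{"a", 10}}
	minority := []validator{{"a", 1}, {"b", 3}}
	fmt.Println(shouldRunBlocksync(true, solo, "a"))     // sole validator: skip blocksync
	fmt.Println(shouldRunBlocksync(true, minority, "a")) // 1/4 of the power: run blocksync
}
```

Comparing `3*local` against `total` keeps the check in integer arithmetic, so a validator with exactly 1/3 of the power is treated as blocking.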

Explanation of commits:

Commit 1: e2e testbed reproducing the issue
Commit 2: commit with a trivial change to trigger e2e tests. Check the error: ❌ next to the commit hash (3fb1057)
Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that the e2e tests are passing now.
Commit 4: Revert commit 2
Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason
Commit 6: Fixed unit tests

All further commits: addressing other comments and tidying up the code

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
~~[ ] Updated relevant documentation (docs/ or spec/) and code comments~~
Title follows the Conventional Commits spec

This reverts commit 3fb1057.

cason

Not sure about this workaround.

We should know whether we should run block sync outside the protocol. But, ok, it works. But by changing the block Reactor constructor, we are breaking a lot of code.

internal/blocksync/pool.go

internal/blocksync/reactor.go

node/setup.go

internal/blocksync/reactor.go

internal/blocksync/reactor_test.go

internal/blocksync/reactor.go

cason

I would approve, but the >=1/3 vs >2/3 question remains open.

See associated comment (line 515).

internal/blocksync/reactor.go

internal/blocksync/reactor_test.go

internal/blocksync/reactor.go

.changelog/unreleased/bug-fixes/3406-blocksync-dont-stall-if-blocking-chain.md

Co-authored-by: Daniel <daniel.cason@informal.systems>

ValarDragon · 2024-07-04T06:56:28Z

I think it's definitely safe to backport, it can't really affect mainnets as you need one Val w/ over 1/3 to do anything. (And it only helps users right now if it's on the 38 line)

cason · 2024-07-04T07:21:49Z

Is this safe to backport to 0.38/v1 (post-rc1 v1)?

We need to find a solution for v0.37.x too...

sergio-mena · 2024-07-04T08:51:12Z

> Is this safe to backport to 0.38/v1 (post-rc1 v1)?

To me, it's a bug, so unless a big risk is identified I'd backport it. Besides, this is clearly holding back teams that are on v0.38.x/v0.37.x. Please reply if you don't agree.

@cason

…king the chain (#3406) (cherry picked from commit bd95579) # Conflicts: # internal/blocksync/reactor.go

@cason

…king the chain (#3406) (cherry picked from commit bd95579) # Conflicts: # .changelog/v0.38.3/bug-fixes/3406-blocksync-dont-stall-if-blocking-chain.md # blocksync/reactor.go # blocksync/reactor_test.go # node/node.go

@cason

…king the chain (#3406) (cherry picked from commit bd95579) # Conflicts: # blocksync/reactor_test.go # internal/blocksync/reactor.go # node/node.go # node/setup.go

… is blocking the chain (#3406)" This reverts commit b069346.

@cason

…king the chain (#3406) Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…king the chain (backport #3406) (#3420) This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

… is blocking the chain (#3406)" This reverts commit 1b99304.

@cason

…king the chain (#3406) Co-authored-by: Daniel <daniel.cason@informal.systems>

… is blocking the chain (#3406)" This reverts commit 7f268b0.

@cason

…king the chain (#3406) Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…ing the chain (backport #3406) (#3421) This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…ing the chain (backport #3406) (#3422) This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

Contributes to #3415 This is mainly refactoring to simplify `onlyValidatorIsUs` and `localNodeBlocksTheChain` (since the latter implies the former). It is a follow-up of #3406 (this is the part of #3406 that doesn't need to be backported) --- #### PR checklist - ~[ ] Tests written/updated~ - ~[ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)~ - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

@cason

…ing the chain (backport cometbft#3406) (cometbft#3421) This is an automatic backport of pull request cometbft#3406 done by [Mergify](https://mergify.com). Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

* Added votes to header + added secp256k1 + other changes * updated import * txHash fix+update canonical rep * removed sig size * docs: fix consensus spec formatting (cometbft#3804) * abci/server: recover from app panics in socket server (cometbft#3809) fixes cometbft#3800 * abci/client: fix DATA RACE in gRPC client (cometbft#3798) * Remove go func {}() closes #357 - Remove go func(){}() that caused race condiditon - To reproduce - add -race in make file to `install_abci` - Remove `CGO_ENABLED=0` & add -race to `install` Signed-off-by: Marko Baricevic <marbar3778@yahoo.com> * remove -race * fix data race also, reorder callbacks similarly to socket client * docs: "Writing a built-in Tendermint Core application in Go" guide (cometbft#3608) * docs: go built-in guide * fix package imports, add badger db, simplify Query * newTendermint function * working example * finish the first guide * add one more note * add the second Golang guide - external ABCI app * fix typos * libs: Remove db from tendermint in favor of tendermint/tm-cmn (cometbft#3811) * Remove db from tendemrint in favor of tendermint/tm-cmn - remove db from `libs` - update dependancy, there have been no breaking changes in the updated deps - https://github.com/grpc/grpc-go/releases - https://github.com/golang/protobuf/releases Signed-off-by: Marko Baricevic <marbar3778@yahoo.com> * changelog add * gofmt * more gofmt * docs: add A TOC to the Readme.md of ADR Section (#3820) * ADR TOC in readme.md * Added A TOC to the Readme.md of ADR Section - Added table of contents to the Readme of the architecture section. - Easier to traverse and when you know what is there. 
- If the Adr's become viewable online it would help guide the user Signed-off-by: Marko Baricevic <marbar3778@yahoo.com> * add tm-cmn to subprojects * normalize word * rpc: make max_body_bytes and max_header_bytes configurable (cometbft#3818) * rpc: make max_body_bytes and max_header_bytes configurable * update changelog pending * p2p/conn: Add Bufferpool (cometbft#3664) * use byte buffer pool to decreass allocs * wrap to put buffer in defer * wapper defer * add dependency * remove Gopkg,* * add change log * rpc: /broadcast_evidence (cometbft#3481) * implement broadcast_duplicate_vote endpoint * fix test_cover * address comments * address comments * Update abci/example/kvstore/persistent_kvstore.go Co-Authored-By: mossid <torecursedivine@gmail.com> * Update rpc/client/main_test.go Co-Authored-By: mossid <torecursedivine@gmail.com> * address comments in progress * reformat the code * make linter happy * make tests pass * replace BroadcastDuplicateVote with BroadcastEvidence * fix test * fix endpoint name * improve doc * fix TestBroadcastEvidenceDuplicateVote * Update rpc/core/evidence.go Co-Authored-By: Thane Thomson <connect@thanethomson.com> * add changelog entry * fix TestBroadcastEvidenceDuplicateVote * mempool: make max_msg_bytes configurable (cometbft#3826) * mempool: make max_msg_bytes configurable * apply suggestions from code review * update changelog pending * apply suggestions from code review again * rpc: return err if page is incorrect (less than 0 or greater than tot… (cometbft#3825) * rpc: return err if page is incorrect (less than 0 or greater than total pages) Fixes cometbft#3813 * fix rpc_test * blockchain: Reorg reactor (cometbft#3561) * go routines in blockchain reactor * Added reference to the go routine diagram * Initial commit * cleanup * Undo testing_logger change, committed by mistake * Fix the test loggers * pulled some fsm code into pool.go * added pool tests * changes to the design added block requests under peer moved the request trigger 
in the reactor poolRoutine, triggered now by a ticker in general moved everything required for making block requests smarter in the poolRoutine added a simple map of heights to keep track of what will need to be requested next added a few more tests * send errors to FSM in a different channel than blocks send errors (RemovePeer) from switch on a different channel than the one receiving blocks renamed channels added more pool tests * more pool tests * lint errors * more tests * more tests * switch fast sync to new implementation * fixed data race in tests * cleanup * finished fsm tests * address golangci comments :) * address golangci comments :) * Added timeout on next block needed to advance * updating docs and cleanup * fix issue in test from previous cleanup * cleanup * Added termination scenarios, tests and more cleanup * small fixes to adr, comments and cleanup * Fix bug in sendRequest() If we tried to send a request to a peer not present in the switch, a missing continue statement caused the request to be blackholed in a peer that was removed and never retried. While this bug was manifesting, the reactor kept asking for other blocks that would be stored and never consumed. Added the number of unconsumed blocks in the math for requesting blocks ahead of current processing height so eventually there will be no more blocks requested until the already received ones are consumed. 
* remove bpPeer's didTimeout field * Use distinct err codes for peer timeout and FSM timeouts * Don't allow peers to update with lower height * review comments from Ethan and Zarko * some cleanup, renaming, comments * Move block execution in separate goroutine * Remove pool's numPending * review comments * fix lint, remove old blockchain reactor and duplicates in fsm tests * small reorg around peer after review comments * add the reactor spec * verify block only once * review comments * change to int for max number of pending requests * cleanup and godoc * Add configuration flag fast sync version * golangci fixes * fix config template * move both reactor versions under blockchain * cleanup, golint, renaming stuff * updated documentation, fixed more golint warnings * integrate with behavior package * sync with master * gofmt * add changelog_pending entry * move to improvments * suggestion to changelog entry * Renamed wire.go to codec.go (cometbft#3827) * Renamed wire.go to codec.go - Wire was the previous name of amino - Codec describes the file better than `wire` & `amino` Signed-off-by: Marko Baricevic <marbar3778@yahoo.com> * ide error * rename amino.go to codec.go * docs: add guides to docs (cometbft#3830) * add staticcheck linting (cometbft#3828) cleanup to add linter grpc change: https://godoc.org/google.golang.org/grpc#WithContextDialer https://godoc.org/google.golang.org/grpc#WithDialer grpc/grpc-go#2627 prometheous change: due to UninstrumentedHandler, being deprecated in the future empty branch = empty if or else statement didn't delete them entirely but commented couldn't find a reason to have them could not replicate the issue cometbft#3406 but if want to keep it commented then we should comment out the if statement as well * types: move MakeVote / MakeBlock functions (cometbft#3819) to the types package Paritally Fixes cometbft#3584 * p2p: Fix error logging for connection stop (cometbft#3824) * p2p: fix false-positive error logging when stopping 
connections This changeset fixes two types of false-positive errors occurring during connection shutdown. The first occurs when the process invokes FlushStop() or Stop() on a connection. While the previous behavior did properly wait for the sendRoutine to finish, it did not notify the recvRoutine that the connection was shutting down. This would cause the recvRouting to receive and error when reading and log this error. The changeset fixes this by notifying the recvRoutine that the connection is shutting down. The second occurs when the connection is terminated (gracefully) by the other side. The recvRoutine would get an EOF error during the read, log it, and stop the connection with an error. The changeset detects EOF and gracefully shuts down the connection. * bring back the comment about flushing * add changelog entry * listen for quitRecvRoutine too * we have to call stopForError Otherwise peer won't be removed from the peer set and maybe readded later. * p2p: Do not write 'Couldn't connect to any seeds' if there are no seeds (cometbft#3834) * Do not write 'Couldn't connect to any seeds' if there are no seeds * changelog * remove privValUpgrade * Fix typo in changelog * Update CHANGELOG_PENDING.md Co-Authored-By: Marko <marbar3778@yahoo.com> I'm setting up all peers dynamically by calling dial_peers, so p2p.seeds in configs is empty, and I'm seeing error log a lot in logs. 
* docs: add a footer to guides (cometbft#3835) * docs: "Writing a Tendermint Core application in Kotlin (gRPC)" guide (cometbft#3838) * add abci grpc kotlin guide * Update docs/guides/kotlin.md Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com> * Update docs/guides/kotlin.md Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com> * Update docs/guides/kotlin.md Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com> * Update kotlin.md * node: allow replacing existing p2p.Reactor(s) (cometbft#3846) * node: allow replacing existing p2p.Reactor(s) using [`CustomReactors` option](https://godoc.org/github.com/tendermint/tendermint/node#CustomReactors). Warning: beware of accidental name clashes. Here is the list of existing reactors: MEMPOOL, BLOCKCHAIN, CONSENSUS, EVIDENCE, PEX. * check the absence of "CUSTOM" prefix * merge 2 tests * add doc.go to node package * gocritic (1/2) (cometbft#3836) Add gocritic as a linter The linting is not complete, but should i complete in this PR or in a following. 23 files have been touched so it may be better to do in a following PR Commits: * Add gocritic to linting - Added gocritic to linting Signed-off-by: Marko Baricevic <marbar3778@yahoo.com> * gocritic * pr comments * remove switch in cmdBatch * tm-cmn to tm-db (cometbft#3850) * tm-cmn to tm-db * go.mod changes * go.mod changes * more go.mod * fix tm-db * ci fix, pending change * version tmdb (cometbft#3854) * txindexer: Refactor Tx Search Aggregation (cometbft#3851) - Replace the previous intersect call, which was called at each query condition, with a map intersection. 
- Replace fmt.Sprintf with string() closes: cometbft#3076 Benchmarks

```
Old
goos: darwin
goarch: amd64
pkg: github.com/tendermint/tendermint/state/txindex/kv
BenchmarkTxSearch-4   200   103641206 ns/op   7998416 B/op   71171 allocs/op
PASS
ok   github.com/tendermint/tendermint/state/txindex/kv   26.019s

New
goos: darwin
goarch: amd64
pkg: github.com/tendermint/tendermint/state/txindex/kv
BenchmarkTxSearch-4   1000   38615024 ns/op   13515226 B/op   166460 allocs/op
PASS
ok   github.com/tendermint/tendermint/state/txindex/kv   53.618s
```

~62% performance improvement Commits: * Refactor tx search * Add pending changelog entry * Add tx search benchmarking * remove intermediate hashes list also reset timer in BenchmarkTxSearch and fix other benchmark * fix import * Add test cases * Fix searching * Replace fmt.Sprintf with string * Update state/txindex/kv/kv.go Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com> * Rename params * Cleanup * Check error in benchmarks * release for v0.32.2 * Merge PR cometbft#3860: Update log v0.32.2 * changelog updates * pr comments * Fix for panic in signature verification if a peer sends a nil public key. 
* update version.go * Changelog update * Update CHANGELOG.md Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com> * update changelog * p2p: only allow ed25519 pubkeys when connecting also, recover from any possible failures in acceptPeers Refs cometbft#4030 * update changelog and bump version to v0.32.6 * set the date to today * cs: panic only when WAL#WriteSync fails - modify WAL#Write and WAL#WriteSync to return an error * types: validate Part#Proof add ValidateBasic to crypto/merkle/SimpleProof * cs: limit max bit array size and block parts count * cs: test new limits * cs: only assert important stuff * update changelog and bump version to 0.32.7 * fixes after Ethan's review * align max wal msg and max consensus msg sizes * fix tests * fix test * use bor * add data in commit * remove votes from header * new: add proposal results in vote * fix: go mod * new: add sidechannel proto objects * new: add begin side blocker and deliver side tx * new: add side tx results in begin side block * add: add side tx results into request begin side-block * chg: add address in sig object * chg: add events in side block * chg: allow empty sig * chg: add flag to execute side-tx while not syncing * chg: remove data from vote * fix: use last byte on bigendian bytes * fix: call sidetx result for string method * feat: add rollback feature * Use bor version v0.2.16 * Change log level tag from a single character to a full word This will change logging format from: D[2016-05-02|11:06:44.322] to: DEBUG[2016-05-02|11:06:44.322] The purpose is to unify the logging with bor. 
* consensus,scripts,state,store,types: change PartSetHeader total to uint32 * libs/log: add warn log level (cometbft#27) * libs/log: add warn log level * mardizzone/POS-1609: dev: chg: bump btcd dep and solve related issues * mardizzone/POS-1609: dev: chg: solve vulnerabilities associated with some packages * mardizzone/POS-1609: dev: chg: update bor version and replace tm-db * mardizzone/POS-1609: dev: chg: bump go version * mardizzone/POS-1609: dev: chg: bump go version to latest patch * Changed the value of default maxNumInboundPeers and maxNumOutboundPeers * made Stopping peer for error log as debug (cometbft#30) * made dialing failed log as debug (cometbft#31) * Added log to print number of peers (cometbft#32) * added log to print number of peers * update * peppermint: changes to crypto * Modified NewFilePV to generate secp256k1 * (temporarily) allow both tendermint/P*KeySecp256k1 and comet/P*KeySecp256k1Uncompressed to ease migration * Forward-port disabled `MaxSignatureSize` checks (+ new ones needed) * cherry pick secp256k1 migration commits + go mod tidy * blocksync,consensus,crypto,libs,types: fix tests and more conflicts * consensus,libs,types: fix tests, vulns from govuln and some lint errors * ci: bump go version to 1.21.4 * Fixed `TestPubKeySecp256k1Address` * crypto: enforce curve group order checks in genPrivKey * abci,crypto: fix conflicts and tests * types: fix TestInvalidPrecommitExtensions * fix lint * Extend kvstore example add with key types * Fix `TestReactorValidatorSetChanges` * Fix UTs in `execution_test.go` * Fix `TestEvidencePoolBasic` * Fix `TestVoteExtension` * test/e2e: use go 1.21.4 in docker * test/e2e: use secp256k1 as default key type in testnet setup * p2p/conn: use secp256k1 for p2p authentication * p2p/conn: allow both secp256k1 and ed25519 key types for authentication * all: address PR comments * types,blocksync: fix lint + tests + bump deps complained by govuln * crypto,state,test: resolve conflicts from v0.38.5 * abci: 
resolve conflicts from v0.38.5 * resolve go mod deps * Revert "Merge branch 'v0.38.5-upstream' into raneet10/peppermint-changes" This reverts commit 2706fc9, reversing changes made to e404e0f. * Revert "Revert "Merge branch 'v0.38.5-upstream' into raneet10/peppermint-changes"" This reverts commit fc56973. * all: fix issue from merge * docs: remove Warn log definition from ADR * state: remove outdated comments * types: increase MaxSignatureSize to 65 and unskip related tests * cmd: minor refactor Co-authored-by: Sergio Mena <sergio@informal.systems> * libs/protoio: minor refactor Co-authored-by: Sergio Mena <sergio@informal.systems> * libs/pubsub: minor refactor Co-authored-by: Sergio Mena <sergio@informal.systems> * state: minor refactor Co-authored-by: Sergio Mena <sergio@informal.systems> * state: minor restructure in test Co-authored-by: Sergio Mena <sergio@informal.systems> * types: fix TestMaxCommitBytes + lint * state,types: fix TestTxFilter and TestBlockMaxDataBytes * types: fix TestBlockMaxDataBytesNoEvidence * types: fix TestInvalidPrecommitExtensions * abci,types: address comments * crypto,proto: add secp256k1_uncompressed oneof in PublicKey proto message type * remove revive from .golangci.yml * remove replace of go-ethereum dep with bor and go mod tidy --------- Co-authored-by: vaibhavchellani <vaibhavchellani223@gmail.com> Co-authored-by: Alex Dupre <sysadmin@alexdupre.com> Co-authored-by: Roman Useinov <roman.useinov@gmail.com> Co-authored-by: Marko <marbar3778@yahoo.com> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com> Co-authored-by: Jun Kimura <junkxdev@gmail.com> Co-authored-by: zjubfd <296179868@qq.com> Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com> Co-authored-by: folex <0xdxdy@gmail.com> Co-authored-by: Ivan Kushmantsev <kushmantsev@gmail.com> Co-authored-by: Alexander Bezobchuk <alexanderbez@users.noreply.github.com> Co-authored-by: Ethan Buchman <ethan@coinculture.info> Co-authored-by: Zaki Manian <zaki@manian.org> 
Co-authored-by: Zaki Manian <zaki@tendermint.com> Co-authored-by: Jaynti Kanani <jdkanani@gmail.com> Co-authored-by: Sai Kumar <sai@vitwit.com> Co-authored-by: Krishna Upadhyaya <krishnau1604@gmail.com> Co-authored-by: Jerry <jerrycgh@gmail.com> Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com> Co-authored-by: marcello33 <marcelloardizzone@hotmail.it> Co-authored-by: Vaibhav Jindal <vaibhavjindal29@gmail.com> Co-authored-by: VaibhavJindal <74560896+VAIBHAVJINDAL3012@users.noreply.github.com> Co-authored-by: Pratik Patil <pratikspatil024@gmail.com> Co-authored-by: Sergio Mena <sergio@informal.systems>

sergio-mena self-assigned this Jul 3, 2024

Regression testbed

0160866

sergio-mena force-pushed the sergio/blocksync-stalled-no-peers branch from 69ccf42 to 0160866 on July 3, 2024 11:54

sergio-mena added 3 commits July 3, 2024 13:59

Dummy commit to trigger e2e tests

3fb1057

Tentative fix

16a46ea

Revert "Dummy commit to trigger e2e tests"

b8fd5cf

This reverts commit 3fb1057.

cason reviewed Jul 3, 2024

internal/blocksync/pool.go (outdated, resolved)

internal/blocksync/reactor.go (outdated, resolved)

internal/blocksync/reactor.go (outdated, resolved)

cason reviewed Jul 3, 2024

node/setup.go (outdated, resolved)

cason reviewed Jul 3, 2024

node/setup.go (outdated, resolved)

cason reviewed Jul 3, 2024

internal/blocksync/reactor.go (outdated, resolved)

sergio-mena added 4 commits July 3, 2024 18:34

Moved the check for 'blocking the chain' outside the pool

b5a2c23

Fixed unit tests

464bc5b

Fix unit tests (leftover)

77c3fbf

Improve names, move function

c044662

sergio-mena commented Jul 3, 2024

internal/blocksync/reactor.go (outdated, resolved)

Update internal/blocksync/reactor.go

4bfa164

sergio-mena commented Jul 3, 2024

internal/blocksync/reactor_test.go (outdated, resolved)

Update internal/blocksync/reactor_test.go

257c592

sergio-mena commented Jul 3, 2024

internal/blocksync/reactor.go (outdated, resolved)

Add changelog

2105016

sergio-mena marked this pull request as ready for review July 3, 2024 17:55

sergio-mena requested a review from a team as a code owner July 3, 2024 17:55

sergio-mena requested a review from a team July 3, 2024 17:55

sergio-mena added the bug and block-sync labels Jul 3, 2024

cason reviewed Jul 3, 2024

sergio-mena and others added 3 commits July 3, 2024 21:02

1/3 of the voting power is enough to block the chain

7a4753e
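The reasoning behind this commit can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the code merged in the PR: the `validator` type and the `localNodeBlocksChain` function are hypothetical names, and the validator set is reduced to a plain slice. Because a block needs more than 2/3 of the voting power to be committed, a validator holding 1/3 or more of the total power can, by being absent, keep the chain from advancing.

```go
package main

import "fmt"

// validator is a hypothetical, simplified stand-in for a validator set entry.
type validator struct {
	Address     string
	VotingPower int64
}

// localNodeBlocksChain reports whether the validator at localAddr holds at
// least 1/3 of the total voting power. In that case the remaining validators
// hold at most 2/3, which is not enough to commit a block, so the chain
// cannot have advanced while the local node was offline.
// Assumption: voting powers are small enough that 3*local does not overflow.
func localNodeBlocksChain(vals []validator, localAddr string) bool {
	var total, local int64
	for _, v := range vals {
		total += v.VotingPower
		if v.Address == localAddr {
			local = v.VotingPower
		}
	}
	if total <= 0 {
		return false
	}
	// Compare 3*local >= total instead of local >= total/3 to avoid the
	// rounding of integer division.
	return 3*local >= total
}

func main() {
	three := []validator{{"A", 1}, {"B", 1}, {"C", 1}}
	four := []validator{{"A", 1}, {"B", 1}, {"C", 1}, {"D", 1}}

	fmt.Println(localNodeBlocksChain(three, "A")) // A holds exactly 1/3
	fmt.Println(localNodeBlocksChain(four, "A"))  // A holds 1/4
	fmt.Println(localNodeBlocksChain(three, "Z")) // Z is not a validator
}
```

With a check like this, a node that finds itself blocking the chain at its current height could skip waiting for peers in blocksync and switch to consensus directly.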

Rename myAddr to localAddr

ae36b45

Apply @cason's suggestions from code review

c178d1b

Co-authored-by: Daniel <daniel.cason@informal.systems>

cason mentioned this pull request Jul 3, 2024

blocksync: define the exact conditions for a node to attempt block syncing #3415

Open

4 tasks

sergio-mena added this pull request to the merge queue Jul 4, 2024

sergio-mena added the backport-to-v0.37.x and backport-to-v0.38.x labels Jul 4, 2024

Merged via the queue into main with commit bd95579 Jul 4, 2024

sergio-mena deleted the sergio/blocksync-stalled-no-peers branch July 4, 2024 08:54

mergify bot mentioned this pull request Jul 4, 2024

fix(blocksync)!: don't block in blocksync if our voting power is blocking the chain (backport #3406) #3420

Merged

3 tasks

This was referenced Jul 4, 2024

fix(blocksync): don't block in blocksync if our voting power is blocking the chain (backport #3406) #3421

Merged

fix(blocksync): don't block in blocksync if our voting power is blocking the chain (backport #3406) #3422

Merged

sergio-mena added a commit that referenced this pull request Jul 4, 2024

Revert "fix(blocksync)!: don't block in blocksync if our voting power…

dcf9c6c

… is blocking the chain (#3406)" This reverts commit b069346.

sergio-mena added a commit that referenced this pull request Jul 5, 2024

Revert "fix(blocksync)!: don't block in blocksync if our voting power…

d8ca2c0

… is blocking the chain (#3406)" This reverts commit 1b99304.

sergio-mena added a commit that referenced this pull request Jul 5, 2024

Revert "fix(blocksync)!: don't block in blocksync if our voting power…

18cad0c

… is blocking the chain (#3406)" This reverts commit 7f268b0.

sergio-mena mentioned this pull request Aug 28, 2024

refactor(node): simplify 'node is blocking chain' logic #3885

Merged

faddat mentioned this pull request Oct 15, 2024

chore: use latest cometbft-db in v0.38.x #4296

Closed

3 tasks

sergio-mena merged 14 commits into main from sergio/blocksync-stalled-no-peers


ValarDragon commented Jul 4, 2024

cason commented Jul 4, 2024

sergio-mena commented Jul 4, 2024


4 participants
