Skip to content

[WIP] fix non-determinism in consensus/state_test.go#2774

Closed
kevlubkcm wants to merge 343 commits intotendermint:developfrom
kevlubkcm:haltConsensusForTest
Closed

[WIP] fix non-determinism in consensus/state_test.go#2774
kevlubkcm wants to merge 343 commits intotendermint:developfrom
kevlubkcm:haltConsensusForTest

Conversation

@kevlubkcm
Copy link
Contributor

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

Need to make sure all of the events are on the same channel so that they actually "queue" up and pause the consensus.

Note that the default eventBus buffer capacity is already 0. This seems like a potential attack vector

@kevlubkcm
Copy link
Contributor Author

kevlubkcm commented Nov 7, 2018

This is another fix for issue #846

@codecov-io
Copy link

Codecov Report

Merging #2774 into develop will decrease coverage by <.01%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #2774      +/-   ##
===========================================
- Coverage    62.37%   62.36%   -0.01%     
===========================================
  Files          212      212              
  Lines        17210    17201       -9     
===========================================
- Hits         10734    10727       -7     
+ Misses        5578     5574       -4     
- Partials       898      900       +2
Impacted Files Coverage Δ
privval/ipc_server.go 64.15% <0%> (-5.67%) ⬇️
consensus/reactor.go 67.83% <0%> (-0.47%) ⬇️
state/errors.go 0% <0%> (ø) ⬆️
libs/db/remotedb/grpcdb/client.go 0% <0%> (ø) ⬆️

@kevlubkcm kevlubkcm changed the title put events on the same channel to pause consensus at the desired point [R4R] put events on the same channel to pause consensus at the desired point Nov 7, 2018
@kevlubkcm kevlubkcm changed the title [R4R] put events on the same channel to pause consensus at the desired point [R4R] put events on the same channel to pause consensus at the desired point in test Nov 7, 2018
@kevlubkcm kevlubkcm changed the title [R4R] put events on the same channel to pause consensus at the desired point in test [WIP] fix non-determinism in consensus/state_test.go Nov 7, 2018
leo-xinwang and others added 21 commits December 7, 2018 12:30
previously, we're turning it off only when --populate-persistent-peers
flag was used, which is obviously incorrect.

Fixes cosmos/cosmos-sdk#2983
This is only one part of tendermint#2989. We also need to fix the application,
and add rules to consensus to ensure this.
Fixes tendermint#2715

In crawlPeersRoutine, which is performed when seedMode is run, there is
logic that disconnects the peer's state information at 3-hour intervals
through the duration value. The duration value is calculated by
referring to the created value of MConnection. When MConnection is
created for the first time, the created value is not initiated, so it is
not disconnected every 3 hours but every time it is disconnected. So,
normal nodes are connected to seedNode and disconnected immediately, so
address exchange does not work properly.

https://github.com/tendermint/tendermint/blob/master/p2p/pex/pex_reactor.go#L629
This point is not work correctly.
I think,
https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L148
created variable is missing the current time setting.
Co-Authored-By: zramsay <zach.ramsay@gmail.com>
…endermint#2991)

(left after committing a block)

Fixes tendermint#2961

--------------
ORIGINAL ISSUE

Tendermint version : 0.26.4-b771798d

ABCI app : kv-store

Environment:

OS (e.g. from /etc/os-release): macOS 10.14.1 What happened: Set
mempool.recheck = false and create empty block = false in config.toml.
When transactions get added right between a new empty block is being
proposed and committed, the proposer won't propose new block for that
transactions immediately after. That transactions are stuck in the
mempool until a new transaction is added and trigger the proposer.

What you expected to happen: If there is a transaction left in the
mempool, new block should be proposed immediately.

Have you tried the latest version: yes

How to reproduce it (as minimally and precisely as possible): Fire two
transaction using broadcast_tx_sync with specific delay between them.
(You may need to do it multiple time before the right delay is found, on
my machine the delay is 0.98s)

Logs (paste a small part showing an error (< 10 lines) or link a
pastebin, gist, etc. containing more of the log file):
https://pastebin.com/0Wt6uhPF

Config (you can paste only the changes you've made): [mempool] recheck =
false create_empty_block = false

Anything else we need to know: In mempool.go, we found that proposer
will immediately propose new block if

Last committed block has some transaction (causing AppHash to changed)
or mem.notifyTxsAvailable() is called. Our scenario is as followed.

A transaction is fired, it will create 1 block with 1 tx (line 1-4 in
the log) and 1 empty block. After the empty block is proposed but before
it is committed, second transaction is fired and added to mempool. (line
8-16) Now, since the last committed block is empty and
mem.notifyTxsAvailable() will be called only if mempool.recheck = true.
The proposer won't immediately propose new block, causing the second
transaction to stuck in mempool until another transaction is added to
mempool and trigger mem.notifyTxsAvailable().
…so 0 (tendermint#3006)

* optimize addProposalBlockPart

* optimize addProposalBlockPart

* if ProposalBlockParts and LockedBlockParts both exist,let LockedBlockParts overwrite ProposalBlockParts.

* fix tryAddBlock

* broadcast lockedBlockParts in higher priority

* when appHeight==0, it's better fetch genDoc than state.validators.

* not save state if replay from height 1

* only save state if replay from height 1 when stateHeight is also 1

* only save state if replay from height 1 when stateHeight is also 1

* only save state if replay from height 0 when stateHeight is also 0

* handshake info's response version only update when stateHeight==0

* save the handshake responseInfo appVersion
* config: cors options are arrays of strings, not strings

Fixes tendermint#2980

* docs: update tendermint-core/configuration.html page

* set allow_duplicate_ip to false

* in `tendermint testnet`, set allow_duplicate_ip to true

Refs tendermint#2712

* fixes after Ismail's review

* Revert "set allow_duplicate_ip to false"

This reverts commit 24c1094.
* update changelog

* linkify

* changelog and version
Merge pull request tendermint#3023 from tendermint/release/v0.27.1
liamsi and others added 20 commits March 19, 2019 12:19
…op_into_release/0.31.0

Merge develop into release/0.31.0
* Make sure config.TimeoutBroadcastTxCommit < rpcserver.WriteTimeout()

* remove redundant comment

* libs/rpc/http_server: move Read/WriteTimeout into Config

* increase defaults for read/write timeouts

Based on this article
https://www.digitalocean.com/community/tutorials/how-to-optimize-nginx-configuration

* WriteTimeout should be larger than TimeoutBroadcastTxCommit

* set a deadline for subscribing to txs

* extract duration into const

* add two changelog entries

* Update CHANGELOG_PENDING.md

Co-Authored-By: melekes <anton.kalyaev@gmail.com>

* Update CHANGELOG_PENDING.md

Co-Authored-By: melekes <anton.kalyaev@gmail.com>

* 12 -> 10

* changelog

* changelog
* Update proposer-selection.md

* Fixed typos

* fixed typos

* Attempt to address some comments

* Update proposer-selection.md

* Update proposer-selection.md

* Update proposer-selection.md

Added the normalization step.

* Addressed review comments

* New example for normalization section

Added a new example to better show the need for normalization
Added requirement for changing validator set
Addressed review comments

* Fixed problem with R2

* fixed the math for new validator

* test

* more small updates

* Moved the centering above the round-robin election

- the centering is now done before the actual round-robin block
- updated examples
- cleanup

* change to reflect new implementation for new validator
* p2p: refactor Switch#Broadcast func

- call wg.Add only once
- do not call peers.List twice!
  * bad for perfomance
  * peers list can change in between calls!

Refs tendermint#3306

* p2p: use time.Ticker instead of RepeatTimer

no need in RepeatTimer since we don't Reset them

Refs tendermint#3306

* libs/common: remove RepeatTimer (also TimerMaker and Ticker interface)

"ancient code that’s caused no end of trouble" Ethan

I believe there's much simplier way to write a ticker than can be reset
https://medium.com/@arpith/resetting-a-ticker-in-go-63858a2c17ec
)

* blockchain: update the maxHeight when a peer is removed

Refs tendermint#2699

* add a changelog entry

* make linter pass
* comments on validator ordering

* NextValidatorsHash
…endermint#3466)

In order to re-enable the test harness for the KMS (see
tendermint/tmkms#227), we need some marginally more realistic proposals
and votes. This is because the KMS does some additional sanity checks
now to ensure the height and round are increasing over time.
Closes tendermint#1798

This is done by making every mempool tx maintain a list of peers who its received the tx from. Instead of using the 20byte peer ID, it instead uses a local map from peerID to uint16 counter, so every peer adds 2 bytes. (Word aligned to probably make it 8 bytes)

This also required resetting the callback function on every CheckTx. This likely has performance ramifications for instruction caching. The actual setting operation isn't costly with the removal of defers in this PR.

* Make the mempool not gossip txs back to peers its received it from

* Fix adversarial memleak

* Don't break interface

* Update changelog

* Forgot to add a mtx

* forgot a mutex

* Update mempool/reactor.go

Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>

* Update mempool/mempool.go

Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>

* Use unknown peer ID

Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>

* fix compilation

* use next wait chan logic when skipping

* Minor fixes

* Add TxInfo

* Add reverse map

* Make activeID's auto-reserve 0

* 0 -> UnknownPeerID

Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>

* Switch to making the normal case set a callback on the reqres object

The recheck case is still done via the global callback, and stats
are also set via global callback

* fix merge conflict

* Addres comments

* Add cache tests

* add cache tests

* minor fixes

* update metrics in reqResCb and reformat code

* goimport -w mempool/reactor.go

* mempool: update memTx senders

I had to introduce txsMap for quick mempoolTx lookups.

* change senders type from []uint16 to sync.Map

Fixes DATA RACE:

```
Read at 0x00c0013fcd3a by goroutine 183:
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).broadcastTxRoutine()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:195 +0x3c7

Previous write at 0x00c0013fcd3a by D[2019-02-27|10:10:49.058] Read PacketMsg                               switch=3 peer=35bc1e3558c182927b31987eeff3feb3d58a0fc5@127.0.0.1
:46552 conn=MConn{pipe} packet="PacketMsg{30:2B06579D0A143EB78F3D3299DE8213A51D4E11FB05ACE4D6A14F T:1}"
goroutine 190:
  github.com/tendermint/tendermint/mempool.(*Mempool).CheckTxWithInfo()
      /go/src/github.com/tendermint/tendermint/mempool/mempool.go:387 +0xdc1
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).Receive()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:134 +0xb04
  github.com/tendermint/tendermint/p2p.createMConnection.func1()
      /go/src/github.com/tendermint/tendermint/p2p/peer.go:374 +0x25b
  github.com/tendermint/tendermint/p2p/conn.(*MConnection).recvRoutine()
      /go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:599 +0xcce

Goroutine 183 (running) created at:
D[2019-02-27|10:10:49.058] Send                                         switch=2 peer=1efafad5443abeea4b7a8155218e4369525d987e@127.0.0.1:46193 channel=48 conn=MConn{pipe} m
sgBytes=2B06579D0A146194480ADAE00C2836ED7125FEE65C1D9DD51049
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).AddPeer()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:105 +0x1b1
  github.com/tendermint/tendermint/p2p.(*Switch).startInitPeer()
      /go/src/github.com/tendermint/tendermint/p2p/switch.go:683 +0x13b
  github.com/tendermint/tendermint/p2p.(*Switch).addPeer()
      /go/src/github.com/tendermint/tendermint/p2p/switch.go:650 +0x585
  github.com/tendermint/tendermint/p2p.(*Switch).addPeerWithConnection()
      /go/src/github.com/tendermint/tendermint/p2p/test_util.go:145 +0x939
  github.com/tendermint/tendermint/p2p.Connect2Switches.func2()
      /go/src/github.com/tendermint/tendermint/p2p/test_util.go:109 +0x50

I[2019-02-27|10:10:49.058] Added good transaction                       validator=0 tx=43B4D1F0F03460BD262835C4AA560DB860CFBBE85BD02386D83DAC38C67B3AD7 res="&{CheckTx:gas_w
anted:1 }" height=0 total=375
Goroutine 190 (running) created at:
  github.com/tendermint/tendermint/p2p/conn.(*MConnection).OnStart()
      /go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:210 +0x313
  github.com/tendermint/tendermint/libs/common.(*BaseService).Start()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:139 +0x4df
  github.com/tendermint/tendermint/p2p.(*peer).OnStart()
      /go/src/github.com/tendermint/tendermint/p2p/peer.go:179 +0x56
  github.com/tendermint/tendermint/libs/common.(*BaseService).Start()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:139 +0x4df
  github.com/tendermint/tendermint/p2p.(*peer).Start()
      <autogenerated>:1 +0x43
  github.com/tendermint/tendermint/p2p.(*Switch).startInitPeer()
```

* explain the choice of a map DS for senders

* extract ids pool/mapper to a separate struct

* fix literal copies lock value from senders: sync.Map contains sync.Mutex

* use sync.Map#LoadOrStore instead of Load

* fixes after Ismail's review

* rename resCbNormal to resCbFirstTime
Refs tendermint#3306, irisnet/tendermint@fdbb676

I ran an irishub validator. After the validator node ran several days, I dump the whole goroutine stack. I found that there were hundreds of broadcastTxRoutine. However, the connected peer quantity was less than 30. So I belive that there must be broadcastTxRoutine leakage issue.

According to my analysis, I think the root cause of this issue locate in below code:

		select {
		case <-next.NextWaitChan():
			// see the start of the for loop for nil check
			next = next.Next()
		case <-peer.Quit():
			return
		case <-memR.Quit():
			return
		}

As we know, if multiple paths are avaliable in the same time, then a random path will be selected. Suppose that next.NextWaitChan() and peer.Quit() are both avaliable, and next.NextWaitChan() is chosen.

                // send memTx
		msg := &TxMessage{Tx: memTx.tx}
		success := peer.Send(MempoolChannel, cdc.MustMarshalBinaryBare(msg))
		if !success {
			time.Sleep(peerCatchupSleepIntervalMS * time.Millisecond)
			continue
		}

Then next will be non-empty and the peer send operation won't be success. As a result, this go routine will be track into infinite loop and won't be released.

My proposal is to check peer.Quit() and memR.Quit() in every loop no matter whether next is nil.
…ndermint#3473)

I think it's nice when the Client interface has all the methods. If someone does not need a particular method/set of methods, she can use individual interfaces (e.g. NetworkClient, MempoolClient) or write her own interface.

technically breaking

Fixes tendermint#3458
Why submit this pr:

    we have suffered from infinite loop in addrbook bug which takes us a long time to find out why process become a zombie peer. It have been fixed in tendermint#3232. But the ADDRS_LOOP is still there, risk of infinite loop is still exist.
    The algorithm that to random pick a bucket is not stable, which means the peer may unluckily always choose the wrong bucket for a long time, the time and cpu cost is meaningless.

A simple improvement:
shuffle bucketsNew and bucketsOld, and pick necessary number of address from them. A stable
algorithm.
@melekes
Copy link
Contributor

melekes commented Mar 26, 2019

Is this still a WIP? Can we get this rebased against the latest develop?

@kevlubkcm
Copy link
Contributor Author

yes, this is still relevant. gimme a bit to rebase against develop

@kevlubkcm kevlubkcm requested a review from zramsay as a code owner March 26, 2019 17:07
@kevlubkcm
Copy link
Contributor Author

oops. i messed up the merge. lemme submit a new PR

@kevlubkcm kevlubkcm closed this Mar 26, 2019
iKapitonau pushed a commit to scrtlabs/tendermint that referenced this pull request Jul 10, 2024
…the JSON (backport tendermint#2774) (tendermint#2778)

This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted
for the fact that invoking NewBitArray(<=0) returns nil and hence when
dereferenced would crash with a runtime nil pointer dereference. This
bug was found by my security analysis and fuzzing too.

Author: @odeke-em 

Fixes cometbft/cometbft#2658

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request tendermint#2774 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
cboh4 pushed a commit to scrtlabs/tendermint that referenced this pull request Apr 7, 2025
…the JSON (backport tendermint#2774) (tendermint#2779)

This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted
for the fact that invoking NewBitArray(<=0) returns nil and hence when
dereferenced would crash with a runtime nil pointer dereference. This
bug was found by my security analysis and fuzzing too.

Author: @odeke-em 

Fixes cometbft/cometbft#2658

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request tendermint#2774 done by
[Mergify](https://mergify.com).

Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
cboh4 pushed a commit to scrtlabs/tendermint that referenced this pull request Apr 7, 2025
…the JSON (backport tendermint#2774) (tendermint#2780)

This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted
for the fact that invoking NewBitArray(<=0) returns nil and hence when
dereferenced would crash with a runtime nil pointer dereference. This
bug was found by my security analysis and fuzzing too.

Author: @odeke-em 

Fixes cometbft/cometbft#2658

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request tendermint#2774 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.