[WIP] fix non-determinism in consensus/state_test.go by kevlubkcm · Pull Request #2774 · tendermint/tendermint

kevlubkcm · 2018-11-07T14:38:20Z

Updated all relevant documentation in docs
Updated all code comments where relevant
Wrote tests
Updated CHANGELOG_PENDING.md

Need to make sure all of the events are on the same channel so that they actually "queue" up and pause the consensus.

Note that the default eventBus buffer capacity is already 0. This seems like a potential attack vector
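
A minimal sketch of why a single capacity-0 channel makes events "queue" and pause the producer, assuming a hypothetical `emitSteps` loop standing in for the consensus state machine (the step names are illustrative, not the real event types):

```go
package main

import "fmt"

// emitSteps is a hypothetical stand-in for the consensus loop: it publishes
// one event per step on a single shared channel.
func emitSteps(events chan<- string, steps []string) {
	for _, s := range steps {
		events <- s // with capacity 0 this blocks until the reader receives
	}
	close(events)
}

// observe drains the channel one event at a time, so the producer can never
// run ahead of the reader by more than the step it is currently blocked on.
func observe(steps []string) []string {
	events := make(chan string) // unbuffered, like a capacity-0 event bus
	go emitSteps(events, steps)
	var seen []string
	for ev := range events {
		seen = append(seen, ev)
	}
	return seen
}

func main() {
	fmt.Println(observe([]string{"NewRound", "Proposal", "Vote"}))
}
```

Because every event goes through the same channel, the test sees them in emission order; with separate channels per event type that ordering would not be guaranteed.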

kevlubkcm · 2018-11-07T14:38:38Z

This is another fix for issue #846

codecov-io · 2018-11-07T14:46:39Z

Codecov Report

Merging #2774 into develop will decrease coverage by <.01%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #2774      +/-   ##
===========================================
- Coverage    62.37%   62.36%   -0.01%     
===========================================
  Files          212      212              
  Lines        17210    17201       -9     
===========================================
- Hits         10734    10727       -7     
+ Misses        5578     5574       -4     
- Partials       898      900       +2

Impacted Files	Coverage Δ
privval/ipc_server.go	`64.15% <0%> (-5.67%)`	⬇️
consensus/reactor.go	`67.83% <0%> (-0.47%)`	⬇️
state/errors.go	`0% <0%> (ø)`	⬆️
libs/db/remotedb/grpcdb/client.go	`0% <0%> (ø)`	⬆️

Fixes tendermint#2715

In crawlPeersRoutine, which runs in seed mode, there is logic that disconnects a peer once its connection has lasted 3 hours. The duration is calculated from the created value of MConnection. When an MConnection is first created, created is never initialized, so the peer is disconnected not after 3 hours but every time the check runs. As a result, normal nodes that connect to a seed node are disconnected immediately, and address exchange does not work properly.

The check at https://github.com/tendermint/tendermint/blob/master/p2p/pex/pex_reactor.go#L629 therefore does not work correctly. I think the created variable at https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L148 is missing the current-time assignment.
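
A small sketch of the failure mode, assuming a hypothetical `conn` type in place of MConnection and a hypothetical `maxSeedConnAge` constant for the 3-hour limit. An unset `time.Time` is the zero time, so the computed age is enormous and the age check always fires:

```go
package main

import (
	"fmt"
	"time"
)

// maxSeedConnAge is a hypothetical name for the 3-hour limit described above.
const maxSeedConnAge = 3 * time.Hour

// conn is a hypothetical stand-in for MConnection.
type conn struct {
	created time.Time
}

// newConnBuggy forgets to set created, leaving it at the zero time.
func newConnBuggy() *conn { return &conn{} }

// newConnFixed records the creation time.
func newConnFixed() *conn { return &conn{created: time.Now()} }

// shouldDisconnect mirrors the age check performed by the crawler.
func shouldDisconnect(c *conn) bool {
	return time.Since(c.created) > maxSeedConnAge
}

func main() {
	fmt.Println(shouldDisconnect(newConnBuggy())) // true: zero time looks ancient
	fmt.Println(shouldDisconnect(newConnFixed())) // false: connection is fresh
}
```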

…endermint#2991) (left after committing a block)

Fixes tendermint#2961

-------------- ORIGINAL ISSUE

Tendermint version: 0.26.4-b771798d
ABCI app: kv-store
Environment: OS (e.g. from /etc/os-release): macOS 10.14.1

What happened: Set mempool.recheck = false and create empty block = false in config.toml. When transactions are added between the moment a new empty block is proposed and the moment it is committed, the proposer does not propose a new block for those transactions immediately afterwards. The transactions are stuck in the mempool until a new transaction is added and triggers the proposer.

What you expected to happen: If a transaction is left in the mempool, a new block should be proposed immediately.

Have you tried the latest version: yes

How to reproduce it (as minimally and precisely as possible): Fire two transactions using broadcast_tx_sync with a specific delay between them. (You may need to try several times before the right delay is found; on my machine the delay is 0.98s.)

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file): https://pastebin.com/0Wt6uhPF

Config (you can paste only the changes you've made):
[mempool]
recheck = false
create_empty_block = false

Anything else we need to know: In mempool.go, we found that the proposer will immediately propose a new block if the last committed block has some transactions (causing AppHash to change) or if mem.notifyTxsAvailable() is called. Our scenario is as follows. A transaction is fired; it creates 1 block with 1 tx (lines 1-4 in the log) and 1 empty block. After the empty block is proposed but before it is committed, a second transaction is fired and added to the mempool (lines 8-16). Now, since the last committed block is empty and mem.notifyTxsAvailable() is called only if mempool.recheck = true, the proposer won't immediately propose a new block, so the second transaction is stuck in the mempool until another transaction is added to the mempool and triggers mem.notifyTxsAvailable().
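
The scenario can be modelled with a toy mempool. This is only a sketch under assumptions: `toyMempool`, `add`, and `update` are hypothetical names, and the once-per-height `notified` flag is a simplification of the behaviour described in the issue, not the actual mempool.go code. The point is that `update` signals again whenever transactions are left over, regardless of the recheck setting:

```go
package main

import "fmt"

// toyMempool is a hypothetical single-goroutine model of the mempool.
type toyMempool struct {
	txs          []string
	notified     bool          // at most one signal per height
	txsAvailable chan struct{} // capacity 1, read by the proposer
}

func newToyMempool() *toyMempool {
	return &toyMempool{txsAvailable: make(chan struct{}, 1)}
}

// notifyTxsAvailable signals the proposer once per height without blocking.
func (m *toyMempool) notifyTxsAvailable() {
	if m.notified {
		return
	}
	m.notified = true
	select {
	case m.txsAvailable <- struct{}{}:
	default:
	}
}

// add stores a transaction and signals that transactions are available.
func (m *toyMempool) add(tx string) {
	m.txs = append(m.txs, tx)
	m.notifyTxsAvailable()
}

// update runs after a block is committed. It drops committed transactions
// and, if any are left, signals again even though no recheck took place.
func (m *toyMempool) update(committed map[string]bool) {
	m.notified = false
	kept := m.txs[:0]
	for _, tx := range m.txs {
		if !committed[tx] {
			kept = append(kept, tx)
		}
	}
	m.txs = kept
	if len(m.txs) > 0 {
		m.notifyTxsAvailable()
	}
}

func main() {
	m := newToyMempool()
	m.add("tx2")     // arrives while an empty block is in flight
	<-m.txsAvailable // that height's signal is consumed
	m.update(nil)    // the empty block commits; tx2 is still pending
	fmt.Println(len(m.txsAvailable)) // 1: the proposer is signalled again
}
```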

…so 0 (tendermint#3006)
* optimize addProposalBlockPart
* optimize addProposalBlockPart
* if ProposalBlockParts and LockedBlockParts both exist, let LockedBlockParts overwrite ProposalBlockParts.
* fix tryAddBlock
* broadcast lockedBlockParts in higher priority
* when appHeight==0, it's better fetch genDoc than state.validators.
* not save state if replay from height 1
* only save state if replay from height 1 when stateHeight is also 1
* only save state if replay from height 1 when stateHeight is also 1
* only save state if replay from height 0 when stateHeight is also 0
* handshake info's response version only update when stateHeight==0
* save the handshake responseInfo appVersion

* config: cors options are arrays of strings, not strings
  Fixes tendermint#2980
* docs: update tendermint-core/configuration.html page
* set allow_duplicate_ip to false
* in `tendermint testnet`, set allow_duplicate_ip to true
  Refs tendermint#2712
* fixes after Ismail's review
* Revert "set allow_duplicate_ip to false"
  This reverts commit 24c1094.

* Make sure config.TimeoutBroadcastTxCommit < rpcserver.WriteTimeout()
* remove redundant comment
* libs/rpc/http_server: move Read/WriteTimeout into Config
* increase defaults for read/write timeouts
  Based on this article https://www.digitalocean.com/community/tutorials/how-to-optimize-nginx-configuration
* WriteTimeout should be larger than TimeoutBroadcastTxCommit
* set a deadline for subscribing to txs
* extract duration into const
* add two changelog entries
* Update CHANGELOG_PENDING.md
  Co-Authored-By: melekes <anton.kalyaev@gmail.com>
* Update CHANGELOG_PENDING.md
  Co-Authored-By: melekes <anton.kalyaev@gmail.com>
* 12 -> 10
* changelog
* changelog
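
The constraint named in this commit (TimeoutBroadcastTxCommit must stay below the server's WriteTimeout) could be checked roughly as follows. The `rpcTimeouts` struct and its `validate` method are hypothetical, and the durations are illustrative values, not the project's defaults:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// rpcTimeouts is a hypothetical holder for the two settings being compared.
type rpcTimeouts struct {
	TimeoutBroadcastTxCommit time.Duration
	WriteTimeout             time.Duration
}

// validate rejects configurations where the HTTP write deadline would expire
// before a broadcast_tx_commit call is allowed to finish waiting.
func (c rpcTimeouts) validate() error {
	if c.TimeoutBroadcastTxCommit >= c.WriteTimeout {
		return errors.New("TimeoutBroadcastTxCommit must be less than WriteTimeout")
	}
	return nil
}

func main() {
	ok := rpcTimeouts{TimeoutBroadcastTxCommit: 5 * time.Second, WriteTimeout: 10 * time.Second}
	bad := rpcTimeouts{TimeoutBroadcastTxCommit: 10 * time.Second, WriteTimeout: 5 * time.Second}
	fmt.Println(ok.validate())
	fmt.Println(bad.validate())
}
```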

* Update proposer-selection.md
* Fixed typos
* fixed typos
* Attempt to address some comments
* Update proposer-selection.md
* Update proposer-selection.md
* Update proposer-selection.md
  Added the normalization step.
* Addressed review comments
* New example for normalization section
  Added a new example to better show the need for normalization
  Added requirement for changing validator set
  Addressed review comments
* Fixed problem with R2
* fixed the math for new validator
* test
* more small updates
* Moved the centering above the round-robin election
  - the centering is now done before the actual round-robin block
  - updated examples
  - cleanup
* change to reflect new implementation for new validator

* p2p: refactor Switch#Broadcast func
  - call wg.Add only once
  - do not call peers.List twice!
    * bad for performance
    * peers list can change in between calls!
  Refs tendermint#3306
* p2p: use time.Ticker instead of RepeatTimer
  no need for RepeatTimer since we don't Reset them
  Refs tendermint#3306
* libs/common: remove RepeatTimer (also TimerMaker and Ticker interface)
  "ancient code that’s caused no end of trouble" Ethan
  I believe there's a much simpler way to write a ticker that can be reset https://medium.com/@arpith/resetting-a-ticker-in-go-63858a2c17ec
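
A sketch of the Broadcast refactor described in the first bullet: take one snapshot of the peer list, call `wg.Add` once with its length, then fan out. The `peer` interface, `fakePeer`, and `broadcast` signature here are hypothetical simplifications, not the actual Switch API:

```go
package main

import (
	"fmt"
	"sync"
)

// peer is a hypothetical, reduced version of a p2p peer.
type peer interface {
	Send(msg []byte) bool
}

// fakePeer reports a fixed send result, for demonstration only.
type fakePeer struct{ ok bool }

func (p fakePeer) Send([]byte) bool { return p.ok }

// broadcast sends msg to every peer in the snapshot and returns how many
// sends succeeded. The caller lists peers once and passes that slice in, so
// the count given to wg.Add always matches the goroutines started.
func broadcast(peers []peer, msg []byte) int {
	var (
		wg        sync.WaitGroup
		mtx       sync.Mutex
		succeeded int
	)
	wg.Add(len(peers)) // single Add instead of one per iteration
	for _, p := range peers {
		go func(p peer) {
			defer wg.Done()
			if p.Send(msg) {
				mtx.Lock()
				succeeded++
				mtx.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return succeeded
}

func main() {
	peers := []peer{fakePeer{true}, fakePeer{false}, fakePeer{true}}
	fmt.Println(broadcast(peers, []byte("hello"))) // 2
}
```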

…endermint#3466) In order to re-enable the test harness for the KMS (see tendermint/tmkms#227), we need some marginally more realistic proposals and votes. This is because the KMS does some additional sanity checks now to ensure the height and round are increasing over time.

Closes tendermint#1798

This is done by making every mempool tx maintain a list of the peers it has received the tx from. Instead of using the 20-byte peer ID, it uses a local map from peerID to a uint16 counter, so every peer adds 2 bytes. (Word alignment probably makes it 8 bytes.) This also required resetting the callback function on every CheckTx. This likely has performance ramifications for instruction caching. The actual setting operation isn't costly with the removal of defers in this PR.

* Make the mempool not gossip txs back to peers its received it from
* Fix adversarial memleak
* Don't break interface
* Update changelog
* Forgot to add a mtx
* forgot a mutex
* Update mempool/reactor.go
  Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>
* Update mempool/mempool.go
  Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>
* Use unknown peer ID
  Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>
* fix compilation
* use next wait chan logic when skipping
* Minor fixes
* Add TxInfo
* Add reverse map
* Make activeID's auto-reserve 0
* 0 -> UnknownPeerID
  Co-Authored-By: ValarDragon <ValarDragon@users.noreply.github.com>
* Switch to making the normal case set a callback on the reqres object
  The recheck case is still done via the global callback, and stats are also set via global callback
* fix merge conflict
* Addres comments
* Add cache tests
* add cache tests
* minor fixes
* update metrics in reqResCb and reformat code
* goimport -w mempool/reactor.go
* mempool: update memTx senders
  I had to introduce txsMap for quick mempoolTx lookups.
* change senders type from []uint16 to sync.Map
  Fixes DATA RACE:
```
Read at 0x00c0013fcd3a by goroutine 183:
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).broadcastTxRoutine()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:195 +0x3c7

D[2019-02-27|10:10:49.058] Read PacketMsg switch=3 peer=35bc1e3558c182927b31987eeff3feb3d58a0fc5@127.0.0.1:46552 conn=MConn{pipe} packet="PacketMsg{30:2B06579D0A143EB78F3D3299DE8213A51D4E11FB05ACE4D6A14F T:1}"
Previous write at 0x00c0013fcd3a by goroutine 190:
  github.com/tendermint/tendermint/mempool.(*Mempool).CheckTxWithInfo()
      /go/src/github.com/tendermint/tendermint/mempool/mempool.go:387 +0xdc1
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).Receive()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:134 +0xb04
  github.com/tendermint/tendermint/p2p.createMConnection.func1()
      /go/src/github.com/tendermint/tendermint/p2p/peer.go:374 +0x25b
  github.com/tendermint/tendermint/p2p/conn.(*MConnection).recvRoutine()
      /go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:599 +0xcce

Goroutine 183 (running) created at:
D[2019-02-27|10:10:49.058] Send switch=2 peer=1efafad5443abeea4b7a8155218e4369525d987e@127.0.0.1:46193 channel=48 conn=MConn{pipe} msgBytes=2B06579D0A146194480ADAE00C2836ED7125FEE65C1D9DD51049
  github.com/tendermint/tendermint/mempool.(*MempoolReactor).AddPeer()
      /go/src/github.com/tendermint/tendermint/mempool/reactor.go:105 +0x1b1
  github.com/tendermint/tendermint/p2p.(*Switch).startInitPeer()
      /go/src/github.com/tendermint/tendermint/p2p/switch.go:683 +0x13b
  github.com/tendermint/tendermint/p2p.(*Switch).addPeer()
      /go/src/github.com/tendermint/tendermint/p2p/switch.go:650 +0x585
  github.com/tendermint/tendermint/p2p.(*Switch).addPeerWithConnection()
      /go/src/github.com/tendermint/tendermint/p2p/test_util.go:145 +0x939
  github.com/tendermint/tendermint/p2p.Connect2Switches.func2()
      /go/src/github.com/tendermint/tendermint/p2p/test_util.go:109 +0x50

I[2019-02-27|10:10:49.058] Added good transaction validator=0 tx=43B4D1F0F03460BD262835C4AA560DB860CFBBE85BD02386D83DAC38C67B3AD7 res="&{CheckTx:gas_wanted:1 }" height=0 total=375
Goroutine 190 (running) created at:
  github.com/tendermint/tendermint/p2p/conn.(*MConnection).OnStart()
      /go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:210 +0x313
  github.com/tendermint/tendermint/libs/common.(*BaseService).Start()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:139 +0x4df
  github.com/tendermint/tendermint/p2p.(*peer).OnStart()
      /go/src/github.com/tendermint/tendermint/p2p/peer.go:179 +0x56
  github.com/tendermint/tendermint/libs/common.(*BaseService).Start()
      /go/src/github.com/tendermint/tendermint/libs/common/service.go:139 +0x4df
  github.com/tendermint/tendermint/p2p.(*peer).Start()
      <autogenerated>:1 +0x43
  github.com/tendermint/tendermint/p2p.(*Switch).startInitPeer()
```
* explain the choice of a map DS for senders
* extract ids pool/mapper to a separate struct
* fix literal copies lock value from senders: sync.Map contains sync.Mutex
* use sync.Map#LoadOrStore instead of Load
* fixes after Ismail's review
* rename resCbNormal to resCbFirstTime
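
A rough sketch of the two ideas in this commit message: a local pool that maps long peer IDs to compact uint16 values (with 0 reserved for an unknown peer), and a per-transaction `sync.Map` of senders that can be read and written from different goroutines. The names `idPool`, `memTx`, `addSender`, and `shouldSendTo` are hypothetical, and counter overflow and ID reclamation are deliberately left out:

```go
package main

import (
	"fmt"
	"sync"
)

// unknownPeerID is reserved for transactions that did not come from a peer.
const unknownPeerID uint16 = 0

// idPool hands out a small local ID for each long peer ID string.
type idPool struct {
	mtx  sync.Mutex
	ids  map[string]uint16
	next uint16
}

func newIDPool() *idPool {
	return &idPool{ids: make(map[string]uint16), next: unknownPeerID + 1}
}

// idFor returns the existing local ID for peerID or assigns the next one.
func (p *idPool) idFor(peerID string) uint16 {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	if id, ok := p.ids[peerID]; ok {
		return id
	}
	id := p.next
	p.next++
	p.ids[peerID] = id
	return id
}

// memTx keeps the set of local peer IDs that already sent this transaction.
// It must be used through a pointer because sync.Map must not be copied.
type memTx struct {
	tx      []byte
	senders sync.Map // uint16 -> struct{}
}

// addSender records a sender and reports whether it was already known.
func (t *memTx) addSender(id uint16) bool {
	_, loaded := t.senders.LoadOrStore(id, struct{}{})
	return loaded
}

// shouldSendTo is false for peers the transaction was received from.
func (t *memTx) shouldSendTo(id uint16) bool {
	_, seen := t.senders.Load(id)
	return !seen
}

func main() {
	pool := newIDPool()
	a, b := pool.idFor("peerA"), pool.idFor("peerB")
	tx := &memTx{tx: []byte("abc")}
	tx.addSender(a)
	fmt.Println(tx.shouldSendTo(a), tx.shouldSendTo(b)) // false true
}
```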

Refs tendermint#3306, irisnet/tendermint@fdbb676

I ran an irishub validator. After the validator node had run for several days, I dumped the whole goroutine stack and found hundreds of broadcastTxRoutine goroutines, although fewer than 30 peers were connected. So I believe there must be a broadcastTxRoutine leak. According to my analysis, the root cause lies in the code below:

    select {
    case <-next.NextWaitChan():
        // see the start of the for loop for nil check
        next = next.Next()
    case <-peer.Quit():
        return
    case <-memR.Quit():
        return
    }

As we know, if multiple cases are ready at the same time, a random one is selected. Suppose that next.NextWaitChan() and peer.Quit() are both ready, and next.NextWaitChan() is chosen.

    // send memTx
    msg := &TxMessage{Tx: memTx.tx}
    success := peer.Send(MempoolChannel, cdc.MustMarshalBinaryBare(msg))
    if !success {
        time.Sleep(peerCatchupSleepIntervalMS * time.Millisecond)
        continue
    }

Then next will be non-nil and the send to the peer will never succeed. As a result, this goroutine will be trapped in an infinite loop and never released. My proposal is to check peer.Quit() and memR.Quit() on every iteration, regardless of whether next is nil.
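
The proposal can be sketched as a loop that polls both quit channels at the top of every iteration, so a failing send can no longer spin forever. `runBroadcast` and its parameters are hypothetical; the real routine also walks the mempool list, which is omitted here:

```go
package main

import (
	"fmt"
	"time"
)

// runBroadcast retries send until it succeeds or one of the quit channels is
// closed, and returns the number of send attempts made.
func runBroadcast(peerQuit, reactorQuit <-chan struct{}, send func() bool, backoff time.Duration) int {
	attempts := 0
	for {
		// Non-blocking quit check on every iteration, independent of
		// whether there is a transaction waiting to be sent.
		select {
		case <-peerQuit:
			return attempts
		case <-reactorQuit:
			return attempts
		default:
		}

		attempts++
		if send() {
			return attempts // the real routine would advance to the next tx
		}
		time.Sleep(backoff)
	}
}

func main() {
	peerQuit := make(chan struct{})
	calls := 0
	// The peer goes away during the third attempt; every send fails.
	send := func() bool {
		calls++
		if calls == 3 {
			close(peerQuit)
		}
		return false
	}
	fmt.Println(runBroadcast(peerQuit, make(chan struct{}), send, time.Millisecond)) // 3
}
```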

…ndermint#3473)

I think it's nice when the Client interface has all the methods. If someone does not need a particular method/set of methods, she can use individual interfaces (e.g. NetworkClient, MempoolClient) or write her own interface.

technically breaking

Fixes tendermint#3458

Why submit this PR: we suffered from an infinite-loop bug in addrbook, and it took us a long time to find out why the process had become a zombie peer. It has been fixed in tendermint#3232, but the ADDRS_LOOP is still there, so the risk of an infinite loop still exists. The algorithm that randomly picks a bucket is not stable: a peer may be unlucky and keep choosing the wrong bucket for a long time, wasting time and CPU. A simple improvement: shuffle bucketsNew and bucketsOld and pick the necessary number of addresses from them. This is a stable algorithm.
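
A minimal sketch of the shuffle-then-pick idea, assuming buckets are plain string slices and using a hypothetical `pickAddrs` helper. Each bucket is visited at most once in random order, so the work is bounded by the total number of addresses instead of depending on lucky draws:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickAddrs visits the buckets in a random order and collects up to n
// addresses. It always terminates, even when most buckets are empty.
func pickAddrs(buckets [][]string, n int, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	picked := make([]string, 0, n)
	for _, bi := range rng.Perm(len(buckets)) {
		for _, addr := range buckets[bi] {
			if len(picked) == n {
				return picked
			}
			picked = append(picked, addr)
		}
	}
	return picked
}

func main() {
	buckets := [][]string{{"a", "b"}, {}, {"c"}}
	fmt.Println(len(pickAddrs(buckets, 2, 1)))  // 2
	fmt.Println(len(pickAddrs(buckets, 10, 1))) // 3: fewer addresses than requested
}
```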

melekes · 2019-03-26T16:43:23Z

Is this still a WIP? Can we get this rebased against the latest develop?

kevlubkcm · 2019-03-26T17:05:19Z

yes, this is still relevant. gimme a bit to rebase against develop

kevlubkcm · 2019-03-26T17:09:35Z

oops. i messed up the merge. lemme submit a new PR

@odeke-em

…the JSON (backport tendermint#2774) (tendermint#2778) This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted for the fact that invoking NewBitArray(<=0) returns nil and hence when dereferenced would crash with a runtime nil pointer dereference. This bug was found by my security analysis and fuzzing too. Author: @odeke-em Fixes cometbft/cometbft#2658 --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code comments~~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request tendermint#2774 done by [Mergify](https://mergify.com). --------- Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

@odeke-em

…the JSON (backport tendermint#2774) (tendermint#2779) This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted for the fact that invoking NewBitArray(<=0) returns nil and hence when dereferenced would crash with a runtime nil pointer dereference. This bug was found by my security analysis and fuzzing too. Author: @odeke-em Fixes cometbft/cometbft#2658 --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code comments~~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request tendermint#2774 done by [Mergify](https://mergify.com). Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

@odeke-em

…the JSON (backport tendermint#2774) (tendermint#2780) This change fixes a bug in which BitArray.UnmarshalJSON hadn't accounted for the fact that invoking NewBitArray(<=0) returns nil and hence when dereferenced would crash with a runtime nil pointer dereference. This bug was found by my security analysis and fuzzing too. Author: @odeke-em Fixes cometbft/cometbft#2658 --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] ~~Updated relevant documentation (`docs/` or `spec/`) and code comments~~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request tendermint#2774 done by [Mergify](https://mergify.com). --------- Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev>
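
The nil-pointer bug described in the backport messages above can be illustrated with a simplified model. `bitArray`, `newBitArray`, and `fromString` are hypothetical stand-ins (the real type decodes JSON, while this sketch parses a plain `x`/`_` string), but the guard has the same shape: when the constructor returns nil for a non-positive size, reset the receiver instead of dereferencing the result:

```go
package main

import "fmt"

// bitArray is a hypothetical, reduced version of a bit array.
type bitArray struct {
	bits  int
	elems []uint64
}

// newBitArray returns nil for non-positive sizes, as described above.
func newBitArray(bits int) *bitArray {
	if bits <= 0 {
		return nil
	}
	return &bitArray{bits: bits, elems: make([]uint64, (bits+63)/64)}
}

// fromString fills b from a string of 'x' (set) and '_' (unset) characters.
func (b *bitArray) fromString(s string) {
	nb := newBitArray(len(s))
	if nb == nil {
		// Guard: an empty input yields an empty array, not a nil dereference.
		b.bits, b.elems = 0, nil
		return
	}
	for i := 0; i < len(s); i++ {
		if s[i] == 'x' {
			nb.elems[i/64] |= 1 << uint(i%64)
		}
	}
	*b = *nb
}

func main() {
	var b bitArray
	b.fromString("") // would crash without the nil guard
	fmt.Println(b.bits, len(b.elems))
	b.fromString("x_x")
	fmt.Println(b.bits, b.elems[0])
}
```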

kevlubkcm requested review from ebuchman, melekes and xla as code owners November 7, 2018 14:38

kevlubkcm changed the title ~~put events on the same channel to pause consensus at the desired point~~ [R4R] put events on the same channel to pause consensus at the desired point Nov 7, 2018

kevlubkcm changed the title ~~[R4R] put events on the same channel to pause consensus at the desired point~~ [R4R] put events on the same channel to pause consensus at the desired point in test Nov 7, 2018

kevlubkcm changed the title ~~[R4R] put events on the same channel to pause consensus at the desired point in test~~ [WIP] fix non-determinism in consensus/state_test.go Nov 7, 2018

kevlubkcm mentioned this pull request Nov 7, 2018

consensus: race condition in test #846

Closed

leo-xinwang and others added 21 commits December 7, 2018 12:30

return an error if validator set is empty in genesis file and after I…

2f64717

…nitChain (tendermint#2971) Fixes tendermint#2951

docs: relative links in docs/spec/readme.md, js-amino lib (tendermint…

68b4678

…#2977) Co-Authored-By: zramsay <zach.ramsay@gmail.com>

turn off strict routability every time (tendermint#2983)

41eaf0e

previously, we're turning it off only when --populate-persistent-peers flag was used, which is obviously incorrect. Fixes cosmos/cosmos-sdk#2983

Make mempool fail txs with negative gas wanted (tendermint#2994)

d5d0d2b

This is only one part of tendermint#2989. We also need to fix the application, and add rules to consensus to ensure this.

Make testing logger that doesn't write to stdout (tendermint#2997)

df32ea4

add UnconfirmedTxs/NumUnconfirmedTxs methods to HTTP/Local clients (t…

2594cec

…endermint#2964)

docs: fixes from 'first time' review (tendermint#2999)

8003786

docs: enable full-text search (tendermint#3004)

9e075d8

mempool: add a comment and missing changelog entry (tendermint#2996)

bc2a9b2

Refs tendermint#2994

circleci: add a job to automatically update docs (tendermint#3005)

f7e463f

docs: add edit on Github links (tendermint#3014)

3fbe9f2

docs: update DOCS_README (tendermint#3019)

f5cca9f

Co-Authored-By: zramsay <zach.ramsay@gmail.com>

tendermint#2980 fix cors doc (tendermint#3013)

a75dab4

docs: networks/docker-compose: small fixes (tendermint#3017)

b53a271

Bucky/v0.27.1 (tendermint#3022)

e4806f9

* update changelog * linkify * changelog and version

Merge pull request tendermint#3023 from tendermint/release/v0.27.1

1f09818

Release/v0.27.1

Merge pull request tendermint#3026 from tendermint/master

9fa9596

Merge pull request tendermint#3023 from tendermint/release/v0.27.1

liamsi and others added 20 commits March 19, 2019 12:19

Add tendermint#3421 to changelog and reorder alphabetically

8e62a3d

remove 3421 from changelog

e276f35

Merge pull request tendermint#3449 from tendermint/ismail/merge_devel…

5f68fba

…op_into_release/0.31.0 Merge develop into release/0.31.0

Merge pull request tendermint#3417 from tendermint/release/v0.31.0

0d985ed

Release/v0.31.0

Merge pull request tendermint#3450 from tendermint/master

22bcfca

Merge master back to develop

crypto: delete unused code (tendermint#3426)

60b2ae5

rpc: client disable compression (tendermint#3430)

03085c2

blockchain: update the maxHeight when a peer is removed (tendermint#3350

926127c

) * blockchain: update the maxHeight when a peer is removed Refs tendermint#2699 * add a changelog entry * make linter pass

comments on validator ordering (tendermint#3452)

81b9bdf

* comments on validator ordering * NextValidatorsHash

fix comment (tendermint#3454)

660bd4a

replace PB2TM.ConsensusParams with a call to params#Update (tendermin…

1d4afb1

…t#3448) Fixes tendermint#3444

rpc: support tls rpc (tendermint#3469)

25a3c8b

Refs tendermint#3419

kevlubkcm added 2 commits March 26, 2019 17:06

put events on the same channel to pause consensus at the desired point

9128e6f

Merge branch 'haltConsensusForTest' of github.com:kevlubkcm/tendermin…

a3ced8f

…t into haltConsensusForTest

kevlubkcm requested a review from zramsay as a code owner March 26, 2019 17:07

kevlubkcm closed this Mar 26, 2019

kevlubkcm wants to merge 343 commits into tendermint:develop from kevlubkcm:haltConsensusForTest


20 participants
