
Optimizing blockchain reactor. #1805

Merged: ebuchman merged 4 commits into develop from jae/optimize_blockchain on Jul 24, 2018

Conversation

@jaekwon (Contributor) commented Jun 23, 2018

With this change and cosmos/iavl#65 and tmlibs reverse iterators, the bottleneck I think is the speed of signature validation. Before optimizing that it might make sense to consider a different kind of blockchain syncing strategy, and before that, straight up state syncing.

We should implement state syncing soon after launch. :)

We're still limited to around 100 blocks/sec on the testnet, but state syncing won't be too difficult to implement given RangeProof.

@jaekwon jaekwon requested review from ebuchman, melekes and xla as code owners June 23, 2018 05:01
@melekes (Contributor) left a comment

🥇 🏎 🍰

```go
			continue FOR_LOOP
		} else {
			// Try again quickly next loop.
			didProcessCh <- struct{}{}
```
Contributor: This won't lead to a deadlock, right? Otherwise, an optional write (with a `default` case) would be needed here.

Contributor: Looks like it shouldn't, since we read an element off the channel to start this block and nothing else writes to the channel except a mutually exclusive block of code. So this should be fine, but it's not immediately obvious.

state/store.go (outdated):

```diff
 // Responses are indexed by height so they can also be loaded later to produce Merkle proofs.
 func saveABCIResponses(db dbm.DB, height int64, abciResponses *ABCIResponses) {
-	db.SetSync(calcABCIResponsesKey(height), abciResponses.Bytes())
+	db.Set(calcABCIResponsesKey(height), abciResponses.Bytes())
```
Contributor: We need to flush these to disk in case we crash before we get to call SaveState.

Contributor: Though I would have thought the persistence tests would have caught this if it was wrong. Hmph.

@jaekwon (Contributor, Author): You're right, we should flush these, because otherwise we'll commit to the app and possibly crash, losing the responses.

I think it's difficult to catch this because it isn't sufficient to fail after saveABCIResponses(); we also need to call updateState + blockExec.Commit and then fail (exit) before leveldb finalizes...

I think we can simulate this by implementing a special wrapper around DB that delays .Set() for many seconds until maybe the next *Sync call. I'll make an issue of it.

@ebuchman (Contributor):

@ValarDragon has indicated that we can cut signature verification time in half by using batch verification.

@milosevic recently wrote a spec for this reactor here: https://github.com/tendermint/tendermint/blob/develop/docs/spec/reactors/block_sync/reactor.md

Also note some previously identified issues in #766

@zmanian (Contributor) commented Jun 24, 2018

I have a somewhat fundamental question here.

Is it optimal that we need information from 2 blocks to verify signatures?

It seems like we could include all the information needed to verify signatures in each block, do signature verification in pool.AddBlock, and only check for consistency of data between block n and block n+1 in the reactor.

This would allow for massive parallelism on signature verification and immediately punish peers who send us bad sigs.

@zmanian zmanian added the C:sync Component: Fast Sync, State Sync label Jun 24, 2018
@ebuchman (Contributor):

> It seems like we could include all the information needed to verify signatures in each block,

This would come at the cost of slowing down consensus and would be a non-trivial change. Currently we consider a commit for block H seen when we see the +2/3 precommits for block H, but the commit only becomes canonical and included in the blockchain when it's included as the LastCommit of the next block, H+1.

@zmanian (Contributor) commented Jun 25, 2018

How about this: when we receive block n in the blockpool, we check whether block n-1 is in the pool; if it is, we do signature verification on n-1 right away, set a flag on the block recording that sig verification passed, and let the blockchain reactor do only the remaining verification work.

@milosevic (Contributor):

@jaekwon You might want to take a look at this issue: #1734. The requester choice strategy seems suboptimal, which might lead to poor utilisation of peer resources during the fast-sync protocol. Furthermore, there are some design choices in the blockchain reactor that might also negatively affect performance:

  1. During processing of bcBlockRequestMessage, which is executed in the p2p receive routine, we read the block from disk and block the p2p receive routine for that time.
  2. We lock a global mutex when handling several kinds of requests, potentially creating contention on this lock and also potentially blocking the p2p receive routine.

Before optimising the blockchain reactor, it might make sense to first profile its current performance so we understand the bottlenecks and can measure the impact of potential modifications.
Hopefully the spec https://github.com/tendermint/tendermint/blob/develop/docs/spec/reactors/block_sync/reactor.md can be useful here as a high-level design document.


```diff
 // TimeFormat is used for generating the sigs
-const TimeFormat = "2006-01-02T15:04:05.000Z"
+const TimeFormat = time.RFC3339Nano
```
Contributor: Is there some reason for this? I believe this makes the PR a breaking change, while it would otherwise not be.

@jaekwon (Contributor, Author): Consistency. We should use Nano everywhere.


```diff
-const capacity = 1000 // must be bigger than peers count
-requestsCh := make(chan BlockRequest, capacity)
+requestsCh := make(chan BlockRequest, maxTotalRequesters)
```
Contributor: What is the benefit of this change?

@jaekwon (Contributor, Author): It's just a reasonable constant that is already defined, vs. having a magic 1000 in the middle of a function declaration.

@milosevic (Contributor):

The suggested change touches several issues, but I am not sure we properly address them with this PR.

  1. Verifying committed blocks is a time-consuming task, and as @jaekwon pointed out in this PR, the logic for verifying blocks, executing them, and saving them to disk should probably be moved into a separate routine. Checking blocks could perhaps be triggered upon new block reception, instead of being time based. There is probably a benefit (as mentioned in this PR) in batching writes to disk and batching block execution against the app.
  2. Sending requests for blocks is, in the current version, potentially blocked by other work in the poolRoutine() function. It is not clear to me what the drawback would be of sending requests directly from the Requester task, and also ensuring that the request is actually sent. In the current version we use trySend, so if the buffer is full, the request will not be sent.

As there are also several other concerns related to the blockchain reactor, I would suggest following the new process: first come up with an ADR with the suggested modification, then follow up with the implementation. I am fine with merging this PR, as it makes more sense than the previous version, but we will probably follow up with an ADR in one of the next iterations. Maybe it would make sense to clean up this PR before merging, as not all the suggested changes are related to blockchain reactor optimisation.

```diff
 const (
-	requestIntervalMS  = 100
+	requestIntervalMS  = 2
 	maxTotalRequesters = 1000
```
Contributor: This will effectively mean no sleep in requestRoutine.

@jaekwon (Contributor, Author): That's OK; it just needs to prevent a hot loop.

Revert to SetSync for saveABCIResponses() as per Ethan's feedback
@codecov-io commented Jul 20, 2018

Codecov Report

Merging #1805 into develop will decrease coverage by 0.15%.
The diff coverage is 11.62%.

```
@@             Coverage Diff             @@
##           develop    #1805      +/-   ##
===========================================
- Coverage     61.5%   61.34%   -0.16%     
===========================================
  Files          197      197              
  Lines        15670    15684      +14     
===========================================
- Hits          9638     9622      -16     
- Misses        5215     5247      +32     
+ Partials       817      815       -2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| blockchain/pool.go | 66.43% <0%> (-3.15%) ⬇️ |
| state/store.go | 64.17% <100%> (ø) ⬆️ |
| blockchain/reactor.go | 40.43% <8.1%> (-0.58%) ⬇️ |
| p2p/pex/errors.go | 25% <0%> (-25%) ⬇️ |
| rpc/grpc/api.go | 81.81% <0%> (-7.08%) ⬇️ |
| p2p/pex/pex_reactor.go | 72.81% <0%> (-0.68%) ⬇️ |
| p2p/pex/addrbook.go | 69.36% <0%> (-0.5%) ⬇️ |
| p2p/peer.go | 60.22% <0%> (ø) ⬆️ |
| rpc/core/net.go | 0% <0%> (ø) ⬆️ |
| ... and 2 more | |

@jaekwon (Contributor, Author) commented Jul 20, 2018

> @zmanian: How about this: when we receive block n in the blockpool, we check whether block n-1 is in the pool; if it is, we do signature verification on n-1 right away, set a flag on the block recording that sig verification passed, and let the blockchain reactor do only the remaining verification work.

That seems like a great idea. In the interest of time, let's work on this (or maybe other designs) after launch.

> @milosevic: During processing of bcBlockRequestMessage, which is executed in the p2p receive routine, we read the block from disk and block the p2p receive routine for that time.

Great point; that probably makes a significant difference, though I don't know how much.

> @milosevic: We lock a global mutex when handling several kinds of requests, potentially creating contention on this lock and also potentially blocking the p2p receive routine.

There's definitely lock contention, though it isn't clear how bad it is. For this and the above point, I agree we should profile and get a clear picture of current performance and document the procedure, as it will help us understand how to optimize code elsewhere as well.

I think for now we can merge this PR, and make another issue to address the great ideas in the comments here.

@jaekwon (Contributor, Author) commented Jul 20, 2018

Also, so I don't forget, another idea for optimization is to use the LastBlockHash to verify a chain of blocks without having to verify signatures for every block. This idea shouldn't be implemented lightly, as it might have consequences for PoS security, possibly DDoS attack vectors (depending on how it's implemented), peer selection strategy (it might make more sense to request a contiguous range from a single peer), etc.

@ebuchman ebuchman merged commit b92860b into develop Jul 24, 2018
@ebuchman ebuchman deleted the jae/optimize_blockchain branch July 24, 2018 02:46

Labels

C:sync Component: Fast Sync, State Sync

7 participants