Conversation
Should be paired with cosmos/iavl#65.
```go
	continue FOR_LOOP
} else {
	// Try again quickly next loop.
	didProcessCh <- struct{}{}
```
This won't lead to deadlock, right? Otherwise, an optional write (with a default case) would be needed here.
Looks like it shouldn't, since we read an element off the channel to start this block, and nothing else writes to the channel except a mutually exclusive block of code. So this should be fine, but it's not immediately obvious.
state/store.go
```diff
 // Responses are indexed by height so they can also be loaded later to produce Merkle proofs.
 func saveABCIResponses(db dbm.DB, height int64, abciResponses *ABCIResponses) {
-	db.SetSync(calcABCIResponsesKey(height), abciResponses.Bytes())
+	db.Set(calcABCIResponsesKey(height), abciResponses.Bytes())
```
We need to flush these to disk in case we crash before we get to call SaveState.
Though I would have thought the persistence tests would have caught this if it was wrong. Hmph.
You're right, we should flush these, because otherwise we'll commit to the app and possibly crash, losing the responses.
I think it's difficult to catch this, because it isn't sufficient to fail after saveABCIResponses(); we also need to call updateState+blockExec.Commit and then fail (exit) before leveldb finalizes...
I think we can simulate this by implementing a special wrapper around DB that delays .Set() for many seconds until maybe the next *Sync call. I'll make an issue of it.
@ValarDragon has indicated that we can cut signature verification time in half by using batch verification. @milosevic recently wrote a spec for this reactor here: https://github.com/tendermint/tendermint/blob/develop/docs/spec/reactors/block_sync/reactor.md
Also note some previously identified issues in #766.
I have a somewhat fundamental question here. Is it optimal that we need information from 2 blocks to verify signatures? It seems like we could include all the information needed to verify signatures in each block and do signature verification on each block independently. This would allow for massive parallelism on signature verification and let us immediately punish peers who send us bad sigs.
This would come at the cost of slowing down consensus and would be a non-trivial change. Currently we consider a commit for block H as seen when we see the +2/3 precommits for block H, but the commit only becomes canonical and included in the blockchain when it's included as the LastCommit of the next block, H+1.
How about this: when we receive block n, we check in the blockpool whether n-1 is also there; if it is, we do signature verification on n-1 right away and set a flag on the block indicating that sig verification passed, so the blockchain reactor only does the remaining verification work.
@jaekwon You might want to take a look at this issue: #1734. The requester choice strategy seems suboptimal; it might lead to poor utilisation of peer resources during the fast sync protocol. Furthermore, there are some design choices in the blockchain reactor that might also negatively affect performance:
Before optimising the blockchain reactor, maybe it makes sense to first profile its current performance, so we understand the bottlenecks and are able to measure the impact of potential modifications.
```diff
 // TimeFormat is used for generating the sigs
-const TimeFormat = "2006-01-02T15:04:05.000Z"
+const TimeFormat = time.RFC3339Nano
```
Is there some reason for this? I believe this makes the PR a breaking change, while it would otherwise not be.
Consistency. We should use Nano everywhere.
```diff
 const capacity = 1000 // must be bigger than peers count
-requestsCh := make(chan BlockRequest, capacity)
+requestsCh := make(chan BlockRequest, maxTotalRequesters)
```
What is a benefit of this change?
It's just a reasonable constant that is already defined, vs. having a magic 1000 in the middle of a function declaration.
The suggested change brings several issues, but I am not sure that we properly address them with this PR.
As there are also several other concerns related to the blockchain reactor, I would suggest following the new process: first come up with an ADR with a suggested modification, then follow up with the implementation. I am fine with merging this PR, as it makes more sense than the previous version, but we will probably follow up with an ADR in one of the next iterations. Maybe it would make sense to clean up this PR before merge, as not all suggested changes are related to blockchain reactor optimisation.
```diff
 const (
-	requestIntervalMS  = 100
 	maxTotalRequesters = 1000
+	requestIntervalMS  = 2
```
This will effectively mean no sleep in requestRoutine.
That's OK, it just needs to prevent a hot loop.
Revert to SetSync for saveABCIResponses() as per Ethan's feedback
Codecov Report
```
@@            Coverage Diff             @@
##           develop    #1805    +/-   ##
==========================================
- Coverage     61.5%    61.34%   -0.16%
==========================================
  Files          197       197
  Lines        15670     15684      +14
==========================================
- Hits          9638      9622      -16
- Misses        5215      5247      +32
+ Partials       817       815       -2
```
That seems like a great idea. In the interest of time, let's work on this (or maybe other designs) after launch.
Great point, that probably makes a significant difference, though I don't know how much.
There's definitely lock contention, though it isn't clear how bad it is. For this and the above point, I agree we should profile to get a clear picture of current performance and document the procedure, as it will help us understand how to optimize code elsewhere as well. I think for now we can merge this PR and open another issue to address the great ideas in the comments here.
Also, so I don't forget: another idea for optimization is to use the LastBlockHash to verify a chain of blocks without having to verify signatures for every block. This idea shouldn't be implemented lightly, as it might have consequences for PoS security, possibly DDoS attack vectors (depending on how it's implemented), peer selection strategy (it might make more sense to request a contiguous range from a single peer), etc.
1031730 to 54d753e
With this change, cosmos/iavl#65, and tmlibs reverse iterators, I think the bottleneck is the speed of signature validation. Before optimizing that, it might make sense to consider a different kind of blockchain syncing strategy, and before that, straight-up state syncing.
We should implement state syncing soon after launch. :)
We're still limited to around 100 blocks/sec on the testnet, but state syncing won't be too difficult to implement given RangeProof.