[blockchain] v2-riri Scheduler implementation #3821
brapse wants to merge 70 commits into tendermint:master from
Conversation
Added block requests under peer. Moved the request trigger in the reactor poolRoutine; it is now triggered by a ticker. Moved everything required for making block requests smarter into the poolRoutine. Added a simple map of heights to keep track of what will need to be requested next. Added a few more tests.
Send errors (RemovePeer) from the switch on a different channel than the one receiving blocks. Renamed channels and added more pool tests.
If we tried to send a request to a peer not present in the switch, a missing continue statement caused the request to be blackholed on a removed peer and never retried. While this bug was manifesting, the reactor kept asking for other blocks, which were stored and never consumed. Added the number of unconsumed blocks to the math for requesting blocks ahead of the current processing height, so eventually no more blocks are requested until the already received ones are consumed.
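The fix described above can be sketched as a small budget calculation. This is a hypothetical illustration, not the PR's actual code: the function name `numRequestsToSend` and its parameters are assumptions made for the example.

```go
package main

import "fmt"

// numRequestsToSend caps new block requests by subtracting both the
// requests already in flight and the blocks received but not yet
// consumed, so the backlog must drain before more requests go out.
// (Illustrative sketch; names and signature are assumed.)
func numRequestsToSend(maxPending, outstanding, unconsumed int) int {
	n := maxPending - outstanding - unconsumed
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(numRequestsToSend(10, 3, 2)) // room for 5 more requests
	fmt.Println(numRequestsToSend(10, 4, 8)) // 0: wait until blocks are consumed
}
```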
```go
	return Skip{}
}

type scPrunePeerEv struct {
```
U1000: type scPrunePeerEv is unused (from unused)
```go
	reason error
}

func (sdr *Scheduler) handleBlockProcessError(peerID p2p.ID, height int64) Event {
```
U1000: func (*Scheduler).handleBlockProcessError is unused (from unused)
```go
	return Skip{}
}

type scSchedulerFailure struct {
```
U1000: type scSchedulerFailure is unused (from unused)
```go
	reason error
}

func (sdr *Scheduler) handleTimeCheck(now time.Time) Events {
```
U1000: func (*Scheduler).handleTimeCheck is unused (from unused)
ancazamfir left a comment:
Some comments on scheduler.go. Will do the tests later.
```go
	return &sc
}

func (sc *schedule) addPeer(peerID p2p.ID) error {
```
Should addPeer have a height parameter? What triggers a peer addition the way it is right now? In the older version it is the first StatusResponseMessage that creates a new peer with a height.

This would most likely be passed an event from reactor.AddPeer.
```go
func (sc *schedule) addPeer(peerID p2p.ID) error {
	if _, ok := sc.peers[peerID]; ok {
		return fmt.Errorf("Cannot add duplicate peer %s", peerID)
```
Should a check for removed state be added here? What happens if a peer was removed and later re-added?

That's a good question! What does @milosevic think?
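One way the removed-state check discussed above could look is sketched below. This is only an illustration of the open question, not the PR's resolution; `p2p.ID` is replaced by `string` and the `peerState` values mirror those mentioned elsewhere in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

type peerState int

const (
	peerStateNew peerState = iota
	peerStateReady
	peerStateRemoved
)

type peer struct{ state peerState }

type schedule struct{ peers map[string]*peer }

// addPeer rejects both duplicates and previously removed peers, so a
// removed peer cannot be silently resurrected. (Sketch of the review
// suggestion; the real scheduler uses p2p.ID, not string.)
func (sc *schedule) addPeer(peerID string) error {
	if p, ok := sc.peers[peerID]; ok {
		if p.state == peerStateRemoved {
			return errors.New("cannot re-add removed peer " + peerID)
		}
		return fmt.Errorf("cannot add duplicate peer %s", peerID)
	}
	sc.peers[peerID] = &peer{state: peerStateNew}
	return nil
}

func main() {
	sc := &schedule{peers: map[string]*peer{"a": {state: peerStateRemoved}}}
	fmt.Println(sc.addPeer("a")) // error: removed peers stay removed
	fmt.Println(sc.addPeer("b")) // nil: fresh peer added as new
}
```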
```go
	peer.height = height
	peer.state = peerStateReady
	for i := sc.minHeight(); i <= height; i++ {
```
Assume we start at height 1 and we have a peer that declares a very high height. This loop may take a long time and also add a lot of entries to blockStates.

Yes, on the surface this seems pretty inefficient. The goal is to ensure correctness first and then look at performance once the spec is well defined.

It's not only a performance factor but also a safety/security concern. If a "bad" node sends a huge height, setPeerHeight() will hog the CPU potentially forever. Try for example TestPeerHeight() with a height of 2^40 and observe the blockchain.test process CPU and memory usage.
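One mitigation for the adversarial-height concern above is to bound how far blockStates is extended past the current height, regardless of what the peer advertises. This is a sketch under assumed names: `maxLookahead` and this simplified `schedule` are illustrative, not the PR's code.

```go
package main

import "fmt"

const maxLookahead = 500 // assumed sync window size

type blockState int

const blockStateNew blockState = 1

type schedule struct {
	height      int64 // current processing height
	blockStates map[int64]blockState
}

// setPeerHeight pre-creates blockStates entries only within a bounded
// window above the current height, so a peer claiming a huge height
// (e.g. 2^40) cannot trigger an unbounded loop or memory blowup.
func (sc *schedule) setPeerHeight(peerHeight int64) {
	top := sc.height + maxLookahead
	if peerHeight < top {
		top = peerHeight
	}
	for i := sc.height; i <= top; i++ {
		if _, ok := sc.blockStates[i]; !ok {
			sc.blockStates[i] = blockStateNew
		}
	}
}

func main() {
	sc := &schedule{height: 1, blockStates: map[int64]blockState{}}
	sc.setPeerHeight(1 << 40) // huge claimed height stays cheap
	fmt.Println(len(sc.blockStates)) // bounded by the lookahead window
}
```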
```go
}

func (sc *schedule) allBlocksProcessed() bool {
	for _, state := range sc.blockStates {
```
I assume this is called every time a block is processed. The range loop is expensive given the number of entries in the blockStates map.

This would be called at some interval. Same comment as above: let's make sure it's correct first, then profile to figure out where best to optimize.
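If profiling does flag this loop, one possible optimization is to maintain a counter incrementally so the check becomes O(1) instead of ranging over the whole map. A minimal sketch, with illustrative names that are not the PR's API:

```go
package main

import "fmt"

// schedule tracks the number of unprocessed blocks incrementally,
// so allBlocksProcessed is a constant-time comparison rather than a
// range over blockStates. (Hypothetical sketch of the optimization.)
type schedule struct {
	unprocessed int // blocks added but not yet processed
}

func (sc *schedule) addBlock()      { sc.unprocessed++ }
func (sc *schedule) markProcessed() { sc.unprocessed-- }

func (sc *schedule) allBlocksProcessed() bool {
	return sc.unprocessed == 0
}

func main() {
	sc := &schedule{}
	sc.addBlock()
	sc.addBlock()
	sc.markProcessed()
	fmt.Println(sc.allBlocksProcessed()) // false: one block outstanding
	sc.markProcessed()
	fmt.Println(sc.allBlocksProcessed()) // true: everything processed
}
```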
```go
	return schedulerErrorEv{peerID, err}
}

	return scBlockReceivedEv{peerID}
```
Who receives this event, and wouldn't the height be needed?

Good catch! Yes, this will be sent to the processor and will require the height, blockData, etc.
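The richer event discussed above might look something like the following. The field names beyond `peerID` are assumptions based on the reply ("the height, blockData, etc."), not the final API:

```go
package main

import "fmt"

// scBlockReceivedEv carries enough context for the processor to act on
// the received block: the peer it came from, the height, and a stand-in
// for the block payload. (Illustrative sketch; fields are assumed.)
type scBlockReceivedEv struct {
	peerID    string
	height    int64
	blockData []byte
}

func main() {
	ev := scBlockReceivedEv{peerID: "p1", height: 42, blockData: []byte{0x01}}
	fmt.Printf("received block at height %d from %s (%d bytes)\n",
		ev.height, ev.peerID, len(ev.blockData))
}
```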
This PR was replaced by #3848
fork contains updated dependencies
The blockchain reactor is responsible for fetching and applying blocks from peers in order to catch up a node. #3777 specifies a new architecture which includes a scheduler component.

This PR is one such scheduler, modeled as a block- and peer-wise finite state machine. Blocks transition from:

unknown -> new -> pending -> received -> processed

and peers transition from:

new -> ready -> removed

based on synchronous methods.
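The block transitions above can be sketched as a simple table-driven state machine. This is an illustration of the described FSM, not the PR's implementation; the `advance` helper and string-backed states are assumptions for the example.

```go
package main

import (
	"errors"
	"fmt"
)

type blockState string

const (
	blockStateUnknown   blockState = "unknown"
	blockStateNew       blockState = "new"
	blockStatePending   blockState = "pending"
	blockStateReceived  blockState = "received"
	blockStateProcessed blockState = "processed"
)

// next encodes the linear block lifecycle from the PR description:
// unknown -> new -> pending -> received -> processed.
var next = map[blockState]blockState{
	blockStateUnknown:  blockStateNew,
	blockStateNew:      blockStatePending,
	blockStatePending:  blockStateReceived,
	blockStateReceived: blockStateProcessed,
}

// advance performs one synchronous transition, failing on any state
// with no successor (i.e. processed is terminal).
func advance(s blockState) (blockState, error) {
	n, ok := next[s]
	if !ok {
		return s, errors.New("no transition from " + string(s))
	}
	return n, nil
}

func main() {
	s := blockStateUnknown
	for {
		n, err := advance(s)
		if err != nil {
			break // processed is terminal
		}
		s = n
		fmt.Println(s)
	}
}
```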