raft: introduce term cache #142239
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
RDFR @pav-kv
@hakuuww I've pushed all the changes from our pair programming / review session to the last commit. I'm happy with the PR as is, LGTM. Before merging, could you please clean up the commit messages of 4311190 and 471f433 to be conventional? Would be good to have a second pair of eyes on this PR before merge. @tbg If you have a moment, could you skim through it? I've reviewed the tests extensively, there is 100% coverage, so interesting stuff is
@tbg @iskettaneh I will write a more detailed PR message. Feel free to hold off on reviewing it until then. Done |
tbg
left a comment
Reviewed 1 of 7 files at r2, 5 of 5 files at r18, 1 of 1 files at r19, 1 of 1 files at r20, 2 of 2 files at r21, 3 of 3 files at r23, 2 of 2 files at r24, 1 of 1 files at r25, all commit messages.
TYFTR! bors r=pav-kv,tbg
Build succeeded.
Previously, we avoided calling term(index) for the leader commit term check by keeping the first entry index of the current leader term in memory and comparing against that index (see cockroachdb#137826). Now that the term cache is implemented (it stores, among other things, the first entryID of the current leader term), it is used for that check, reducing the complexity of our code (see cockroachdb#142239). Part of cockroachdb#136296 Fixes: cockroachdb#143362 Epic: None Release note: None
This PR introduces a new sub data structure `termCache` to `raftLog`, which stores a suffix of the `raftLog`'s entryIDs in a compressed representation and helps get the term of a particular raft entry. `termCache` integrates tightly with `raftLog`, which means `termCache` relies on many assumptions guaranteed by `raftLog`, allowing a concise implementation.

First, here is an example of what a raftLog may look like:
entryID: term/index

`[t5/10, t5/11, t5/12, t5/13, t6/14, t6/15, t6/16, t6/17, t6/18, t6/19, t7/20, t7/21, t7/22, t7/23, t7/24, t7/25, t7/26, t7/27, t8/28, t8/29, t8/30, t8/31, t10/32, t10/33, t10/34]`

Properties of a raftLog:

- entryID.index values are strictly increasing and contiguous.
- entryID.term values are monotonically increasing, and can have gaps in between.
These properties allow us to use term change points to express a long, contiguous raftLog.
The above example raftLog can be expressed as the following in our `termCache` representation (each entry is a term change point):

`[t5/10, t6/14, t7/20, t8/28, t10/32]`

In practice, a raftLog may be hundreds or thousands of entries long, but with only a few term changes in between, so this compressed representation lets us represent a long raftLog's entryIDs cheaply.
Also in practice, there should be at most 2 term changes in the raftLog in happy scenarios; anything larger is very rare. This is because compaction (applying raft entry commands to the state machine) happens quite quickly.
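A minimal sketch of the idea above, with hypothetical names (`entryID`, `termCache`, `compress`; the real types in `pkg/raft` differ): the cache keeps only the first entry of each term, and a term lookup is a binary search over those change points.

```go
package main

import (
	"fmt"
	"sort"
)

// entryID identifies a raft log entry by (term, index).
type entryID struct {
	term, index uint64
}

// compress reduces a full log of entryIDs to its term change points:
// the first entry of each term.
func compress(log []entryID) []entryID {
	var points []entryID
	for _, e := range log {
		if len(points) == 0 || e.term != points[len(points)-1].term {
			points = append(points, e)
		}
	}
	return points
}

// termCache stores the change points plus the last covered index.
type termCache struct {
	points []entryID // sorted by index; first entry of each term
	last   uint64    // last log index covered by the cache
}

// term returns the term of the entry at index, or false if index is
// outside the cached range (the real code then falls back to the raft
// entry cache or storage).
func (c *termCache) term(index uint64) (uint64, bool) {
	if len(c.points) == 0 || index < c.points[0].index || index > c.last {
		return 0, false
	}
	// Find the last change point at or before index.
	i := sort.Search(len(c.points), func(i int) bool {
		return c.points[i].index > index
	}) - 1
	return c.points[i].term, true
}

func main() {
	// Rebuild the example log from the description: terms flip at
	// indexes 14, 20, 28, and 32.
	var log []entryID
	for i := uint64(10); i <= 34; i++ {
		t := uint64(5)
		switch {
		case i >= 32:
			t = 10
		case i >= 28:
			t = 8
		case i >= 20:
			t = 7
		case i >= 14:
			t = 6
		}
		log = append(log, entryID{term: t, index: i})
	}
	c := termCache{points: compress(log), last: 34}
	fmt.Println(c.points) // the five change points from the example
	t, ok := c.term(27)
	fmt.Println(t, ok) // entry 27 belongs to term 7
}
```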
Implementation details:

- The benefit of having the term cache cover the whole raftLog is that any probing will take at most 1 rtt (see below on why this is possible).

One immediate benefit is that there should no longer be any raft entry cache accesses or pebble calls when we want to know the term of a storage-persisted entry (assuming term flips are rare; we can still have a pebble access if we want the term of a very early entry that is more than `termCacheSize` terms old). This helps avoid:

- raft entry cache accesses for term lookups
- pebble reads for term lookups
Currently, neither of the above scenarios incurs too much cost, because we did a lot of tuning work on the raft entry cache in the past to make the cache miss rate quite low (1-2% misses) (see #98666, #107424).
Following this change, we can also re-investigate the sizing needed for the raft entry cache. Currently it is allocated 1/256 of total system/cgroup memory, shared evenly between all stores. The amount of memory allocated to the raft entry cache could perhaps be tuned down if deemed necessary.
A second benefit: since we now keep a compressed representation of a suffix of the raftLog, we can use it to carry more information in the process of the raft leader probing a follower. Currently, a MsgAppResp{reject=true} from the follower carries only a single hintIndex and hintTerm. With the term cache, we can include more information about the follower's raftLog in its MsgAppResp with relatively low overhead, which can be used to reduce the rtts involved in the leader/follower probing process. Assuming we keep a few (say 4) term change points in the `termCache`, we can attach all 4 of those data points to our raft RPC messages, which should be enough to cover the whole raftLog of a raft node.
The term cache covers entryIDs in the following range:

`[raftLog.first, raftLog.last]`

or something like:

`[entryID at committed index, raftLog.lastIndex]`

(In the real implementation we also need to attach a lastIndex, which the term cache doesn't keep, but which is kept in unstable/raftLog.)
When receiving this `termCache` information from a MsgAppResp{reject=true} or MsgVoteResp, the leader can immediately know the accurate fork point for the next MsgApp, instead of doing a few probing rtts to find it. (Our current probing algorithm may take 2-3 rtts between leader and follower to find a fork point in a bad raft case involving multiple leadership changes and partitions.)
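A hedged sketch of how a leader could use a follower's attached change points to find the fork point in one step (names like `forkPoint` and `termAt` are hypothetical; the actual probing logic in `pkg/raft` differs): since each log's terms are piecewise constant between change points, the leader scans down from the smaller of the two last indexes to the highest index where both terms agree.

```go
package main

import "fmt"

// entryID identifies a raft log entry by (term, index).
type entryID struct {
	term, index uint64
}

// termAt returns the term at index given term change points (the first
// entry of each term) and the last covered index; 0 means uncovered.
func termAt(points []entryID, last, index uint64) uint64 {
	if len(points) == 0 || index < points[0].index || index > last {
		return 0
	}
	t := points[0].term
	for _, p := range points {
		if p.index > index {
			break
		}
		t = p.term
	}
	return t
}

// forkPoint returns the highest index at which the leader's and the
// follower's logs are known to agree. Linear scan for clarity; a real
// implementation would jump between change-point boundaries instead.
func forkPoint(leader []entryID, leaderLast uint64, follower []entryID, followerLast uint64) (uint64, bool) {
	idx := min(leaderLast, followerLast)
	for ; idx >= leader[0].index && idx >= follower[0].index; idx-- {
		lt := termAt(leader, leaderLast, idx)
		ft := termAt(follower, followerLast, idx)
		if lt != 0 && lt == ft {
			return idx, true
		}
	}
	return 0, false
}

func main() {
	// Leader:   t5 from 10, t6 from 14, t7 from 20, up to index 25.
	// Follower: t5 from 10, t6 from 14, t8 from 16, up to index 18.
	// The logs diverge at index 16, so they last agree at index 15.
	leader := []entryID{{5, 10}, {6, 14}, {7, 20}}
	follower := []entryID{{5, 10}, {6, 14}, {8, 16}}
	idx, ok := forkPoint(leader, 25, follower, 18)
	fmt.Println(idx, ok) // fork point found without extra probing rtts
}
```

With only single hintIndex/hintTerm responses, locating this fork point can take multiple leader-follower round trips; with the change points attached, one response suffices.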
Part of #136296
Epic: None
Release note: None