Skip to content

raft: introduce term cache#142239

Merged
craig[bot] merged 5 commits intocockroachdb:masterfrom
hakuuww:introduceTermCacheInRaftLog
Mar 25, 2025
Merged

raft: introduce term cache#142239
craig[bot] merged 5 commits intocockroachdb:masterfrom
hakuuww:introduceTermCacheInRaftLog

Conversation

@hakuuww
Copy link
Copy Markdown
Contributor

@hakuuww hakuuww commented Mar 3, 2025

This PR introduces a new sub data structure termCache to raftLog, which stores a suffix of the raftLog 's entryIDs in a compressed representation, and helps getting a term of a particular raft entry.

termCache integrates tightly with raftLog, which means termCache relies on many assumptions guaranteed by raftLog, allowing concise implementation.


First of all, this is an example of what a raftLog may look like:
entryID: term/index

[t5/10, t5/11, t5/12, t5/13, t6/14, t6/15, t6/16, t6/17, t6/18, t6/19, t7/20, t7/21, t7/22, t7/23, t7/24, t7/25, t7/26, t7/27, t8/28, t8/29, t8/30, t8/31, t10/32, t10/33, t10/34]

properties of a raftLog:
entryID.Index are strictly increasing and continuous.
entryID.term are monotonically increasing, and can have gaps in between.


Those properties allow us to use term change points to express a long, continuous raftLog.

The above example raftLog can be expressed as the following in our termCache representation:
(each entry is a term change point)
[t5/10, t6/14, t7/20, t8/28, t10/32]

In practice, a raftLog may be hundreds/thousands of entries long, but with only a few term changes in between.
So this compressed representation allows us to represent a long raftLog's entryIDs cheaply.

Also in practice: there should only be at most 2 term changes in the raftLog in happy scenarios, anything larger is very rare. This is because compactions(applying raft entry commands to state machine) happens quite quickly.

Implementation details:

  • The term cache supports copy-on-write. Which allows a "LogSnapshot" to be taken.
  • The number of "term flip points" we keep in the term cache is currently fixed.
    • So may not cover the whole raftLog in an extreme case(multiple leader changes in a short burst duration).
    • The reason is that leader changes are very rare in practice.
    • Also this term cache is a per range/raft group thing, so it's a bit risky if we leave it unbounded. (since a server may have a million ranges)
    • Of course, we can make it an environment variable to configure the size(which I think is unnecessary).
    • If we let the term cache monitor raftLog compaction as well, we can leave the size unbounded, it will also allows was to have a guarantee that the term cache covers the whole raftLog at all times (assuming we also implement persisting term cache to storage.)
      -- the benefit of having the term cache cover the whole raftLog is that any probing will only take 1rtt max. (see below on why this is possible)

One immediate benefit of doing so is that there should no longer be any raftEntry cache accesses or pebble calls when we want to know the term of a storage persisted entry. (this is assuming term flips are rare, we can still have pebble access if we want to know the term of a very early entry that is more than termCacheSize terms old).
This helps avoid:

  • unhelpful evictions on the raftEntry cache
  • pebble access

Currently, both of the above scenarios doesn't incur too much cost because we did a lot of tuning work on the raft entry cache in the past to make cache miss rate quite low(1-2% misses). (see #98666, #107424)

Also following this change, we can re-investigate the sizing needed for raft entry cache. Currently it is being allocated 1/256 of total system/cgroup memory, shared evenly between all stores.

Perhaps the amount of memory we allocate to the raft entry cache can be tuned again(decrease) if deemed necessary.


A second benefit is that: since we now keep a compressed representation of suffix of a raftLog, we can use this to carry more information in the raft leader probing follower process.

Currently, a raft message MsgAppResp{reject = true} from the follower only carries a single hintIndex and hintTerm.
With the term cache, we can include more information about the raftLog of a follower in its MsgAppResp with relatively low overhead. Which can be used to reduce the rtt involved in the leader/follower probing process.

Assuming we keep a few(say 4) term change points in the 'termCache', we can attach all 4 of those data points into our raft RPC messages. Which should be enough to cover the whole raftLog of a raft node.
The term cache covers entryIDs in the following range:
[raftLog.first, raftLog.last]
or something like:
[entryID at commited index, raftLog.lastIndex]
(in real implementation we also need to attach a lastIndex, which the term cache doesn't keep, but is kept in unstable/raftLog)

When receiving this termCache information from a MsgAppResp{reject=true} or MsgVoteResp, the leader can immediately know the accurate fork point of where to send the next MsgApp. instead of doing a few probing rtts to find the fork point.
(our current probing algorithm may take 2-3 rtts between Leader and follower to find a fork point in a bad raft case involving multiple leadership changes and partitions)

Part of #136296
Epic: None
Release note: None

@hakuuww hakuuww requested a review from pav-kv March 3, 2025 22:27
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 3, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@hakuuww hakuuww marked this pull request as ready for review March 3, 2025 22:40
@hakuuww hakuuww requested a review from a team as a code owner March 3, 2025 22:40
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 5 times, most recently from 5de64c8 to 2727c35 Compare March 7, 2025 16:32
@hakuuww
Copy link
Copy Markdown
Contributor Author

hakuuww commented Mar 7, 2025

RDFR @pav-kv

@hakuuww hakuuww requested a review from pav-kv March 7, 2025 16:36
@hakuuww hakuuww changed the title termcache: Introduce term cache in raft log raft: Introduce term cache in raft log Mar 7, 2025
@hakuuww hakuuww changed the title raft: Introduce term cache in raft log raftlog: Introduce term cache in raft log Mar 7, 2025
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch from 2727c35 to 6996946 Compare March 7, 2025 16:56
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 2 times, most recently from 4804cda to 17a6bdd Compare March 11, 2025 22:16
@hakuuww hakuuww requested review from a team as code owners March 11, 2025 22:16
@hakuuww hakuuww removed request for a team March 11, 2025 22:17
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 3 times, most recently from 554034b to 57a6c2a Compare March 12, 2025 21:33
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 3 times, most recently from 7e1dc75 to b4afb24 Compare March 17, 2025 15:14
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 2 times, most recently from 5d82846 to 7d5bb78 Compare March 18, 2025 21:19
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch from 7d5bb78 to 1f84f43 Compare March 20, 2025 15:00
@pav-kv pav-kv changed the title raftlog: Introduce term cache in raft log raft: introduce term cache Mar 20, 2025
@pav-kv
Copy link
Copy Markdown
Collaborator

pav-kv commented Mar 20, 2025

@hakuuww I've pushed all the changes from our pair programming / review session to the last commit. I'm happy with the PR as is, LGTM. Before merging, could you please clean up the commit messages of 4311190 and 471f433 to be conventional package: short description and remove some artifacts like # Conflicts: from them?

Would be good to have a second pair of eyes on this PR before merge. @tbg If you have a moment, could you skim through it? I've reviewed the tests extensively, there is 100% coverage, so interesting stuff is term_cache.go and its integration in log.go.

@pav-kv pav-kv requested a review from tbg March 20, 2025 18:20
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch 2 times, most recently from 598d396 to 0247215 Compare March 20, 2025 22:06
@hakuuww hakuuww requested a review from iskettaneh March 20, 2025 22:08
@hakuuww
Copy link
Copy Markdown
Contributor Author

hakuuww commented Mar 20, 2025

@tbg @iskettaneh I will write a more detailed PR message. Feel free to hold off on reviewing it until then.

Done

hakuuww added 3 commits March 22, 2025 19:36
References: cockroachdb#136296
Epic: None
Release note: None

# Conflicts:
#	pkg/raft/log.go
References: cockroachdb#136296
Epic: None
Release note: None
@hakuuww hakuuww force-pushed the introduceTermCacheInRaftLog branch from 13451ab to 03b193f Compare March 24, 2025 13:53
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 7 files at r2, 5 of 5 files at r18, 1 of 1 files at r19, 1 of 1 files at r20, 2 of 2 files at r21, 3 of 3 files at r23, 2 of 2 files at r24, 1 of 1 files at r25, all commit messages.
Dismissed @pav-kv from 100 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)

@hakuuww
Copy link
Copy Markdown
Contributor Author

hakuuww commented Mar 25, 2025

TYFTR!

bors r=pav-kv,tbg

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 25, 2025

@craig craig bot merged commit 06e0523 into cockroachdb:master Mar 25, 2025
24 checks passed
hakuuww added a commit to hakuuww/cockroach that referenced this pull request Mar 25, 2025
Previously, we avoided calling term(index) for leader commit term check
by keeping the first entry index for the current leader term in memory,
and comparing against that index. (see cockroachdb#137826)

Now, since we have implemented term cache
(which stores info of the current leader term's first entryID in this
case), the term cache is used for that check to reduce the complexity
of our code. (see cockroachdb#142239 )

Part of cockroachdb#136296
Fixes: cockroachdb#143362
Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Mar 26, 2025
143424: raft: use term cache for leader commit term check r=pav-kv a=hakuuww

Previously, we avoided calling term(index) for leader commit term check by keeping the first entry index for the current leader term in memory, and comparing against that index. (see #137826)

Now, since we have implemented term cache
(which stores info of the current leader term's first entryID in this case), the term cache is used for that check to reduce the complexity of our code. (see #142239 )

Part of #136296
Fixes: #143362
Epic: None
Release note: None

143463: sql: refactor PlanCDCExpression r=mgartner a=mgartner

#### sql: collect CDC presentation outside of plan walker

The column presentation of the top node is collected to determine the
output columns for the CDC expression. There is no need to do this
within the plan node walker, so it has been moved outside.

Release note: None

#### sql: use InputCount and Input planNode methods to walk CDC plans

The `Input` and `InputCount` methods of the `planNode` interface are now
used to walk plan node trees of CDC expressions. This continues the
effort to deprecate and remove the plan node walkers (see #137620 for
more details on the motivation for this).

Epic: None
Release note: None

#### sql: refactor return statements in PlanCDCExpression

The `return` statements for error cases now explicitly return an empty
`CDCExpressionPlan` for clarity.

Release note: None


Co-authored-by: Anthony Xu <anthony.xu@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants