raftentry: truncate entries from old terms when adding to cache#36360
Conversation
ajwerner
left a comment
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 217 at r1 (raw file):
// Scan returns entries between [lo, hi) for specified range. If any entries are
// returned for the specified indices, they will start with index lo and proceed
Don’t tell #teamindexes
bad1a20 to
a669e9f
@nvanbenschoten Is this issue new in 19.1, or has it existed for a while? Are we going to want to backport this to 2.1?
PS This is an excellent find!
I’m pretty sure it’s new. The old implementation of the raft entry cache always added entries before proceeding to evict them. |
Yeah, I think it's new, but not because we ever explicitly truncated entries from earlier terms with larger indexes. I think it's just because we never decided not to add new entries to the cache (which overwrote entries from older terms) and |
46dbd83 to
ab583c8
nvb
left a comment
Added a test for the new behavior. PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/storage/raftentry/cache.go, line 217 at r1 (raw file):
Previously, ajwerner wrote…
Don’t tell #teamindexes
This is #teamletspickonespellingforthesamefile 😃
tbg
left a comment
Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
// Get p and move the partition to the front of the LRU.
p := c.getPartLocked(id, add /* create */, true /* recordUse */)
if bytesGuessed > 0 {
Is this better than saying if add {? Or is one correct and the other isn't? I have a hard time reading whether we want anything special to happen if analyzeEntries tells us bytesGuessed := 0. Is that a thing? In that case it seems that we'd want to enter here but wouldn't.
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
}
c.mu.Unlock()
if p == nil /* && !add */ {
?
pkg/storage/raftentry/cache.go, line 180 at r2 (raw file):
}
if truncate {
	truncIdx := ents[0].Index
Is the case in which this one is used exercised?
pkg/storage/raftentry/cache_test.go, line 184 at r2 (raw file):
that is not
ajwerner
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
Is that a thing?
I don't think that's a thing but if it were we wouldn't want to enter here. This logic is about evicting to create space. analyzeEntries always over guesses. If the upper bound of the space we need is 0 then we don't need to do any eviction.
ab583c8 to
6eaec4c
nvb
left a comment
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @tbg)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
Previously, ajwerner wrote…
Is that a thing?
I don't think that's a thing but if it were we wouldn't want to enter here. This logic is about evicting to create space. analyzeEntries always over guesses. If the upper bound of the space we need is 0 then we don't need to do any eviction.
analyzeEntries() == 0 is only a thing if len(ents) == 0. Since we guard against that above, bytesGuessed == 0 is equivalent to !add. I'm fine with either, although I went with this because bytesGuessed is more directly what we're concerned about here for the reasons @ajwerner listed.
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
?
This is now possible because we don't create missing partitions iff !add. The comment is meant to signify p == nil => !add. I'm happy to remove it if it's confusing.
pkg/storage/raftentry/cache.go, line 180 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is the case in which this one is used exercised?
Yes, these two cases are each represented in TestAddAndTruncate.
pkg/storage/raftentry/cache_test.go, line 184 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
that is not
Done.
tbg
left a comment
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is now possible because we don't create missing partitions iff !add. The comment is meant to signify p == nil => !add. I'm happy to remove it if it's confusing.
That comment style triggers suspicion that it was accidentally included (for example while testing out something locally), just make it explicit.
Fixes cockroachdb#36353. See cockroachdb#36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.

This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in [`Replica.append`](https://github.com/cockroachdb/cockroach/blob/a6b3c540b696002b2ed07036a657612995d6d1ab/pkg/storage/replica_raftstorage.go#L621), but for the Raft entry cache instead of for persistent Raft entries.

Release note: None
6eaec4c to
162490c
nvb
left a comment
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @tbg)
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That comment style triggers suspicion that it was accidentally included (for example while testing out something locally), just make it explicit.
Done.
Build failed
OOM in bors r+
36360: raftentry: truncate entries from old terms when adding to cache r=nvanbenschoten a=nvanbenschoten

Fixes #36353. See #36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.

This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in [`Replica.append`](https://github.com/cockroachdb/cockroach/blob/a6b3c540b696002b2ed07036a657612995d6d1ab/pkg/storage/replica_raftstorage.go#L621), but for the Raft entry cache instead of for persistent Raft entries.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes (`103` and up) from the raft log, and the raft entry
cache must follow the same semantics, so that it is always a coherent
subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which cannot fit in the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" ones (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800,000
`n2-standard-4` instance minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) side effect of removing
any higher indexes from the raft log (`103` and up), and the raft entry
cache must follow the same semantics, so that it always remains a
coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` that cannot fit in the cache at all (i.e. their aggregate size
exceeds the cache's capacity).
What *should* be the result of this addition is an empty cache, because
the true raft log now ends at index 52. In reality, though, the cache
accidentally turned this operation into a no-op: it used an iterator
that is invalid when positioned at an index that is not currently
cached (in this example, index `50`, whereas the first cached index is
`101`).
It took us a while to identify this bug, as it manifested only as a
very rare replica inconsistency failure (cockroachdb#61990) in the
`hotspotsplits/nodes=4` roachtest (caught only thanks to
cockroachdb#36241), which would then take days to reproduce. When we
finally obtained a reproduction that contained the entire history of
disk writes for the cluster, we found that the inconsistency had
occurred due to entries from a past term being replicated to the
divergent follower. Concretely, where the leader had something like
| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |
the divergent follower would have this:
| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced with entries for term 9.
These discarded entries should consequently never have been committed;
in particular, the term can never regress in the raft log, even for
uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (which
itself had the correct entries in its own log) had in fact replicated
the incorrect (term 7) entries to the follower. This meant that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Unable to find the issue
through direct inspection, we added invariants and tweaked the test
to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial step in identifying this bug was adding invariant checks
(which will be re-added in a future commit) that verified that a
truncation had indeed removed all subsequent cached entries.
This highlighted that we were hitting this bug in at least ~5% of
`hotspotsplits` runs, but possibly more frequently (the assertion did
not catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it performs concentrated 256KiB inserts, which puts
the raft group into mild overload; many entries pile up and get
processed in aggregate, which leads to constantly taking the path in
which truncations are erroneously ignored. However, most of the time,
omitting the eviction does not yield replica inconsistency. For this to
happen, the erroneously cached entries need to be picked up by raft for
sending to a follower; they need to be the "incorrect" entries (which
necessitates a discarded log and a term change at just the right
moment); and finally they must sit at log positions at which the
correct entries actually had an effect (which, to be fair, is mostly
the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a
much more egregious version of this bug: prior to it, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, which consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from a bad, moderately rare bug to a bad, very very rare
bug and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800,000 minutes of
`n2-standard-4` instance time on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica
divergence was fixed. Such divergence would be reported by the replica
consistency checker, typically within 24 hours of occurring, and would
cause the affected nodes to terminate.
Fixes #36353.
Fixes #35424.
See #36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.
This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in `Replica.append`, but for the Raft entry cache instead of for persistent Raft entries.