
raftentry: fix accidental no-op in truncateFrom#63302

Merged
craig[bot] merged 3 commits into cockroachdb:master from tbg:truncate-to-bug
Apr 12, 2021

Conversation

@tbg
Member

@tbg tbg commented Apr 8, 2021

When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, 100, 101, 102 to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes from the raft log (103 and up), and the raft entry cache
must follow the same semantics, so that it is always a coherent subset
of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

[101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes
[50, 51, 52] which cannot fit in the cache at all (i.e. their aggregate
size exceeds that of the cache).

The result of this addition should be an empty cache, because the
true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index 50, whereas the first
cached index would be 101).
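The eviction semantics at stake here can be sketched with a toy index-plus-size model of the cache (illustrative names and structure, not the real raftentry API):

```go
package main

import "fmt"

// entry models a raft log entry with an index and a payload size.
type entry struct {
	index uint64
	size  int
}

// cache is a toy model of the raft entry cache: a contiguous window of
// entries plus a byte budget.
type cache struct {
	entries []entry
	maxSize int
}

// add mirrors the required semantics: adding entries starting at index
// lo must evict every cached entry with index >= lo, even if none of
// the new entries end up fitting in the cache. The bug was that this
// eviction silently became a no-op when lo preceded the first cached
// index.
func (c *cache) add(ents []entry) {
	if len(ents) == 0 {
		return
	}
	lo := ents[0].index
	// Truncate first: drop all cached entries at index >= lo.
	kept := c.entries[:0]
	used := 0
	for _, e := range c.entries {
		if e.index < lo {
			kept = append(kept, e)
			used += e.size
		}
	}
	c.entries = kept
	// Then append whatever fits within the byte budget.
	for _, e := range ents {
		if used+e.size > c.maxSize {
			return
		}
		c.entries = append(c.entries, e)
		used += e.size
	}
}

func main() {
	c := &cache{maxSize: 10}
	c.add([]entry{{101, 1}, {102, 1}, {103, 1}, {104, 1}})
	// Oversized entries at [50, 51, 52]: nothing fits, but the cache
	// must still end up empty, because the true log now ends at 52.
	c.add([]entry{{50, 100}, {51, 100}, {52, 100}})
	fmt.Println(len(c.entries)) // 0
}
```

In the buggy version, the second add would leave [101, 102, 103, 104] cached, a window that no longer corresponds to any suffix of the real log.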

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (#61990) of the
hotspotsplits/nodes=4 roachtest (which was only caught thanks to
#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). hotspotsplits brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" ones (which necessitates
a discarded log and term change at the right moment); and finally they
must be in log positions at which the correct entries actually had an
effect (which is, to be fair, mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in #36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, which consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 minutes of
n2-standard-4 instance time on reproductions related to this bug in
this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.

Closes #61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the truncate-to-bug branch from ad7d6c1 to cae2c1c Compare April 8, 2021 11:20
@tbg tbg marked this pull request as ready for review April 8, 2021 11:22
@tbg tbg requested review from a team and nvb April 8, 2021 11:22
Contributor

@erikgrinaker erikgrinaker left a comment

Excellent work!

Reviewed 4 of 4 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/raftentry/cache.go, line 381 at r2 (raw file):

		}
		if i != 0 && e.Term < prevTerm {
			err := errors.Errorf("term regression idx %d: %d -> %d", prevIndex, e.Term, prevTerm)

I think e.Term and prevTerm should be swapped here.

Collaborator

@petermattis petermattis left a comment

Excellent sleuthing. Some drive-by comments because I was curious about this. I'm not at all familiar with this code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/raftentry/ring_buffer.go, line 75 at r2 (raw file):

			// (say the buf is idx=[100, 101, 102] and `lo` is 99),
			// we need to truncate everything.
			it, ok = iterateFrom(b, first(b).index(b))

first(b) is an iterator, which you're retrieving the index of to create another iterator that is at the index of first(b). I think it == first(b) unless I'm totally misunderstanding something. Which means you can simplify this to:

if f := first(b); f.index(b) > lo {
  it, ok = f, true
}

The semantics of iterateFrom are a bit surprising. I would think it would always return an iterator if there are any entries >= index. Did you consider putting this logic in iterateFrom? Would doing so break other use cases? It looks like ringBuf.get would need to be fixed, but this would fix ringBuf.scan.

Contributor

@nvb nvb left a comment

Reviewed 4 of 4 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @petermattis, and @tbg)


pkg/kv/kvserver/raftentry/cache.go, line 187 at r1 (raw file):

	}
	if truncate {
		// Note that if !add, ents[0].Index may not even be in the cache

I think there's still an odd case here where add = true, truncate = true but we are attempting to insert a set of non-overlapping, earlier entries. In that case, it appears that ringBuf.add will be a no-op because it can't allow gaps. By definition, this is a case where truncating from the first index in ents will be equivalent to truncating from the last entry in ents - both broken before this PR, both fixed after it.

However, even after this PR, I think this will lead to the cache being empty instead of the higher-index entries being truncated and replaced by the lower-index entries. So we'll immediately miss the cache next time we try to query for those new entries.

Do you mind trying to create this scenario in a test? It will be something like add([10, 11, 12], true), add([4, 5, 6], true). We should end up with [4, 5, 6], but I suspect we'll end up with [].

This subtle interaction leads me to believe that we should be truncating before adding. We should probably also remove the Some entries were already overwritten branch to avoid the truncate operation's correctness depending on the add operation doing something that it doesn't always promise to do - overwriting every entry.
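The ordering hazard described above can be sketched with a tiny index-only model of the cache (illustrative names only, not the real raftentry code), showing how truncate-after-add loses the new entries while truncate-before-add keeps them:

```go
package main

import "fmt"

// toyCache is an index-only sketch: add refuses to create gaps in the
// cached window, so whether truncation happens before or after the add
// changes the outcome.
type toyCache struct{ idxs []uint64 }

// truncateFrom evicts every cached index >= lo.
func (c *toyCache) truncateFrom(lo uint64) {
	kept := c.idxs[:0]
	for _, i := range c.idxs {
		if i < lo {
			kept = append(kept, i)
		}
	}
	c.idxs = kept
}

// add appends entries, but is a no-op if they would leave a gap in the
// cached window (mirroring ringBuf.add's no-gaps behavior).
func (c *toyCache) add(ents []uint64) {
	if len(c.idxs) == 0 {
		c.idxs = append([]uint64(nil), ents...)
		return
	}
	if ents[len(ents)-1]+1 < c.idxs[0] || ents[0] > c.idxs[len(c.idxs)-1]+1 {
		return // non-adjacent, non-overlapping: adding would leave a gap
	}
	// Overlap/extension handling elided; not needed for this scenario.
}

func main() {
	// Pre-fix order inside Add: add first, then truncate.
	a := &toyCache{}
	a.add([]uint64{10, 11, 12})
	a.add([]uint64{4, 5, 6}) // no-op: would leave a gap
	a.truncateFrom(4)        // evicts everything >= 4
	fmt.Println(a.idxs)      // []

	// Fixed order: truncate first, then add.
	b := &toyCache{}
	b.add([]uint64{10, 11, 12})
	b.truncateFrom(4)        // cache now empty
	b.add([]uint64{4, 5, 6}) // no gap anymore
	fmt.Println(b.idxs)      // [4 5 6]
}
```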


pkg/kv/kvserver/raftentry/cache.go, line 381 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think e.Term and prevTerm should be swapped here.

I agree.


pkg/kv/kvserver/raftentry/cache_test.go, line 196 at r1 (raw file):

	// Cache has entries 4, 5. Offer an oversize entry at index 3 (which is
	// notably before 4) and request truncation. This should clear all entries
	//>= 3, i.e. everything.

nit: spacing at beginning of line.


pkg/kv/kvserver/raftentry/ring_buffer.go, line 75 at r2 (raw file):

Which means you can simplify this to

+1 I also find this easier to understand, but we may want to make it even more clearly max(lo, first). How about something like:

it, ok := iterateFrom(b, max(lo, first(b).index(b)))

but this would fix ringBuf.scan

I'm not sure that it would. The contract of Scan is "If any entries are returned for the specified indices, they will start with index lo and proceed sequentially without gaps". So we seem to prefer returning nothing over skipping the lower bound index and returning other entries.
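The clamp being discussed, max(lo, first), can be sketched in a toy model (illustrative names; the real iterateFrom/ringBuf signatures may differ): when lo falls below the cached window, truncation starts from the first index actually cached rather than silently doing nothing.

```go
package main

import "fmt"

// buf is a toy stand-in for the ring buffer's index window.
type buf struct{ idxs []uint64 }

// first returns the lowest cached index, or false if the buffer is empty.
func (b *buf) first() (uint64, bool) {
	if len(b.idxs) == 0 {
		return 0, false
	}
	return b.idxs[0], true
}

// truncateFrom removes every cached index >= lo, clamping lo up to the
// first cached index so that a lo below the window still evicts.
func (b *buf) truncateFrom(lo uint64) {
	f, ok := b.first()
	if !ok {
		return
	}
	if f > lo {
		lo = f // max(lo, first)
	}
	kept := b.idxs[:0]
	for _, i := range b.idxs {
		if i < lo {
			kept = append(kept, i)
		}
	}
	b.idxs = kept
}

func main() {
	b := &buf{idxs: []uint64{100, 101, 102}}
	b.truncateFrom(99) // lo below the window: clamps to 100
	fmt.Println(len(b.idxs)) // 0
}
```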


pkg/kv/kvserver/raftentry/ring_buffer.go, line 91 at r2 (raw file):

		if b.len > 0 {
			if lastIdx := last(b).index(b); lastIdx >= lo {
				err := errors.Errorf(

nit: why the separate variable definition?

@tbg tbg requested review from erikgrinaker, nvb and petermattis April 9, 2021 14:27
Member Author

@tbg tbg left a comment

Thanks for the reviews! Feedback addressed (all in a new commit). Appreciate another close look before I merge this.

Dismissed @erikgrinaker and @nvanbenschoten from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @petermattis)


pkg/kv/kvserver/raftentry/cache.go, line 187 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think there's still an odd case here where add = true, truncate = true but we are attempting to insert a set of non-overlapping, earlier entries. In that case, it appears that ringBuf.add will be a no-op because it can't allow gaps. By definition, this is a case where truncating from the first index in ents will be equivalent to truncating from the last entry in ents - both broken before this PR, both fixed after it.

However, even after this PR, I think this will lead to the cache being empty instead of the higher-index entries being truncated and replaced by the lower-index entries. So we'll immediately miss the cache next time we try to query for those new entries.

Do you mind trying to create this scenario in a test? It will be something like add([10, 11, 12], true), add([4, 5, 6], true). We should end up with [4, 5, 6], but I suspect we'll end up with [].

This subtle interaction leads me to believe that we should be truncating before adding. We should probably also remove the Some entries were already overwritten branch to avoid the truncate operation's correctness depending on the add operation doing something that it doesn't always promise to do - overwriting every entry.

👍 thanks, this is a good simplification. Test case is added and passes now (it did not originally, just like you suspected).


pkg/kv/kvserver/raftentry/ring_buffer.go, line 75 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Which means you can simplify this to

+1 I also find this easier to understand, but we may want to make it even more clearly max(lo, first). How about something like:

it, ok := iterateFrom(b, max(lo, first(b).index(b)))

but this would fix ringBuf.scan

I'm not sure that it would. The contract of Scan is "If any entries are returned for the specified indices, they will start with index lo and proceed sequentially without gaps". So we seem to prefer returning nothing over skipping the lower bound index and returning other entries.

Done.

Member Author

@tbg tbg left a comment

Dismissed @petermattis from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @petermattis)


pkg/kv/kvserver/raftentry/ring_buffer.go, line 75 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Done.

Done.

Collaborator

@petermattis petermattis left a comment

LGTM, though I'm too distant from this code to give it a real stamp of approval.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)

Contributor

@nvb nvb left a comment

:lgtm:

Reviewed 5 of 5 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raftentry/cache.go, line 184 at r3 (raw file):

	var bytesAdded, entriesAdded, bytesRemoved, entriesRemoved int32
	// We truncate (if requested) before adding. This can lead to "wasted"
	// work we're were zeroing out entries we would repopulate anyway, but

s/we're were/where we're/


pkg/kv/kvserver/raftentry/cache_test.go, line 104 at r3 (raw file):

		act = append(act, it.index(b))
		if len(act) > 1000 {
			panic("x")

Did you mean to say something here?


pkg/kv/kvserver/raftentry/ring_buffer.go, line 76 at r3 (raw file):

		// (say the buf is idx=[100, 101, 102] and `lo` is 99),
		// we need to truncate everything and need to use the
		// first index that gives us a valid iterator back.

I would point out explicitly that iterateFrom returns !ok if given an index below the first index in the buffer.


pkg/kv/kvserver/raftentry/ring_buffer.go, line 80 at r3 (raw file):

	}
	it, ok := iterateFrom(b, lo)
	if !ok {

Do we need this early return? The rest of the method works without it, which is probably an indication that we shouldn't add unnecessary branching which could eventually hide a bug.


pkg/kv/kvserver/raftentry/ring_buffer_test.go, line 17 at r3 (raw file):


	"github.com/stretchr/testify/require"
)

Think it's worth adding testing for each of the methods in the ringBuf's external interface (those in rangeCache)? These may be a bit redundant with the testing we already have for Cache, but given how much difficulty we seem to be having with getting this to be correct, tests at a few layers may not be a bad idea.

Or maybe we could have someone with fresh eyes (and a bit less on their plate at the moment) come and add some tests here.

tbg added 2 commits April 12, 2021 10:26
This productionizes a number of assertions that were added in cockroachdb#62791,
while investigating cockroachdb#61990.
We don't have a good story about whether assertions always need to
be behind a build tag. I added the assertions that are dirt cheap
without a build tag, and where I augmented an existing assertion
I kept it behind the existing build tag gating.

If any of these assertions fires, it's a good chance that we are facing
potential log corruption (the one in the entry cache is often benign,
but once it leads to a term regression, it's a very likely bad time), so
if anything, in the (relatively distant) future these assertions should
fail "gracefully" at the replica level as opposed to taking down the
node; we wouldn't want to disable them. Since we can't fail assertions
at the replica level now, per-node is fine and they will have plenty
of time to bake until they make it into a release.

Release note: None
@tbg tbg force-pushed the truncate-to-bug branch 2 times, most recently from f585e4d to ab4f04c Compare April 12, 2021 11:38
- simplify `lo` replacement in truncateFrom
- truncate before adding in `(*Cache).Add()`
- add unit test coverage for ringBuf

This commit also fixes a bug where `first()` did not return an invalid
iterator as advertised. Since the resulting iterator's `.next()` would
wrap around, the iterator would accidentally stay valid forever and
traverse the buffer. `first()` now actually returns an invalid iterator,
and so does `last()`, and a full code audit of all callers was carried
out.
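
The `first()` bug is easy to model in isolation (illustrative names, not the real ringBuf code): on an empty buffer, an iterator that simply points at the head position, with a `next()` that wraps modulo the buffer length, never becomes invalid and can traverse garbage forever.

```go
package main

import "fmt"

// ring is a toy ring buffer holding only the bookkeeping needed to
// show the iterator-validity bug.
type ring struct {
	head int // position of the first element
	n    int // number of valid elements
}

const invalid = -1

// firstBuggy returns the head position even when the buffer is empty,
// so the resulting iterator is (incorrectly) always valid.
func (r *ring) firstBuggy() int { return r.head }

// firstFixed returns an invalid iterator for an empty buffer, as
// advertised.
func (r *ring) firstFixed() int {
	if r.n == 0 {
		return invalid
	}
	return r.head
}

func main() {
	r := &ring{} // empty buffer
	fmt.Println(r.firstBuggy() != invalid) // true: a scan would spin over garbage
	fmt.Println(r.firstFixed() != invalid) // false: scan terminates immediately
}
```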

The tests were updated to use simplified entry sizes. Previously, the
entry size in tests sized the `ent.Data` slice, but the ring buffer
uses `ent.Sized()`. Now we reverse engineer entries of the correct
size in testing, which makes test cases much easier to follow.

Release note: None
@tbg tbg force-pushed the truncate-to-bug branch from ab4f04c to 6724103 Compare April 12, 2021 13:15
@tbg tbg requested a review from nvb April 12, 2021 13:15
Member Author

@tbg tbg left a comment

TFTRs!

bors r=nvanbenschoten,erikgrinaker

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/raftentry/cache_test.go, line 104 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to say something here?

Oops, this is leftover debugging from fixing the bug where an iter on the empty buf accidentally ended up always valid. Removed.


pkg/kv/kvserver/raftentry/ring_buffer.go, line 80 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need this early return? The rest of the method works without it, which is probably an indication that we shouldn't add unnecessary branching which could eventually hide a bug.

We don't, good call.


pkg/kv/kvserver/raftentry/ring_buffer_test.go, line 17 at r3 (raw file):

Or maybe we could have someone with fresh eyes (and a bit less on their plate at the moment) come and add some tests here.

Heh, that'll never happen. I wrote unit tests. They didn't turn up anything, but it's nice to have them.

@craig
Contributor

craig bot commented Apr 12, 2021

Build succeeded:

@irfansharif
Contributor

irfansharif commented Apr 12, 2021

We should mint this PR as an NFT. Starting bid at $39,000.

@tbg tbg deleted the truncate-to-bug branch April 13, 2021 08:11
@rafiss rafiss added this to the 21.1 milestone Apr 22, 2021

Development

Successfully merging this pull request may close these issues.

roachtest: hotspotsplits/nodes=4 failed [inconsistency]

7 participants