raftentry: truncate entries from old terms when adding to cache#36360
Conversation
ajwerner
left a comment
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 217 at r1 (raw file):
// Scan returns entries between [lo, hi) for specified range. If any entries are
// returned for the specified indices, they will start with index lo and proceed
Don’t tell #teamindexes
bad1a20 to
a669e9f
@nvanbenschoten Is this issue new in 19.1, or has it existed for a while? Are we going to want to backport this to 2.1?
PS This is an excellent find!
I’m pretty sure it’s new. The old implementation of the raft entry cache always added entries before proceeding to evict them. |
Yeah, I think it's new, but not because we ever explicitly truncated entries from earlier terms with larger indexes. I think it's just because we never decided not to add new entries to the cache (which overwrote entries from older terms) and |
46dbd83 to
ab583c8
nvb
left a comment
Added a test for the new behavior. PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/storage/raftentry/cache.go, line 217 at r1 (raw file):
Previously, ajwerner wrote…
Don’t tell #teamindexes
This is #teamletspickonespellingforthesamefile 😃
tbg
left a comment
Reviewed 4 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
// Get p and move the partition to the front of the LRU.
p := c.getPartLocked(id, add /* create */, true /* recordUse */)
if bytesGuessed > 0 {
Is this better than saying if add {? Or is one correct and the other isn't? I have a hard time reading whether we want anything special to happen if analyzeEntries tells us bytesGuessed := 0. Is that a thing? In that case it seems that we'd want to enter here but wouldn't.
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
}
c.mu.Unlock()
if p == nil /* && !add */ {
?
pkg/storage/raftentry/cache.go, line 180 at r2 (raw file):
}
if truncate {
	truncIdx := ents[0].Index
Is the case in which this one is used exercised?
pkg/storage/raftentry/cache_test.go, line 184 at r2 (raw file):
that is not
ajwerner
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @tbg)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
Is that a thing?
I don't think that's a thing but if it were we wouldn't want to enter here. This logic is about evicting to create space. analyzeEntries always over guesses. If the upper bound of the space we need is 0 then we don't need to do any eviction.
ab583c8 to
6eaec4c
nvb
left a comment
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @tbg)
pkg/storage/raftentry/cache.go, line 151 at r2 (raw file):
Previously, ajwerner wrote…
Is that a thing?
I don't think that's a thing but if it were we wouldn't want to enter here. This logic is about evicting to create space. analyzeEntries always over guesses. If the upper bound of the space we need is 0 then we don't need to do any eviction.
analyzeEntries() == 0 is only a thing if len(ents) == 0. Since we guard against that above, bytesGuessed == 0 is equivalent to !add. I'm fine with either, although I went with this because bytesGuessed is more directly what we're concerned about here for the reasons @ajwerner listed.
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
?
This is now possible because we don't create missing partitions iff !add. The comment is meant to signify p == nil => !add. I'm happy to remove it if it's confusing.
pkg/storage/raftentry/cache.go, line 180 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is the case in which this one is used exercised?
Yes, these two cases are each represented in TestAddAndTruncate.
pkg/storage/raftentry/cache_test.go, line 184 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
that is not
Done.
tbg
left a comment
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is now possible because we don't create missing partitions iff !add. The comment is meant to signify p == nil => !add. I'm happy to remove it if it's confusing.
That comment style triggers suspicion that it was accidentally included (for example while testing out something locally), just make it explicit.
Fixes cockroachdb#36353. See cockroachdb#36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.

This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in [`Replica.append`](https://github.com/cockroachdb/cockroach/blob/a6b3c540b696002b2ed07036a657612995d6d1ab/pkg/storage/replica_raftstorage.go#L621), but for the Raft entry cache instead of for persistent Raft entries.

Release note: None
6eaec4c to
162490c
nvb
left a comment
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @tbg)
pkg/storage/raftentry/cache.go, line 168 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That comment style triggers suspicion that it was accidentally included (for example while testing out something locally), just make it explicit.
Done.
Build failed
OOM in bors r+
36360: raftentry: truncate entries from old terms when adding to cache r=nvanbenschoten a=nvanbenschoten

Fixes #36353. See #36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.

This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in [`Replica.append`](https://github.com/cockroachdb/cockroach/blob/a6b3c540b696002b2ed07036a657612995d6d1ab/pkg/storage/replica_raftstorage.go#L621), but for the Raft entry cache instead of for persistent Raft entries.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Build succeeded
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes (`103` and up) from the raft log, and the raft entry
cache must follow the same semantics, so that it is always a coherent
subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which cannot fit in the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" ones (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800,000
`n2-standard-4` instance minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) side effect of removing
any higher indexes from the raft log (`103` and up), and the raft entry
cache must follow the same semantics, so that it always remains a
coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` that cannot fit in the cache at all (i.e. their aggregate size
exceeds the cache's capacity).
What *should* be the result of this addition is an empty cache, because
the true raft log now ends at index 52. In reality, though, the cache
accidentally turned this operation into a no-op: it used an iterator
that is invalid when positioned at an index that is not currently
cached (in this example, index `50`, whereas the first cached index is
`101`).
It took us a while to identify this bug, as it manifested only as a
very rare replica inconsistency failure (cockroachdb#61990) in the
`hotspotsplits/nodes=4` roachtest (caught only thanks to
cockroachdb#36241), which would then take days to reproduce. When we
finally obtained a reproduction that contained the entire history of
disk writes for the cluster, we found that the inconsistency had
occurred due to entries from a past term being replicated to the
divergent follower. Concretely, where the leader had something like
| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |
the divergent follower would have this:
| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced with entries for term 9.
These discarded entries should consequently never have been committed;
in particular, the term can never regress in the raft log, even for
uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (which
itself had the correct entries in its own log) had in fact replicated
the incorrect (term 7) entries to the follower. This meant that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Unable to find the issue
through direct inspection, we added invariants and tweaked the test
to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial step in identifying this bug was adding invariant checks
(which will be re-added in a future commit) that verified that a
truncation had indeed removed all subsequent cached entries.
This highlighted that we were hitting this bug in at least ~5% of
`hotspotsplits` runs, but possibly more frequently (the assertion did
not catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it performs concentrated 256KiB inserts, which puts
the raft group into mild overload; many entries pile up and get
processed in aggregate, which leads to constantly taking the path in
which truncations are erroneously ignored. However, most of the time,
omitting the eviction does not yield replica inconsistency. For this to
happen, the erroneously cached entries need to be picked up by raft for
sending to a follower; they need to be the "incorrect" entries (which
necessitates a discarded log and a term change at just the right
moment); and finally they must sit at log positions at which the
correct entries actually had an effect (which, to be fair, is mostly
the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a
much more egregious version of this bug: prior to it, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, which consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from a bad, moderately rare bug to a bad, very very rare
bug and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800,000 minutes of
`n2-standard-4` instance time on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica
divergence was fixed. Such divergence would be reported by the replica
consistency checker, typically within 24 hours of occurring, and would
cause the affected nodes to terminate.
Fixes #36353.
Fixes #35424.
See #36353 (comment) for the reasoning behind how this could lead to the inconsistency observed in that issue.
This change adds an option to `raftentry.Cache`'s `Add` method to truncate entries from old terms when inserting new entries into the cache. This ensures that old entries at lower terms are purged from the cache and will never be provided to `etcd/raft`. The approach here is analogous to the truncation in `Replica.append`, but for the Raft entry cache instead of for persistent Raft entries.