roachtest: fail test if crdb_internal.check_consistency finds problems#36241
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Apr 1, 2019
Merged
Conversation
Member
Member
Author
|
(Definitely going to run this against a few choice roachtests before I merge it to make sure it behaves as expected) |
Member
Author
|
Rebased on top of #36320, PTAL. |
779f507 to
29a15fd
Compare
andreimatei
reviewed
Apr 1, 2019
Contributor
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
29a15fd to
8a4a49c
Compare
Member
Author
|
TFTR! Rebased on your PR. bors r=andreimatei |
Contributor
Build failed |
This can give us confidence that we'll catch any instances of replica divergence occurring in roachtest as long as the divergence affects the MVCCStats keys. This has been the case in all but one reported such issues. Running the full check is prohibitive and will likely crash the cluster due to a lack of admission control at this point, but if we improve its underlying generator to run only N checks in parallel it should work just fine, though it'll take a long time. Release note: None
8a4a49c to
aedc595
Compare
Member
Author
|
Had to skip the consistency check during unit tests. bors r=andreimatei |
craig bot
pushed a commit
that referenced
this pull request
Apr 1, 2019
36241: roachtest: fail test if crdb_internal.check_consistency finds problems r=andreimatei a=tbg This can give us confidence that we'll catch any instances of replica divergence occurring in roachtest as long as the divergence affects the MVCCStats keys. This has been the case in all but one reported such issues. Running the full check is prohibitive and will likely crash the cluster due to a lack of admission control. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Contributor
Build succeeded |
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 8, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
craig bot
pushed a commit
that referenced
this pull request
Apr 12, 2021
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in #36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes #61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes in the raft log ( `103` and up) and for correctness, the
raft entry cache must follow the same semantics, so that it is always
a coherent subset of the raft log.
It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains
[101, 102, 103, 104]
and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which can not fit the cache at all (i.e. their aggregate size
exceeds that of the cache).
What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).
It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contain the entire history of disk
writes for the cluster, we found that the inconsistency had occured
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 |
the divergent follower would have this:
| idx | 100 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9 | 9 | *7* | *7* | *7* | 9 | 9 | 9 |
This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.
From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that that we
were either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).
A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" (which necessitates a
discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is to be fair mostly the case).
The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
lead to more frequent inconsistencies (as referenced in the issue).
So we went from bad, moderately rare bug to bad, very very rare bug
and, in this PR, hopefully, to no bug.
A rough calculation suggests that we spent around 800.000 minutes of
`n2-standard-4` instances minutes on reproductions related to this bug
in this instance alone, which I think boils down to around $39.000. I
like to think we got off cheap here.
Closes cockroachdb#61990.
Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest as long as the divergence affects the
MVCCStats keys. This has been the case in all but one reported such issues.
Running the full check is prohibitive and will likely crash the cluster due
to a lack of admission control.
Release note: None