
roachtest: fail test if crdb_internal.check_consistency finds problems#36241

Merged
craig[bot] merged 1 commit into cockroachdb:master from tbg:fix/roachtest-consistency-checks on Apr 1, 2019

Conversation

@tbg
Member

@tbg tbg commented Mar 27, 2019

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest, as long as the divergence affects the
MVCCStats keys. This has been the case in all but one of the reported issues.

Running the full check is prohibitive and will likely crash the cluster due
to a lack of admission control.

Release note: None

@tbg tbg requested a review from andreimatei March 27, 2019 20:30
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member Author

tbg commented Mar 27, 2019

(Definitely going to run this against a few choice roachtests before I merge it to make sure it behaves as expected)

@tbg
Member Author

tbg commented Apr 1, 2019

Rebased on top of #36320, PTAL.

@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 779f507 to 29a15fd on April 1, 2019 10:33
@tbg tbg requested review from a team April 1, 2019 10:33
Contributor

@andreimatei andreimatei left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 29a15fd to 8a4a49c on April 1, 2019 18:28
@tbg
Member Author

tbg commented Apr 1, 2019

TFTR! Rebased on your PR.

bors r=andreimatei

@craig
Contributor

craig bot commented Apr 1, 2019

Build failed

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest, as long as the divergence affects
the MVCCStats keys. This has been the case in all but one of the
reported issues.

Running the full check is prohibitive and will likely crash the
cluster due to a lack of admission control at this point, but
if we improve its underlying generator to run only N checks in
parallel it should work just fine, though it'll take a long time.

Release note: None
@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 8a4a49c to aedc595 on April 1, 2019 20:33
@tbg
Member Author

tbg commented Apr 1, 2019

Had to skip the consistency check during unit tests.

bors r=andreimatei

craig bot pushed a commit that referenced this pull request Apr 1, 2019
36241: roachtest: fail test if crdb_internal.check_consistency finds problems r=andreimatei a=tbg

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest as long as the divergence affects the
MVCCStats keys. This has been the case in all but one reported such issues.

Running the full check is prohibitive and will likely crash the cluster due
to a lack of admission control.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig
Contributor

craig bot commented Apr 1, 2019

Build succeeded

@craig craig bot merged commit aedc595 into cockroachdb:master Apr 1, 2019
@tbg tbg deleted the fix/roachtest-consistency-checks branch April 1, 2019 21:14
tbg added a commit to tbg/cockroach that referenced this pull request Apr 8, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes from the raft log (`103` and up), and the raft entry
cache must follow the same semantics, so that it is always a coherent
subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which cannot fit in the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we were
either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" ones (which necessitates
a discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is, to be fair, mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from a bad, moderately rare bug to a bad, very very rare
bug and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 minutes of
`n2-standard-4` instance time on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
craig bot pushed a commit that referenced this pull request Apr 12, 2021
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg

(Same commit message as above.)
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)