
roachtest: fail test if crdb_internal.check_consistency finds problems#36241

Merged
craig[bot] merged 1 commit into cockroachdb:master from tbg:fix/roachtest-consistency-checks on Apr 1, 2019

Conversation

@tbg
Member

@tbg tbg commented Mar 27, 2019

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest, as long as the divergence affects the
MVCCStats keys. This has been the case in all but one of the reported issues.

Running the full check is prohibitive and will likely crash the cluster due
to a lack of admission control.

Release note: None

@tbg tbg requested a review from andreimatei March 27, 2019 20:30
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member Author

tbg commented Mar 27, 2019

(Definitely going to run this against a few choice roachtests before I merge it to make sure it behaves as expected)

@tbg
Member Author

tbg commented Apr 1, 2019

Rebased on top of #36320, PTAL.

@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 779f507 to 29a15fd on April 1, 2019 10:33
@tbg tbg requested review from a team April 1, 2019 10:33
Contributor

@andreimatei andreimatei left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 29a15fd to 8a4a49c on April 1, 2019 18:28
@tbg
Member Author

tbg commented Apr 1, 2019

TFTR! Rebased on your PR.

bors r=andreimatei

@craig
Contributor

craig bot commented Apr 1, 2019

Build failed

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest, as long as the divergence affects
the MVCCStats keys. This has been the case in all but one of the
reported issues.

Running the full check is prohibitive and will likely crash the
cluster due to a lack of admission control at this point, but
if we improve its underlying generator to run only N checks in
parallel it should work just fine, though it'll take a long time.

Release note: None
@tbg tbg force-pushed the fix/roachtest-consistency-checks branch from 8a4a49c to aedc595 on April 1, 2019 20:33
@tbg
Member Author

tbg commented Apr 1, 2019

Had to skip the consistency check during unit tests.

bors r=andreimatei

craig bot pushed a commit that referenced this pull request Apr 1, 2019
36241: roachtest: fail test if crdb_internal.check_consistency finds problems r=andreimatei a=tbg

This can give us confidence that we'll catch any instances of replica
divergence occurring in roachtest as long as the divergence affects the
MVCCStats keys. This has been the case in all but one reported such issues.

Running the full check is prohibitive and will likely crash the cluster due
to a lack of admission control.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig
Contributor

craig bot commented Apr 1, 2019

Build succeeded

@craig craig bot merged commit aedc595 into cockroachdb:master Apr 1, 2019
@tbg tbg deleted the fix/roachtest-consistency-checks branch April 1, 2019 21:14
tbg added a commit to tbg/cockroach that referenced this pull request Apr 8, 2021
When entries are appended to the raft log, they are also added to the
raft entry cache. Appending indexes, say, `100, 101, 102` to the raft
log has the (absolutely crucial for correctness) effect of removing any
higher indexes from the raft log (`103` and up), and the raft entry
cache must follow the same semantics, so that it is always a coherent
subset of the raft log.

It turns out it didn't do that in one edge case, namely when appending
a slice of entries that didn't fit into the cache, but whose first
entry preceded the currently cached entries. For a concrete example,
imagine the cache contains

    [101, 102, 103, 104]

and we are attempting to add a slice of large entries at indexes `[50,
51, 52]` which cannot fit in the cache at all (i.e. their aggregate size
exceeds that of the cache).

What *should* be the result of this addition is an empty cache. This
is because the true raft log now ends at index 52. In reality though,
the cache accidentally turned this operation into a no-op, as it was
using an iterator that would be invalid when positioned to an index
not currently cached (in this example, index `50`, whereas the first
cached index would be `101`).

It took us a while to identify this bug, as it manifested itself to us
only through a very rare replica inconsistency failure (cockroachdb#61990) of the
`hotspotsplits/nodes=4` roachtest (which was only caught thanks to
cockroachdb#36241) that would then take days to reproduce. When we finally
managed to get a reproduction that contained the entire history of disk
writes for the cluster, we found that the inconsistency had occurred
due to entries from a past term being replicated to the divergent
follower. Concretely, where the leader had something like

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | 9   | 9   | 9   | 9   | 9   | 9   |

the divergent follower would have this:

| idx  | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|
| term | 9   | 9   | *7* | *7* | *7* | 9   | 9   | 9   |

This made it likely that the issue was connected to a raft leadership
change in which the "old" leader for term 7 had proposed entries for
indexes up to and including at least index 105, which the term 8 and
9 leaders had then discarded and replaced by entries for term 9. Of
course, these entries should consequently never have been committed
and in particular, the term can never regress in the raft log, even
for uncommitted entries, at least when Raft is working correctly.

From the WAL writes, we were able to ascertain that the leader (who
itself had the correct writes in the log) had in fact replicated the
incorrect (term 7) entries to the follower. This meant that we were
either looking at pebble flat-out serving stale reads (highly
unlikely) or a bug in the raft entry cache. Being unable to find
issues through direct inspection, we added invariants and tweaked the
test to exacerbate what we considered the triggering conditions
(relatively frequent term changes).

A crucial assertion that allowed identifying this bug was adding
invariant checks (which will be re-added in a future commit) that
verified that the truncation had indeed removed subsequent entries.
This highlighted that we were hitting this bug in at least ~5% of
hotspotsplits runs, but possibly more frequently (the assertion did not
catch all instances of the bug). `hotspotsplits` brings the crucial
ingredients here: it's doing concentrated 256KiB inserts which puts the
raft group into mild overload; many entries pile up and get processed in
aggregate, which leads to constantly taking the path in which
truncations are erroneously ignored. However, most of the time omitting
the eviction does not yield replica inconsistency. For this to happen,
the erroneously cached entries need to be picked up by raft for sending
to a follower; they need to be the "incorrect" ones (which necessitates
a discarded log and term change at the right moment), and finally they
must be in log positions at which the correct entries actually had an
effect (which is, to be fair, mostly the case).

The history of this bug is actually interesting. The code that has
the bug was introduced in cockroachdb#36360. However, that PR fixed a much more
egregious version of this bug - prior to that PR, the raft entry
cache simply didn't evict entries at higher positions at all, i.e.
it exhibited the bug in many more scenarios, and this consequently
led to more frequent inconsistencies (as referenced in the issue).
So we went from a bad, moderately rare bug to a bad, very very rare
bug and, in this PR, hopefully, to no bug.

A rough calculation suggests that we spent around 800,000 minutes of
`n2-standard-4` instance time on reproductions related to this bug
in this instance alone, which I think boils down to around $39,000. I
like to think we got off cheap here.

Closes cockroachdb#61990.

Release note (bug fix): A rare issue that could cause replica divergence
was fixed. These issues would be reported by the replica consistency
checker, typically within 24 hours of occurrence, which would cause
nodes to terminate.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
craig bot pushed a commit that referenced this pull request Apr 12, 2021
63302: raftentry: fix accidental no-op in truncateFrom r=nvanbenschoten,erikgrinaker a=tbg

(Same commit message as above.)
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)
tbg added a commit to tbg/cockroach that referenced this pull request Apr 12, 2021
(Same commit message as above.)