engine: better randomized MVCCStats testing; fix bugs#21070
engine: better randomized MVCCStats testing; fix bugs#21070tbg merged 7 commits intocockroachdb:masterfrom
Conversation
|
Reviewed 3 of 3 files at r1, 7 of 7 files at r2, 9 of 9 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 1 of 1 files at r8. pkg/sqlmigrations/migrations.go, line 150 at r3 (raw file):
In commit message: Also, this will break anyone who is expecting that the cluster setting exists. I doubt anyone is actually using it, but at the very least we should mark this as a pkg/sqlmigrations/migrations_test.go, line 450 at r3 (raw file):
Why the quotes on only some column names? pkg/storage/engine/mvcc.go, line 276 at r1 (raw file):
Is this worth supporting/testing then? If the randomized testing is diverging from what it's trying to simulate then it seems like we should fix that instead. Maybe I'm missing a reason why we'd want to support this. I guess it's more general. pkg/storage/engine/mvcc.go, line 341 at r1 (raw file):
pkg/storage/engine/mvcc.go, line 350 at r1 (raw file):
nit, add comment that parallels the others and move this above the pkg/storage/engine/mvcc.go, line 2258 at r3 (raw file):
pkg/storage/engine/mvcc.go, line 241 at r7 (raw file):
This doesn't have an impact on pkg/storage/engine/mvcc_stats_test.go, line 171 at r1 (raw file):
nit: pkg/storage/engine/mvcc_stats_test.go, line 176 at r1 (raw file):
Add pkg/storage/engine/mvcc_stats_test.go, line 183 at r1 (raw file):
This is the only line that differs between this test and the next. Think it's worth collapsing the tests into one? pkg/storage/engine/mvcc_stats_test.go, line 240 at r1 (raw file):
Ditto. pkg/storage/engine/mvcc_stats_test.go, line 245 at r1 (raw file):
Same as above, add pkg/storage/engine/mvcc_stats_test.go, line 361 at r1 (raw file):
What's the argument for not having this error outright? pkg/storage/engine/mvcc_stats_test.go, line 743 at r1 (raw file):
nit: We might want to pull most of this into a constructor for the pkg/storage/engine/mvcc_stats_test.go, line 324 at r2 (raw file):
committing/aborting intents from other transactions? pkg/storage/engine/mvcc_stats_test.go, line 279 at r3 (raw file):
Was this intentional? pkg/storage/engine/mvcc_stats_test.go, line 309 at r3 (raw file):
nit: add some indication that this is only equal to pkg/storage/engine/mvcc_stats_test.go, line 841 at r6 (raw file):
Any reason not to fold this into the first commit to keep it all together? pkg/storage/engine/enginepb/mvcc.go, line 64 at r1 (raw file):
Think this needs a test? Comments from Reviewable |
|
Overall this looks good. I'm a big supporter of randomized testing for things like this where the number of different event permutations gets too large to reason about or test exhaustively. Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful. Comments from Reviewable |
16ce8c9 to
4f6245b
Compare
|
Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful. pkg/sqlmigrations/migrations.go, line 150 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/sqlmigrations/migrations_test.go, line 450 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Because it's necessary for camel-cased columns (or it'll look for pkg/storage/engine/mvcc.go, line 276 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Many of the bugs here were because our testing assumed that only very specific operations happen. The way I see it, at the MVCC layer, everything that returns OK should be fair game. Just because we don't happen to move intents in certain ways doesn't mean it should be allowed to return bogus results. (In this case, the decision is further made easy by the fact that the more general code is also not less intuitive). pkg/storage/engine/mvcc.go, line 341 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
MVCCStats obtained via pkg/storage/engine/mvcc.go, line 350 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc.go, line 2258 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc.go, line 241 at r7 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 171 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 176 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 183 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I thought about that too, but I think these tests are hard enough to follow that I don't mind the duplication. (All of this stuff is also covered by the randomized tests, so they're really just documentation of things once wrong and to hopefully catch any regressions in the future). pkg/storage/engine/mvcc_stats_test.go, line 240 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 245 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 361 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
There's no good argument except having to add lots of random checks. I should do it. Filed #21112 pkg/storage/engine/mvcc_stats_test.go, line 743 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'd be happy to see some of those, but I think I'll defer that refactoring to when they're actually used. pkg/storage/engine/mvcc_stats_test.go, line 324 at r2 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
no, our own intents. A snapshot txn with pkg/storage/engine/mvcc_stats_test.go, line 279 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yep. Added a comment though. pkg/storage/engine/mvcc_stats_test.go, line 309 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Added a coefficient 1. BTW, It's pkg/storage/engine/mvcc_stats_test.go, line 841 at r6 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it's nice to see which things tickled bugs and which didn't. Definitely a matter of taste, but I'd just leave as-is unless you feel strongly. pkg/storage/engine/enginepb/mvcc.go, line 64 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Good point, done. Comments from Reviewable |
|
Reviewed 13 of 13 files at r9, 7 of 7 files at r10, 9 of 9 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15. pkg/storage/engine/mvcc.go, line 276 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
👍 pkg/storage/engine/mvcc.go, line 350 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I also meant moving the pkg/storage/engine/mvcc.go, line 362 at r9 (raw file):
Make a note of when this can happen (a push, right?) pkg/storage/engine/mvcc_stats_test.go, line 183 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
👍 pkg/storage/engine/mvcc_stats_test.go, line 324 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
So does the timestamp in the txn and the timestamp passed directly to MVCCPut differ for serializable transactions? Comments from Reviewable |
|
TFTR! Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful. pkg/storage/engine/mvcc.go, line 350 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Ah, gotcha. I think I got it this time! pkg/storage/engine/mvcc.go, line 362 at r9 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/engine/mvcc_stats_test.go, line 324 at r2 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes and no. Yes in general because serializable transactions can have their timestamp pushed too, except that then of course they'll have to restart eventually. So in a way, when looking at MVCCStats, everything looks like a snapshot transaction for the most part. Comments from Reviewable |
4f6245b to
5b773c9
Compare
This commit removes `TestMVCCStatsWithRandomRuns` and replaces it with a more principled test that achieves much better coverage. This can be seen from the amount of additional tests it inspired, each of which explores one family of failures of the new `TestMVCCStatsRandomized`. Release note: None
the preceding commit has shown that AgeTo (which works as a kind of `Forward()`)is often not what we want. In fact, just moving the stats to the specified wall time is the correct choice almost all the time, perhaps except when updating the replica state's stats (as we prefer to see the highest received timestamp as opposed to the most recent one). Change the semantics accordingly and preserve the old ones via the more descriptive `.Forward()`. Simplify TestCmdClearRangeBytesThreshold: Remove variable `delta` and do not write only inline values. Release note: None
Found via `TestMVCCStatsRandomized`. We weren't correctly updating the stats in the case in which the cluster setting limited the number of keys to delete, but the GC Timestamp covered the first entry. In that case, the stats update would assume implicitly that all values were going to be deleted. Since cockroachdb#20373 and others around the same time have made the cluster setting obsolete, the course of action taken here was simply the removal of this limiting functionality. This is congruent with the fact that at the time at which it was added, `kv.gc.batch_size` was a stopgap fix. Release note (backwards-incompatible change): removed the obsolete `kv.gc.batch_size` cluster setting.
Release note: None
Unsurprisingly, this just worked (it calls `MVCCDelete` under the hood, which in turn has already been tested). Release note: None
When adding in a new value, we weren't taking into account `ms.{Key,Val}Bytes`
when subtracting the contribution of the original value. In particular, the
existing code wasn't paying attention to the fact that system keys can hold
intents as well (such as the range descriptor key would).
Exposed via `TestMVCCStatsRandomized`.
Release note: None
As expected, this just worked (ranged resolutions simply call the corresponding single-key code for ever intent they discover). Also exercises `InitPut`, predictably without problems. Release note: None
5b773c9 to
50d7474
Compare
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration that rewrites all legacy keys early in the node start sequence; as a result, a node running this commit or later will never see any legacy keys, except when they come in through a snapshot - when applying a snapshot, forcibly delete any legacy tombstone contained within. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.0, as at that point all peers have version at least v2.0 (which has this commit and will never send legacy tombstones in snapshots). Since sending legacy keys in snapshots was never intended (and isn't relied upon), none of this needs a cluster version migration. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration that rewrites all legacy keys early in the node start sequence; as a result, a node running this commit or later will never see any legacy keys, except when they come in through a snapshot - when applying a snapshot, forcibly delete any legacy tombstone contained within. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.0, as at that point all peers have version at least v2.0 (which has this commit and will never send legacy tombstones in snapshots). Since sending legacy keys in snapshots was never intended (and isn't relied upon), none of this needs a cluster version migration. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration that rewrites all legacy keys early in the node start sequence; as a result, a node running this commit or later will never see any legacy keys, except when they come in through a snapshot - when applying a snapshot, forcibly delete any legacy tombstone contained within. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.0, as at that point all peers have version at least v2.0 (which has this commit and will never send legacy tombstones in snapshots). Since sending legacy keys in snapshots was never intended (and isn't relied upon), none of this needs a cluster version migration. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration that rewrites all legacy keys early in the node start sequence; as a result, a node running this commit or later will never see any legacy keys, except when they come in through a snapshot - when applying a snapshot, forcibly delete any legacy tombstone contained within. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.0, as at that point all peers have version at least v2.0 (which has this commit and will never send legacy tombstones in snapshots). Since sending legacy keys in snapshots was never intended (and isn't relied upon), none of this needs a cluster version migration. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration that rewrites all legacy keys, to be activated when the cluster version surpasses the newly introduced version. Until then, both tombstones are checked, but only the old one written. - when applying a snapshot, forcibly delete any legacy tombstone contained within (even before the cluster version is bumped). `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.1, as at that point all peers have booted with a version that runs the migration (it's morally true in `v2.0` as well, but it's possible that the cluster version update would interleave with a write of a legacy tombstone). In v2.0, there may be both legacy and old-style tombstones, but they'll never proliferate through snapshots. This implies that if a tombstone is left around from before the `v2.0` migration runs, then it is in a replica that since hasn't been recreated (for if it had, the tombstone would have been wiped), and thus it can't trip up the consistency checker. Thus, post v2.0, the replica consistency checker can stop skipping the legacy tombstone key. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration. Once activated, only the new tombstone is written, but both tombstones are checked. Additionally, as the node restarts, all legacy tombstones are replaced by correct non-legacy ones. - when applying a snapshot, forcibly delete any legacy tombstone contained within (even before the cluster version is bumped). This prevents new legacy tombstones to trickle on from other nodes. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.1, as at that point all peers have booted with a version that runs the migration. Thus, post v2.1, the replica consistency checker can stop skipping the legacy tombstone key. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
RaftTombstoneKey was accidentally made a replicated key when it was first introduced, a problem we first realized existed when it was [included in snapshots]. At the time, we included workarounds to skip this key in various places (snapshot application, consistency checker) but of course we have failed to insert further hacks of the same kind elsewhere since (the one that prompting this PR being the stats recomputation on splits, which I'm looking into as part of cockroachdb#20181 -- unfortunately this commit doesn't seem to pertain to that problem) It feels sloppy that we didn't follow through back then, but luckily the damage appears to be limited; it is likely that the replicated existence of this key results in MVCCStats SysBytes inconsistencies, but as it happens, these stats are [already] [very] [inconsistent]. This commit does a few things: - renames the old tombstone key to `RaftIncorrectLegacyTombstoneKey` - introduces a (correctly unreplicated) `RaftTombstoneKey` - introduces a migration. Once activated, only the new tombstone is written, but both tombstones are checked. Additionally, as the node restarts, all legacy tombstones are replaced by correct non-legacy ones. - when applying a snapshot, forcibly delete any legacy tombstone contained within (even before the cluster version is bumped). This prevents new legacy tombstones to trickle on from other nodes. `RaftIncorrectLegacyTombstoneKey` can be purged from the codebase in binaries post v2.1, as at that point all peers have booted with a version that runs the migration. Thus, post v2.1, the replica consistency checker can stop skipping the legacy tombstone key. Fixes cockroachdb#12154. Release note: None [included in snapshots]: cockroachdb#12131 [already]: cockroachdb#20554 [very]: cockroachdb#20996 [inconsistent]: cockroachdb#21070
This PR removes
TestMVCCStatsWithRandomRunsand replaces it with amore principled test that achieves much better coverage. This can be seen
from the amount of additional tests it inspired, each of which explores one
family of failures of the new
TestMVCCStatsRandomized.Release note: None