kvserver: repro branch for hotspotsplits issue#62791
Closed
tbg wants to merge 22 commits intocockroachdb:masterfrom
Closed
kvserver: repro branch for hotspotsplits issue#62791tbg wants to merge 22 commits intocockroachdb:masterfrom
tbg wants to merge 22 commits intocockroachdb:masterfrom
Conversation
In cockroachdb#61990 we had this method catch a stats divergence on teardown in an otherwise successful test. The call to `t.Fatal` in that method unfortunately prevented the logs from being collected, which is not helpful. Release note: None
Perplexingly, when I run `./cockroach start-single-node --logtostderr --insecure` and against it
`bin/workload kv --max-rate=10 --read-percent=0`, I see that ~30% of the proposals
are... empty entries?! This seems unexpected to me, these should not be
emitted by raft in steady state as far as I can tell? I verified that
the term is not going up, and that the pattern persists for minutes.
```
I210330 10:30:39.144043 234 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2863 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:39.144165 234 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2864 idx=1260 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:39.240853 241 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2865 idx=1261 mlai=834 forcedErr=<nil>: ‹Put: 1617100239.237121482,0 /Table/53/1/2850286295254940972/0 (0xbd89fd278e4048b8d5bd2c8800167118198ded85ca09):›
I210330 10:30:39.340764 243 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2866 idx=1262 mlai=835 forcedErr=<nil>: ‹Put: 1617100239.337076872,0 /Table/53/1/-6356694038928198937/0 (0xbd8980a7c87c7367d772e788001671181993e2b88809):›
I210330 10:30:39.343700 243 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2867 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:39.343806 243 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2868 idx=1263 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:39.440865 245 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2869 idx=1264 mlai=836 forcedErr=<nil>: ‹Put: 1617100239.437089067,0 /Table/53/1/-7437556609187321794/0 (0xbd898098c87dab7887e83e88001671181999d8c92b09):›
I210330 10:30:39.540660 246 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2870 idx=1265 mlai=837 forcedErr=<nil>: ‹Put: 1617100239.537074620,0 /Table/53/1/6351514490197923581/0 (0xbd89fd58251cc6edfe56fd8800167118199fce71bc09):›
I210330 10:30:39.543499 246 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2871 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:39.543594 246 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2872 idx=1266 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:39.640657 247 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2873 idx=1267 mlai=838 forcedErr=<nil>: ‹Put: 1617100239.637102583,0 /Table/53/1/7437542101176764196/0 (0xbd89fd673775229ea74724880016711819a5c4bff709):›
I210330 10:30:39.740974 248 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2874 idx=1268 mlai=839 forcedErr=<nil>: ‹Put: 1617100239.737099626,0 /Table/53/1/6928540211947731053/0 (0xbd89fd60271e9d2127786d880016711819abba956a09):›
I210330 10:30:39.743917 248 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2875 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:39.744028 248 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2876 idx=1269 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:39.840652 250 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2877 idx=1270 mlai=840 forcedErr=<nil>: ‹Put: 1617100239.837097210,0 /Table/53/1/-185871630413023141/0 (0xbd8980fd6ba6bca768b85b880016711819b1b06cfa09):›
I210330 10:30:39.940911 252 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2878 idx=1271 mlai=841 forcedErr=<nil>: ‹Put: 1617100239.937072913,0 /Table/53/1/8801482429265706073/0 (0xbd89fd7a2525a31d397c59880016711819b7a5ef1109):›
I210330 10:30:39.943859 252 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2879 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:39.943977 252 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2880 idx=1272 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:40.040607 261 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2881 idx=1273 mlai=842 forcedErr=<nil>: ‹Put: 1617100240.037083544,0 /Table/53/1/136968409589023503/0 (0xbd89fd01e69c0af701030f880016711819bd9bf99809):›
I210330 10:30:40.140983 262 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2882 idx=1274 mlai=843 forcedErr=<nil>: ‹Put: 1617100240.137103713,0 /Table/53/1/-6530552929161390779/0 (0xbd8980a55ed0b61672d545880016711819c392296109):›
I210330 10:30:40.143834 262 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2883 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:40.143944 262 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2884 idx=1275 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:40.240619 253 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2885 idx=1276 mlai=844 forcedErr=<nil>: ‹Put: 1617100240.237056860,0 /Table/53/1/-5536958579423093401/0 (0xbd8980b328c582116f5d67880016711819c987535c09):›
I210330 10:30:40.341190 260 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2886 idx=1277 mlai=845 forcedErr=<nil>: ‹Put: 1617100240.337302231,0 /Table/53/1/-1684406313953726943/0 (0xbd8980e89fc961bf5cee21880016711819cf80f2d709):›
I210330 10:30:40.344021 260 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2887 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:40.344117 260 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2888 idx=1278 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:40.440918 263 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2889 idx=1279 mlai=846 forcedErr=<nil>: ‹Put: 1617100240.437037366,0 /Table/53/1/-2038798571150465276/0 (0xbd8980e3b4bb806600a704880016711819d572c93609):›
I210330 10:30:40.540687 255 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2890 idx=1280 mlai=847 forcedErr=<nil>: ‹Put: 1617100240.537087989,0 /Table/53/1/2068322336664695263/0 (0xbd89fd1cb42834ede7b1df880016711819db696ff509):›
I210330 10:30:40.543574 255 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2891 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:40.543689 255 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2892 idx=1281 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:40.642543 275 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2893 idx=1282 mlai=848 forcedErr=<nil>: ‹Put: 1617100240.637041509,0 /Table/53/1/-2575353514682209952/0 (0xbd8980dc42819f8a502960880016711819e15e9b6509):›
I210330 10:30:40.740678 256 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2894 idx=1283 mlai=849 forcedErr=<nil>: ‹Put: 1617100240.737047364,0 /Table/53/1/2429899135975807835/0 (0xbd89fd21b8bc63d7dddb5b880016711819e754934409):›
I210330 10:30:40.743399 256 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2895 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:40.743491 256 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2896 idx=1284 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:40.840984 258 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2897 idx=1285 mlai=850 forcedErr=<nil>: ‹Put: 1617100240.837095095,0 /Table/53/1/5460400427895041357/0 (0xbd89fd4bc73d423c132d4d880016711819ed4b2eb709):›
I210330 10:30:40.944030 268 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2898 idx=1286 mlai=851 forcedErr=<nil>: ‹Put: 1617100240.937078927,0 /Table/53/1/6057308348334647400/0 (0xbd89fd540fe1d98d7ea868880016711819f340d08f09):›
I210330 10:30:40.946959 268 kv/kvserver/replica_application_state_machine.go:446 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2899 applying command with forced error: ‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry
I210330 10:30:40.947064 268 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r37/1:‹/{Table/53-Max}›] 2900 idx=1287 mlai=0 forcedErr=‹kv/kvserver/replica_application_state_machine.go›:157: no-op on empty Raft entry: ‹<nil>›
I210330 10:30:41.007033 267 kv/kvserver/replica_application_cmd.go:172 ⋮ [n1,s1,r35/1:‹/Table/{39-40}›] 2901 idx=150 mlai=103 forcedErr=<nil>: ‹Put: 0,0 /Table/39/1/"\xe9\x1a\x89E1\x89K\xa9\xb4s B\xa1\xd9 \x95"/0 (0xaf8912e91a8945318
```
Release note (<category, see below>): <what> <show> <why>
Member
This keeps the WAL Release note: None
Release note (<category, see below>): <what> <show> <why>
88d61cc to
38156c0
Compare
Picks up cockroachdb/pebble#1104. Release note: None
Release note (<category, see below>): <what> <show> <why>
Release note: None
Use GCE only. We were previously throwing in random servers from a debian NTP pool as well. Might help with cockroachdb#62946. ``` ubuntu@tobias-chrony-gce-0001:~$ chronyc sources -v 210 Number of sources = 1 .-- Source mode '^' = server, '=' = peer, '#' = local clock. / .- Source state '*' = current synced, '+' = combined , '-' = not combined, | / '?' = unreachable, 'x' = time may be in error, '~' = time too variable. || .- xxxx [ yyyy ] +/- zzzz || Reachability register (octal) -. | xxxx = adjusted offset, || Log2(Polling interval) --. | | yyyy = measured offset, || \ | | zzzz = estimated error. || | | \ MS Name/IP address Stratum Poll Reach LastRx Last sample =============================================================================== ^* metadata.google.internal 2 6 37 21 +1511ns[ +12us] +/- 563us ``` Release note: None
This copies the GCE config, adjusting only the name server. The expectation is that this results in an "identical" configuration between AWS and GCE, mod the different time source. Release note: None
Release note: None
This should have fired the moment n3 tried to send the faux index 5171 to n1, as that Entry had Term 7, but follows an entry at Term 9. So the MsgApp would have had LogIndex=5170 LogTerm=9, and 9 > 7. Release note (<category, see below>): <what> <show> <why>
TODO Release note(bug fix): TODO
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 8, 2021
This productionizes a number of assertions that were added in cockroachdb#62791, while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating. If any of these assertions fires, it's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's a very likely bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine and they will have plenty of time to bake until they make it into a release. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791, while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating. If any of these assertions fires, it's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's a very likely bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine and they will have plenty of time to bake until they make it into a release. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791, while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating. If any of these assertions fires, it's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's a very likely bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine and they will have plenty of time to bake until they make it into a release. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791, while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating. If any of these assertions fires, it's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's a very likely bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine and they will have plenty of time to bake until they make it into a release. Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Apr 12, 2021
This productionizes a number of assertions that were added in cockroachdb#62791, while investigating cockroachdb#61990. We don't have a good story about whether assertions always need to be behind a build tag. I added the assertions that are dirt cheap without a build tag, and where I augmented an existing assertion I kept it behind the existing build tag gating. If any of these assertions fires, it's a good chance that we are facing potential log corruption (the one in the entry cache is often benign, but once it leads to a term regression, it's a very likely bad time), so if anything, in the (relatively distant) future these assertions should fail "gracefully" at the replica level as opposed to taking down the node; we wouldn't want to disable them. Since we can't fail assertions at the replica level now, per-node is fine and they will have plenty of time to bake until they make it into a release. Release note: None
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See #61990.