storage: consistency check failure during import #36861
Description
To do / understand
- how did nathan-tpcc-geo:7 end up rotating into a new log file to log the fatal error without retaining any prior ones? (answer: log rotation, see #36861 (comment))
  - should we change something about where the diff goes or tweak the logging somehow?
- take a RocksDB checkpoint on all nodes before fatal inconsistency errors (PR: storage: take an engine checkpoint during failing consistency checks #36867)
This looks very similar to #35424, so it's possible that that issue wasn't fully resolved. I was most of the way through a TPC-C 4k import when a node died due to a consistency check failure.
F190416 01:58:51.634989 172922 storage/replica_consistency.go:220 [n5,consistencyChecker,s5,r590/1:/Table/68/1/{29/4/2…-31/4/1…}] consistency check failed with 1 inconsistent replicas
goroutine 172922 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x5449800, 0x1e)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5bdd700, 0xc000000004, 0x5449860, 0x1e, 0xdc, 0xc008bfa5a0, 0x79)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3aa1620, 0xc0071899e0, 0x4, 0x2, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3aa1620, 0xc0071899e0, 0x1, 0xc000000004, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3aa1620, 0xc0071899e0, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:220 +0x6ce
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:229 +0x81b
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc0003de2a0, 0x3aa1620, 0xc0071899e0, 0xc023eeb400, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:125 +0x210
Cockroach SHA: 3ebed10
Notes:
Cluster: nathan-tpcc-geo (stopped, extended for 48h)
Cockroach nodes: 1,2,4,5,7,8,10,11
Inconsistent range: r590
Replicas: nathan-tpcc-geo:2/n2/r3, nathan-tpcc-geo:5/n4/r4, and nathan-tpcc-geo:7/n5/r1
Inconsistent replica: nathan-tpcc-geo:7/n5/r1
Replicas in zones: europe-west2-b, europe-west4-b, and asia-northeast1-b respectively
Initial Investigation
Unlike in the later reproductions of #35424, replica 1's Raft log is an exact prefix of the logs on replicas 3 and 4, so this doesn't look like the same problem we saw in those reproductions.
I haven't looked at much else yet.
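For anyone repeating the Raft log comparison, here is a rough sketch of what "exact prefix" means in the check above. The `entry` type and `isLogPrefix` helper are hypothetical simplifications written for illustration. They are not CockroachDB or etcd/raft APIs, and the sketch assumes the log entries have already been dumped from each replica into slices.

```go
package main

import "bytes"

// entry is a hypothetical, simplified stand-in for a Raft log entry.
// Real entries carry more fields; index, term, and payload are enough here.
type entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

// isLogPrefix reports whether short is an exact prefix of long: short must
// not be longer, and each of its entries must match the entry at the same
// position in long on index, term, and payload.
func isLogPrefix(short, long []entry) bool {
	if len(short) > len(long) {
		return false
	}
	for i, e := range short {
		o := long[i]
		if e.Index != o.Index || e.Term != o.Term || !bytes.Equal(e.Data, o.Data) {
			return false
		}
	}
	return true
}
```

If this returns true for replica 1 against both replica 3 and replica 4, the logs themselves agree up to the point where replica 1's log ends. That says nothing about whether the replicas applied those entries identically.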