stability: resurrecting registration cluster #6991
Closed
Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting)
@bdarnell: I'll be keeping track of actions and results here.
Quick summary:
The registration cluster is falling over repeatedly due to large snapshot sizes. Specifically, recipients of range 1 snapshots OOM during applySnapshot.
E.g., on node 2 (ec2-52-91-3-164.compute-1.amazonaws.com):
I160601 16:24:14.957105 storage/replica_raftstorage.go:610 received snapshot for range 1 at index 6818966. encoded size=1204695315, 14475 KV pairs, 384994 log entries
...
I160601 16:24:25.479455 /go/src/google.golang.org/grpc/clientconn.go:499 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 172.31.14.204:26257: getsockopt: connection refused"; Reconnecting to "ip-172-31-14-204:26257"
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
SIGABRT: abort
PC=0x7f78fae4acc9 m=7
signal arrived during cgo execution
goroutine 86 [syscall, locked to thread]:
runtime.cgocall(0x11b2d80, 0xc822780308, 0x7f7800000000)
/usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc8227802b0 sp=0xc822780280
github.com/cockroachdb/cockroach/storage/engine._Cfunc_DBApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x0, 0x0)
??:0 +0x53 fp=0xc822780308 sp=0xc8227802b0
github.com/cockroachdb/cockroach/storage/engine.dbApplyBatchRepr(0x7f78ee825b90, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:990 +0x138 fp=0xc8227803a0 sp=0xc822780308
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).ApplyBatchRepr(0xc822348000, 0xc917dce000, 0x208f, 0x4000000, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:578 +0x4f fp=0xc8227803d8 sp=0xc8227803a0
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatch).flushMutations(0xc822348000)
/go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:672 +0x146 fp=0xc822780468 sp=0xc8227803d8
github.com/cockroachdb/cockroach/storage/engine.(*rocksDBBatchIterator).Seek(0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/engine/rocksdb.go:498 +0x25 fp=0xc8227804a0 sp=0xc822780468
github.com/cockroachdb/cockroach/storage/engine.mvccGetMetadata(0x7f78fbaca080, 0xc822348030, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, 0xc938fc6000, 0xc938fc6000, 0x11, ...)
/go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:684 +0xcf fp=0xc8227805d8 sp=0xc8227804a0
github.com/cockroachdb/cockroach/storage/engine.mvccPutInternal(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
/go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:1020 +0x1f9 fp=0xc822780b20 sp=0xc8227805d8
github.com/cockroachdb/cockroach/storage/engine.mvccPutUsingIter(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbaca120, 0xc822348000, 0x7f78fbaca080, 0xc822348030, 0x0, 0xc934324b80, 0x10, 0x20, ...)
/go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:988 +0x1bb fp=0xc822780bf8 sp=0xc822780b20
github.com/cockroachdb/cockroach/storage/engine.MVCCPut(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0xc800000000, ...)
/go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:926 +0x1bb fp=0xc822780cc8 sp=0xc822780bf8
github.com/cockroachdb/cockroach/storage/engine.MVCCPutProto(0x7f78fba528b0, 0xc82000ef48, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc934324b80, 0x10, 0x20, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/engine/mvcc.go:549 +0x1c5 fp=0xc822780d98 sp=0xc822780cc8
github.com/cockroachdb/cockroach/storage.(*Replica).append(0xc8237fd9e0, 0x7f78fbac9f98, 0xc822348000, 0x0, 0xc8f810e000, 0x5dfe2, 0x5dfe2, 0x1626, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:547 +0x20f fp=0xc822780ed8 sp=0xc822780d98
github.com/cockroachdb/cockroach/storage.(*Replica).applySnapshot(0xc8237fd9e0, 0x7f78fbac9f08, 0xc822348000, 0xc88f6b6000, 0x47ce3113, 0x47ce4000, 0xc821b06cc0, 0x4, 0x4, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/storage/replica_raftstorage.go:670 +0xc06 fp=0xc8227812f8 sp=0xc822780ed8
github.com/cockroachdb/cockroach/storage.(*Replica).handleRaftReady(0xc8237fd9e0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/storage/replica.go:1425 +0x29e fp=0xc822781bd0 sp=0xc8227812f8
github.com/cockroachdb/cockroach/storage.(*Store).processRaft.func1()
/go/src/github.com/cockroachdb/cockroach/storage/store.go:2055 +0x35d fp=0xc822781f60 sp=0xc822781bd0
github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker.func1(0xc8202efce0, 0xc82206b240)
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:139 +0x52 fp=0xc822781f80 sp=0xc822781f60
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc822781f88 sp=0xc822781f80
created by github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:140 +0x62
There is no corresponding "applied snapshot for range 1" message, and the stack trace shows the crash inside applySnapshot. I can't confirm from the trace that it is for that range (the range ID is not one of the simple arguments), but it most likely is. A similar pattern appeared multiple times.
I will perform the following to try to resurrect the cluster:
- stop all nodes
- back up all rocksdb data
- change the default zone config to have only two replicas