storage/engine: up-replication on new cluster sometimes fails on Pebble #44631

@petermattis

Description

I've seen this up-replication failure on a few roachtests. The easiest to reproduce so far is using acceptance/cli/node-status. A failure usually occurs within 5 runs:

~ COCKROACH_STORAGE_ENGINE=pebble roachtest run --local --artifacts artifacts.pebble --count 10 '^acceptance/cli/node-status$'
...
22:10:26 cluster.go:2490: still waiting for full replication

That message then repeats forever and the cluster never up-replicates. Querying crdb_internal.ranges on this cluster shows that a number of ranges have only a single replica. The range debug page for one such range shows:


[n1,status] add - missing replica need=3, have=1, priority=10001.00
[n1,status] next replica action: add
[n1,status] allocate candidates: [ s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57]
[n1,status] add target: s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57
[n1,status] allocate candidates: []
[n1,status] error simulating allocator on replica [n1,s1,r1/1:/{Min-System/NodeL…}]: avoid up-replicating to fragile quorum: 0 of 2 live stores are able to take a new replica for the range (2 already have a replica); likely not enough nodes in cluster

I think this means that n1 does not consider n2 or n3 to be live. I need to dig in further to figure out what is going on, and why it is specific to Pebble: the same roachtest passes 10 out of 10 times when run on RocksDB.
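One quick way to confirm the liveness hypothesis is to check which nodes the cluster reports as live. The sketch below filters `cockroach node status`-style output for non-live nodes; the here-doc sample (node IDs, addresses, and the `is_live` column layout) is a stand-in I'm assuming for illustration, not captured from this failure — on a real cluster you would pipe the actual `cockroach node status` output through the same filter.

```shell
# Hypothetical liveness check: print nodes whose last column (is_live)
# is "false". The CSV sample below is assumed data standing in for real
# `cockroach node status` output on the failing cluster.
cat <<'EOF' | awk -F',' 'NR > 1 && $NF == "false" { print "not live: node " $1 }'
id,address,is_live
1,localhost:26257,true
2,localhost:26258,false
3,localhost:26259,false
EOF
```

If n2 and n3 show up as not live here while their processes are running, that would point at the liveness heartbeat path rather than the allocator itself.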
