storage/engine: up-replication on new cluster sometimes fails on Pebble #44631
Closed
Description
I've seen this up-replication failure on a few roachtests. The easiest to reproduce so far is using acceptance/cli/node-status. A failure usually occurs within 5 runs:
~ COCKROACH_STORAGE_ENGINE=pebble roachtest run --local --artifacts artifacts.pebble --count 10 '^acceptance/cli/node-status$'
...
22:10:26 cluster.go:2490: still waiting for full replication
That message then repeats forever and the cluster never up-replicates. Querying crdb_internal.ranges on this cluster shows that a number of ranges have only a single replica. The range debug page for one such range shows:
[n1,status] add - missing replica need=3, have=1, priority=10001.00
[n1,status] next replica action: add
[n1,status] allocate candidates: [ s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57]
[n1,status] add target: s2, valid:true, fulldisk:false, necessary:false, diversity:0.00, converges:0, balance:1, rangeCount:13, queriesPerSecond:1.57
[n1,status] allocate candidates: []
[n1,status] error simulating allocator on replica [n1,s1,r1/1:/{Min-System/NodeL…}]: avoid up-replicating to fragile quorum: 0 of 2 live stores are able to take a new replica for the range (2 already have a replica); likely not enough nodes in cluster
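The "fragile quorum" error above is the allocator declining to add a replica when doing so would leave the range at an even replica count with no live stores left to reach an odd count afterwards. A minimal sketch of the underlying quorum arithmetic, in Python for illustration (the real check lives in CockroachDB's Go allocator; these helper names are hypothetical):

```python
def quorum(n: int) -> int:
    """Raft quorum (majority) size for an n-replica group."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Number of replicas that can fail while the group keeps quorum."""
    return n - quorum(n)

# Going from 1 replica to 2 raises the quorum requirement from 1 to 2
# without adding any fault tolerance, so stalling at an even count makes
# the range *less* available than before. The allocator therefore wants
# enough live candidate stores to carry the range on to 3 replicas.
for n in (1, 2, 3):
    print(f"replicas={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

This is why, with only one live candidate store visible to n1, the simulated allocation fails rather than adding a second replica.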
I think this means that n1 is not considering n2 or n3 to be live. Need to dig in further to understand what is going on, and why it is specific to Pebble: the same roachtest run on RocksDB passes 10 out of 10 times.