[Question 1]:
For example, consider ChangeReplicas(...):
step 1. Range 1 has three replicas: A, B, C.
step 2. Add replica D to range 1; range 1 now has four replicas: A, B, C, D.
step 3. Delete replica A from range 1; range 1 now has three replicas: B, C, D, and replica A is added to the replica_gc_queue.
However, the cleanup in step 3 goes through the replica GC queue. If the queue holds many elements,
the scanner may scan the replicas in the store before replica A is processed, while a new leader lease has taken effect for the replica.
MaybeAdd in queue.go then calls shouldQueue, which returns (false, 0) according to the 24-hour lease expiration check, and the replica is removed from the queue.
Wait a moment, then kill the cockroach process. It can never be started again.
Because the replica's local meta has not been deleted yet, the start command loads it and panics.
Summary:
The RemoveReplica operation in ChangeReplicas(...) adds the replica to the replica_gc_queue directly, without calling shouldQueue.
The scanner adds replicas to, or removes them from, the replica_gc_queue according to shouldQueue.
The two paths interfere with each other.
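
To make the interaction in Question 1 concrete, here is a minimal Go sketch of the two paths. It is not the actual cockroach code: the replica and gcQueue types, their fields, the Add method, and the simplified shouldQueue (which returns a single bool and compares lease inactivity against a 24-hour threshold) are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// replica is a hypothetical stand-in for a store's replica.
type replica struct {
	id              string
	lastLeaseActive time.Time // assumed: last time a leader lease was active
	localMeta       bool      // true while the on-disk metadata still exists
}

// gcQueue is a toy queue keyed by replica id (illustrative only).
type gcQueue struct {
	threshold time.Duration
	items     map[string]*replica
}

func newGCQueue(threshold time.Duration) *gcQueue {
	return &gcQueue{threshold: threshold, items: map[string]*replica{}}
}

// shouldQueue is a simplified version of the check described above:
// queue a replica only if its lease has been inactive past the threshold.
func (q *gcQueue) shouldQueue(now time.Time, r *replica) bool {
	return now.Sub(r.lastLeaseActive) > q.threshold
}

// Add models the RemoveReplica path: enqueue without consulting shouldQueue.
func (q *gcQueue) Add(r *replica) {
	q.items[r.id] = r
}

// MaybeAdd models the scanner path: enqueue if shouldQueue says so,
// otherwise drop the replica from the queue.
func (q *gcQueue) MaybeAdd(now time.Time, r *replica) {
	if q.shouldQueue(now, r) {
		q.items[r.id] = r
		return
	}
	delete(q.items, r.id)
}

// simulate replays step 3 followed by one scanner pass and reports whether
// replica A is still queued and whether its local meta is still present.
func simulate() (queued bool, metaLeft bool) {
	now := time.Date(2015, 1, 1, 12, 0, 0, 0, time.UTC)
	q := newGCQueue(24 * time.Hour)

	// Replica A saw an active lease one minute ago.
	a := &replica{id: "A", lastLeaseActive: now.Add(-time.Minute), localMeta: true}

	q.Add(a)           // RemoveReplica queues A directly.
	q.MaybeAdd(now, a) // The scanner runs before A is processed and drops it.

	_, queued = q.items["A"]
	return queued, a.localMeta
}

func main() {
	queued, metaLeft := simulate()
	fmt.Printf("replica A queued: %v, local meta still present: %v\n", queued, metaLeft)
}
```

In this sketch replica A ends up outside the queue while its local meta is still present, which is the state that a later restart would load.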
[Question 2]:
The replica_gc_queue is processed asynchronously. If cockroach goes down before the queue has been fully processed,
the local meta has not been deleted yet, and the restart fails.
What do you think? Do you have any other suggestions?