ChangeReplicas(...) REMOVE replica processing fails because the scanner deletes the replica from the queue according to shouldQueue #4101

@jonyguo

[Question 1]:
Consider the following ChangeReplicas(...) sequence:
step 1. Range 1 has three replicas: A, B, and C.
step 2. Add replica D to range 1. Range 1 now has four replicas: A, B, C, and D.
step 3. Remove replica A from range 1. Range 1 now has three replicas: B, C, and D, and replica A is added to the replica_gc_queue.

However, step 3 only places the replica in the replica GC queue. If the queue holds many elements, replica A may still be waiting when the scanner scans the replicas in the store.
If a new leader lease has meanwhile taken effect for the replica, queue.go MaybeAdd -> shouldQueue returns false, 0 based on the 24h lease expiration, and the replica is removed from the queue.
If you then wait a moment and kill cockroach, it can never be started again.
Because the replica's local meta has not been deleted yet, the start command loads it and panics.

Summary:
The RemoveReplica operation in ChangeReplicas(...) adds the replica to the replica_gc_queue directly, without consulting shouldQueue.
The scanner adds replicas to, or removes them from, the replica_gc_queue according to shouldQueue.
The two paths interfere with each other.
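
The interaction described above can be sketched with a toy queue. This is only an illustration of the reported sequence, not CockroachDB's code: the `gcQueue` type, its `Add`, `MaybeAdd`, and `Contains` methods, and the `simulate` helper are hypothetical names, and the `shouldQueue` callback simply returns false to stand in for the lease check.

```go
package main

import "fmt"

// gcQueue is a hypothetical stand-in for the replica GC queue.
// It is NOT CockroachDB's implementation.
type gcQueue struct {
	pending map[string]bool
}

// Add enqueues unconditionally, as the removal path in
// ChangeReplicas(...) is described to do in this issue.
func (q *gcQueue) Add(replica string) {
	q.pending[replica] = true
}

// MaybeAdd mimics the scanner path described in this issue: it asks
// shouldQueue and drops the replica from the queue when the answer is false.
func (q *gcQueue) MaybeAdd(replica string, shouldQueue func(string) bool) {
	if shouldQueue(replica) {
		q.pending[replica] = true
		return
	}
	delete(q.pending, replica)
}

// Contains reports whether the replica is still waiting for GC.
func (q *gcQueue) Contains(replica string) bool {
	return q.pending[replica]
}

// simulate replays step 3 followed by one scanner pass and reports
// whether replica A is still queued afterwards.
func simulate() bool {
	q := &gcQueue{pending: map[string]bool{}}

	// Step 3: replica A is removed from the range and queued directly.
	q.Add("A")

	// Scanner pass: assume the lease check makes shouldQueue return false.
	leaseLooksFresh := func(string) bool { return false }
	q.MaybeAdd("A", leaseLooksFresh)

	return q.Contains("A")
}

func main() {
	fmt.Println("replica A still queued for GC:", simulate())
}
```

In this model the scanner pass silently undoes the direct add, so replica A is no longer queued and its local data would never be collected.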

[Question 2]:
The replica_gc_queue is processed asynchronously. If cockroach goes down before the queue has been fully processed,
the local meta has not been deleted yet, and the restart fails.
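
A minimal sketch of this second scenario, under loud assumptions: the `store` type and all of its methods are hypothetical, the GC queue is modeled as purely in-memory state that is lost on a crash, and startup is modeled as returning an error (rather than panicking) when it finds local meta for a replica that is no longer a member of the range. The real startup path may differ.

```go
package main

import "fmt"

// store is a hypothetical model of one node's state, not CockroachDB's Store.
type store struct {
	localMeta map[string]bool // replica meta that survives a restart
	gcPending []string        // in-memory GC queue, lost on crash (assumption)
}

// removeReplica only queues the replica; the local meta stays on disk
// until the queue is processed.
func (s *store) removeReplica(id string) {
	s.gcPending = append(s.gcPending, id)
}

// processGC deletes the local meta of every queued replica.
func (s *store) processGC() {
	for _, id := range s.gcPending {
		delete(s.localMeta, id)
	}
	s.gcPending = nil
}

// crash drops the in-memory queue but keeps the on-disk meta.
func (s *store) crash() {
	s.gcPending = nil
}

// start loads the local meta and fails on any replica that is no longer
// a member of the range (assumed failure mode for this sketch).
func (s *store) start(members map[string]bool) error {
	for id := range s.localMeta {
		if !members[id] {
			return fmt.Errorf("stale local meta for removed replica %s", id)
		}
	}
	return nil
}

// restartAfterCrash removes replica A, optionally drains the GC queue,
// then crashes and restarts against the membership B, C, D.
func restartAfterCrash(drainQueueFirst bool) error {
	s := &store{localMeta: map[string]bool{"A": true}}
	s.removeReplica("A")
	if drainQueueFirst {
		s.processGC()
	}
	s.crash()
	return s.start(map[string]bool{"B": true, "C": true, "D": true})
}

func main() {
	fmt.Println("crash before GC:", restartAfterCrash(false))
	fmt.Println("crash after GC: ", restartAfterCrash(true))
}
```

In this model the restart only succeeds when the queue was drained before the crash, which is the window the question is about.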

What do you think? Any other suggestions are welcome.
