Improve availability during pd rolling restart #8748

@lhy1024

Enhancement Task

After a PD instance restarts, it can take some time to load region information. If a PD instance becomes leader before its region information is fully loaded and up to date, it cannot correctly serve requests that depend on that information.

When updating or upgrading PD, TiDB Operator restarts the PD instances one by one: once the current PD instance is ready, it restarts the next one. However, this readiness check does not cover region information sync.

An example sequence that leads to problems:

  1. All PD instances have been running for a while and have all region information synced.
  2. PD-2 is the leader, and an update is triggered.
  3. TiDB Operator calls the PD API to transfer the leader from PD-2 to PD-1.
  4. TiDB Operator restarts PD-2 and waits for PD-2 to be ready (but without additionally waiting for region information sync).
  5. TiDB Operator calls the PD API to transfer the leader from PD-1 back to PD-2.
  6. PD-2 becomes the leader while its region information is still not synced, which causes problems.

At this point, PD-2 has been elected leader, but it cannot serve region queries until its regions are loaded; it can only provide TSO services.
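To make the failure mode concrete, here is a toy model of it. `PDMember`, `RegionsNotLoaded`, and every method name below are invented for illustration and are not PD's real data model or API. It only shows how a member can be "ready" in the sense the restart procedure checks while still being unable to answer region queries.

```python
from dataclasses import dataclass


class RegionsNotLoaded(Exception):
    """Raised when a region query hits a member that has not loaded regions."""


@dataclass
class PDMember:
    # Toy stand-in for a PD instance; hypothetical, not the real PD model.
    name: str
    ready: bool = True            # what the restart procedure checks today
    regions_loaded: bool = True   # what it does not check

    def restart(self) -> None:
        # The process comes back and reports ready, but region
        # information still has to be loaded afterwards.
        self.ready = True
        self.regions_loaded = False

    def get_tso(self) -> str:
        # TSO does not depend on region information in this model.
        return f"tso-from-{self.name}"

    def get_region(self, key: str) -> str:
        if not self.regions_loaded:
            raise RegionsNotLoaded(f"{self.name} has not loaded regions yet")
        return f"region-for-{key}"


pd2 = PDMember("PD-2")
pd2.restart()   # step 4: restarted and reported ready
leader = pd2    # step 5: leadership transferred back right away

print(leader.get_tso())  # TSO requests still succeed
try:
    leader.get_region("some-key")  # step 6: region queries fail
except RegionsNotLoaded as err:
    print("region query failed:", err)
```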

The current workaround is to wait for a while after a PD instance is restarted during the rolling restart, until region loading is complete, and only then let it become the leader.
Typically, for a cluster with 10 million regions, five to ten minutes is enough.
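A minimal sketch of this workaround, assuming the restart driver can be expressed with three injected callables. `transfer_leader`, `restart`, and `wait_ready` are hypothetical hooks, not TiDB Operator or PD functions, and the 600-second default simply reflects the upper end of the "five to ten minutes" estimate above.

```python
import time
from typing import Callable, List, Sequence, Tuple


def rolling_restart(
    members: Sequence[str],
    leader: str,
    transfer_leader: Callable[[str, str], None],  # hypothetical hook
    restart: Callable[[str], None],               # hypothetical hook
    wait_ready: Callable[[str], None],            # hypothetical hook
    settle_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Restart members one by one and return the final leader's name."""
    for name in members:
        if name == leader:
            # Move leadership away before restarting the current leader.
            target = next(m for m in members if m != name)
            transfer_leader(name, target)
            leader = target
        restart(name)
        wait_ready(name)
        # Workaround: "ready" does not cover region loading, so wait a
        # fixed time before this member may be used as a leader again.
        sleep(settle_seconds)
    return leader


# Dry run that records the calls instead of touching a real cluster.
events: List[Tuple] = []
final_leader = rolling_restart(
    members=["PD-1", "PD-2", "PD-3"],
    leader="PD-2",
    transfer_leader=lambda src, dst: events.append(("transfer", src, dst)),
    restart=lambda name: events.append(("restart", name)),
    wait_ready=lambda name: events.append(("ready", name)),
    sleep=lambda seconds: events.append(("sleep", seconds)),
)
print(final_leader)
```

The fixed delay is simple but has to be tuned per cluster: too short and the problem above reappears, too long and the rolling restart is slowed down unnecessarily.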

In the longer term, however, it would be more flexible for PD to provide an interface for querying whether region loading is complete.
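If such an interface existed, the caller could replace the fixed wait with a bounded poll. The sketch below assumes a hypothetical `regions_loaded` callable that wraps whatever query PD might eventually expose; no such endpoint is named in this issue, so the probe is injected rather than invented. The timeout and poll interval are illustrative values.

```python
import time
from typing import Callable


def wait_for_regions_loaded(
    regions_loaded: Callable[[], bool],  # hypothetical probe, injected
    timeout_seconds: float = 600.0,
    poll_interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe reports True; return False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if regions_loaded():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_seconds)


class FakeClock:
    """Deterministic clock so the example runs without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# The probe answers False twice, then True.
fake = FakeClock()
answers = iter([False, False, True])
loaded = wait_for_regions_loaded(
    lambda: next(answers), clock=fake.time, sleep=fake.sleep
)
print(loaded, fake.now)
```

With a probe like this, a leader transfer back to the restarted member would happen only after the function returns `True`, and a `False` result could be treated as a signal to stop the rollout rather than proceed.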

Labels

affects-7.5: This bug affects the 7.5.x (LTS) versions.
affects-8.1: This bug affects the 8.1.x (LTS) versions.
affects-8.5: This bug affects the 8.5.x (LTS) versions.
affects-9.0: This bug affects the 9.0.x versions.
type/enhancement: The issue or PR belongs to an enhancement.
