Optimize suspending writes when memory usage is high #17392

@nolouch

Description

Background

There are 3 TiKV nodes:

  • tikv-2: Has the least remaining disk space (20G, < 5%) and reports some 'AlmostFull' errors. It refuses writes, and the other nodes do not send append log messages to it. (This issue lacks monitoring for handling proposals correctly if the majority of peers are disk full, as noted in "handle proposals correctly if majority peers are disk full" by hicqu · Pull Request #10671 · tikv/tik.)

  • tikv-0: Because tikv-2 does not advance its Raft log, tikv-0 cannot compact its log, so its memory usage increases. Once memory usage is high, tikv-0 also refuses writes (append).

  • tikv-1: Also cannot serve writes, because the other 2 nodes have stopped accepting appends.
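
The chain described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not TiKV's actual log GC logic: the `Progress` type, its fields, and the `progress` helper are hypothetical names, and the model assumes the log can only be compacted up to the lowest index that every peer has acknowledged.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of replication progress for one Raft group.
/// `matched` maps a peer name to the highest log index that peer has acknowledged.
struct Progress {
    matched: HashMap<String, u64>,
    last_index: u64,
}

impl Progress {
    /// Assumption of this model: entries can be compacted only up to the
    /// minimum acknowledged index, so the slowest peer pins the log.
    fn safe_compact_index(&self) -> u64 {
        self.matched.values().copied().min().unwrap_or(0)
    }

    /// Entries that must be retained (and, in this model, kept in memory).
    fn retained_entries(&self) -> u64 {
        self.last_index.saturating_sub(self.safe_compact_index())
    }
}

/// Hypothetical helper that builds a `Progress` from (peer, matched index) pairs.
fn progress(peers: &[(&str, u64)], last_index: u64) -> Progress {
    let matched = peers
        .iter()
        .map(|(name, idx)| (name.to_string(), *idx))
        .collect();
    Progress { matched, last_index }
}
```

In this model, a group where every peer is close to `last_index` retains only a handful of entries, while a group with one stalled peer retains everything written since that peer stopped acknowledging. The index values used to exercise it are made up for illustration.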

What do you want to see

Improve this case: We need to limit the blast radius of this failure. The root cause is that one node's disk is full while the others are not, yet this makes another node's memory grow and a further node report server is busy, which is unreasonable. If tikv-0 could evict its entry cache or force log compaction, it would not propagate the fault, which reduces the blast radius.
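
One possible shape for such a policy is sketched below. It is an illustration only, not TiKV's implementation: `AppendAction`, `MemoryState`, `decide_append`, and all field names are hypothetical, and the ordering (evict the entry cache first, then force log compaction, reject only as a last resort) is an assumption about what the request implies.

```rust
/// Hypothetical outcomes for an incoming append under memory pressure.
#[derive(Debug, PartialEq, Eq)]
enum AppendAction {
    /// Memory is below the high-water mark; accept as usual.
    Accept,
    /// Evicting the entry cache alone frees enough memory.
    EvictEntryCache,
    /// Evicting the cache is not enough; also force log compaction.
    ForceCompactLog,
    /// Nothing reclaimable is sufficient; reject the write (today's behavior).
    Reject,
}

/// Hypothetical snapshot of memory accounting; all fields are illustrative.
struct MemoryState {
    usage_bytes: u64,
    high_water_bytes: u64,
    entry_cache_bytes: u64,
    compactable_log_bytes: u64,
}

/// Prefer reclaiming memory over rejecting the append.
fn decide_append(m: &MemoryState) -> AppendAction {
    if m.usage_bytes <= m.high_water_bytes {
        return AppendAction::Accept;
    }
    // Bytes that must be freed to get back to the high-water mark.
    let excess = m.usage_bytes - m.high_water_bytes;
    if m.entry_cache_bytes >= excess {
        AppendAction::EvictEntryCache
    } else if m.entry_cache_bytes.saturating_add(m.compactable_log_bytes) >= excess {
        AppendAction::ForceCompactLog
    } else {
        AppendAction::Reject
    }
}
```

With this ordering, a node in tikv-0's position would reject appends only when neither the entry cache nor the compactable log could bring usage back under the threshold.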

Improve the observability: The TiDB side and the client side see backoff errors such as regionMiss or regionUnavailable. According to the logs, server is busy is reported by tikv-2, but the root cause is actually tikv-1 with its disk full.

Version

v6.1.3

Labels

type/enhancement
