kv: add throttling for background GC operations based on store health #57248
Description
GC is currently not throttled based on store health. We have seen situations where cancelling stuck backups removed the protected timestamp, which caused a GC spike that overloaded the LSM store on a few nodes. Restoring the cluster to a healthy state required significant manual intervention and left the customer unhappy.
We should throttle the proposals generated by the gcQueue based on the health of every store that holds a replica of the range. There is a concern that too much throttling could itself tip the stores into a different form of unhealthiness, with too many versions of a key accumulating. I think it is OK for the default throttling to tolerate moderate overload, as ingestDelayL0Threshold (which is used when adding sstables) does.
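To make the proposal concrete, here is a minimal sketch of one possible throttle, not the actual gcQueue code. It assumes the health signal is a per-store L0 file count, and it delays the next GC proposal in proportion to how far the least healthy replica store exceeds a threshold, up to a cap. The names StoreHealth, gcProposalDelay, l0Threshold, perFileDelay and maxDelay are all hypothetical and do not exist in CockroachDB.

```go
package main

import (
	"fmt"
	"time"
)

// StoreHealth is a hypothetical per-store health signal. This sketch
// assumes the L0 file count is a reasonable proxy for LSM overload.
type StoreHealth struct {
	StoreID     int
	L0FileCount int
}

// gcProposalDelay returns how long to wait before issuing the next GC
// proposal for a range (hypothetical helper). The delay is driven by the
// least healthy store among the range's replicas, grows linearly with the
// number of L0 files above l0Threshold, and is capped at maxDelay so that
// GC is slowed down but never stopped.
func gcProposalDelay(replicas []StoreHealth, l0Threshold int, perFileDelay, maxDelay time.Duration) time.Duration {
	worst := 0
	for _, s := range replicas {
		if s.L0FileCount > worst {
			worst = s.L0FileCount
		}
	}
	excess := worst - l0Threshold
	if excess <= 0 {
		return 0 // every replica store is at or below the threshold
	}
	delay := time.Duration(excess) * perFileDelay
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	// Store 2 is over the (assumed) threshold of 20 L0 files.
	replicas := []StoreHealth{
		{StoreID: 1, L0FileCount: 4},
		{StoreID: 2, L0FileCount: 25},
		{StoreID: 3, L0FileCount: 9},
	}
	fmt.Println(gcProposalDelay(replicas, 20, 100*time.Millisecond, 5*time.Second))
}
```

The cap on the delay reflects the concern above: the throttle only slows GC under overload, so old versions of a key are still eventually collected. Choosing a threshold that is only crossed under moderate overload keeps the default behaviour close to unthrottled GC.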
This issue relates to #57247, which also needs a store health signal for all replicas.
Jira issue: CRDB-2846