kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710
Description
Is your feature request related to a problem? Please describe.
We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in #96521 + #95159 -- those issues amounted to 23.1 changes that avoid immediately transferring leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But catchup appends can still invert the LSM, which affects non-leaseholder traffic.
This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends and prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly-restarted nodes without leaseholder impact, and also avoid latency impact on follower traffic. #80607 is slightly related -- we could apply flow tokens to raft snapshots too, to cover the general case of "catchup write traffic".
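As a rough illustration of the shape of such a mechanism (the names `catchupPacer`, `tryDeduct`, and `admitReturn` are hypothetical, not the actual kvflowcontrol API), a sender-side token pool could gate catchup MsgApps and only refill once the receiver's admission control logically admits the entries:

```go
package main

import (
	"fmt"
	"sync"
)

// catchupPacer is an illustrative sketch of flow-token pacing for catchup
// raft log appends: tokens are deducted before sending a MsgApp and returned
// only once the receiver reports the entry as logically admitted (i.e. after
// accounting for apply-time work on its LSM).
type catchupPacer struct {
	mu        sync.Mutex
	available int64 // flow tokens, in bytes
}

func newCatchupPacer(burst int64) *catchupPacer {
	return &catchupPacer{available: burst}
}

// tryDeduct attempts to deduct tokens for an append of the given size,
// returning false if the sender should hold off (backpressure).
func (p *catchupPacer) tryDeduct(bytes int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.available < bytes {
		return false
	}
	p.available -= bytes
	return true
}

// admitReturn returns tokens once the receiver has logically admitted the
// corresponding log entries.
func (p *catchupPacer) admitReturn(bytes int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.available += bytes
}

func main() {
	p := newCatchupPacer(4 << 10)     // 4 KiB of flow tokens
	fmt.Println(p.tryDeduct(3 << 10)) // true: tokens available
	fmt.Println(p.tryDeduct(3 << 10)) // false: would overdraw; sender waits
	p.admitReturn(3 << 10)            // receiver admitted the earlier entries
	fmt.Println(p.tryDeduct(3 << 10)) // true: tokens replenished
}
```

The key property is that token return is tied to logical admission rather than physical append, so a newly-restarted node with an unhealthy LSM naturally slows the sender down.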
Describe the solution you'd like
cockroach/pkg/kv/kvserver/kvflowcontrol/doc.go
Lines 310 to 328 in b84f10c
```go
// I11. What happens when a node is restarted and is being caught up rapidly
// through raft log appends? We know of cases where the initial log appends
// and subsequent state machine application can be large enough to invert the
// LSM[^9]. Imagine large block writes with a uniform key distribution; we
// may persist log entries rapidly across many replicas (without inverting
// the LSM, so follower pausing is also of no help) and during state
// machine application, create lots of overlapping files/sublevels in L0.
// - We want to pace the initial rate of log appends while factoring in the
//   effect of the subsequent state machine application on L0 (modulo [^9]).
//   We can use flow tokens for this too. In I3a we outlined how, for quorum
//   writes that include a replica on some recently-restarted node, we need to
//   wait for it to be sufficiently caught up before deducting/blocking for
//   flow tokens. Until that point we can use flow tokens on sender nodes that
//   wish to send catchup MsgApps to the newly-restarted node. Similar to the
//   steady state, flow tokens are only returned once log entries are
//   logically admitted (which takes into account any apply-time write
//   amplification, modulo [^9]). Once the node is sufficiently caught up with
//   respect to all its raft logs, it can transition into the mode described
//   in I3a where we deduct/block for flow tokens for subsequent quorum writes.
```
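The two modes described in this excerpt (catchup pacing, then the steady-state I3a regime) could be sketched as a per-replica-stream state transition. All names here are illustrative, not the actual kvflowcontrol code:

```go
package main

import "fmt"

type streamMode int

const (
	// modeCatchup: the follower recently restarted; senders pace catchup
	// MsgApps with flow tokens but don't block quorum writes on it.
	modeCatchup streamMode = iota
	// modeSteadyState: the follower is caught up; quorum writes
	// deduct/block for flow tokens as described in I3a.
	modeSteadyState
)

type replicaStream struct {
	mode         streamMode
	matchIndex   uint64 // highest log index known replicated to the follower
	lastLogIndex uint64 // leader's last raft log index
	catchupSlack uint64 // how close counts as "sufficiently caught up"
}

// maybeTransition moves the stream into steady state once the follower is
// within catchupSlack entries of the leader's log.
func (s *replicaStream) maybeTransition() {
	if s.mode == modeCatchup && s.lastLogIndex-s.matchIndex <= s.catchupSlack {
		s.mode = modeSteadyState
	}
}

func main() {
	s := &replicaStream{mode: modeCatchup, matchIndex: 100, lastLogIndex: 1000, catchupSlack: 50}
	s.maybeTransition()
	fmt.Println(s.mode == modeCatchup) // true: still far behind
	s.matchIndex = 980
	s.maybeTransition()
	fmt.Println(s.mode == modeSteadyState) // true: sufficiently caught up
}
```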
cockroach/pkg/kv/kvserver/kvflowcontrol/doc.go
Lines 353 to 375 in b84f10c
```go
// [^9]: With async raft storage writes (#17500, etcd-io/raft#8), we can
// decouple raft log appends and state machine application (see #94854 and
// #94853). So we could append at a higher rate than applying. Since
// application can be arbitrarily deferred, we can cause severe LSM
// inversions. Do we want some form of pacing of log appends then, relative
// to observed state machine application? Perhaps specifically in cases
// where we're more likely to append faster than apply, like node restarts.
// We're likely to defeat AC's IO control otherwise.
// - For what it's worth, this "deferred application with high read-amp"
//   was also a problem before async raft storage writes. Consider many
//   replicas on an LSM, all of which appended a few raft log entries
//   without applying; at apply time across all those replicas, we end up
//   inverting the LSM.
// - Since we don't want to wait below raft, one way to bound the lag between
//   appended entries and applied ones is to only release flow tokens for
//   an entry at position P once the applied state position >= P - delta.
//   We'd have to be careful: if we're not applying due to quorum loss
//   (as a result of remote node failure(s)), we don't want to deplete
//   flow tokens and cause interference on other ranges.
// - If this all proves too complicated, we could just not let state
//   machine application get significantly behind due to local scheduling
//   reasons, by using the same goroutine to do both async raft log writes
//   and state machine application.
```
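The "release tokens for position P once the applied state position >= P - delta" rule above amounts to a simple bound. A minimal sketch, with hypothetical helper names:

```go
package main

import "fmt"

// releasableUpTo returns the highest log position whose flow tokens may be
// released: applying the rule "release tokens for entry at position p only
// once appliedIndex >= p - delta", which rearranges to p <= appliedIndex + delta.
func releasableUpTo(appliedIndex, delta uint64) uint64 {
	return appliedIndex + delta
}

// canRelease reports whether tokens for the entry at position p may be
// released given the current applied index and allowed append/apply lag.
func canRelease(p, appliedIndex, delta uint64) bool {
	return p <= releasableUpTo(appliedIndex, delta)
}

func main() {
	// With applied index 100 and delta 10, tokens for entries up to
	// position 110 are releasable; entry 111 must wait for application
	// to advance, bounding how far appends can outrun application.
	fmt.Println(canRelease(110, 100, 10)) // true
	fmt.Println(canRelease(111, 100, 10)) // false
}
```

This captures why the excerpt warns about quorum loss: if application stalls for reasons unrelated to local overload, the bound stops releasing tokens and could starve other traffic unless handled specially.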
This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.
Jira issue: CRDB-25460