kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms #50161
Conversation
This change increases the default value for `kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases.

This has a big effect on high contention / high concurrency workloads. As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down, and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5-15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

<todo throughput graph>
<todo pushes graph>
(10ms delay on left, 50ms delay on right)

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node.
> This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in cockroachdb#49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. by lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in cockroachdb#49218:

```
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms.
This allows high contention workloads to sustain high throughput up to much larger concurrency levels.
sumeerbhola
left a comment
The improvement looks great!
I have some questions to get a better understanding of the broader context:
- Do we have other existing workloads with abandoned intents that we can run to check if there is a regression? Especially one with Puts, since PutRequests can't batch resolution.
- What is the typical user expectation for small transaction latency (I am trying to understand how that relates to the original 10ms and the new 50ms)?
- Did you try other settings > 10ms and < 50ms, and were they less effective?
- How much work is it to move the finalizedTxnCache to be a per-store cache? I am slightly worried that the per-range size of 8 will be too small if there are only a few active or hot ranges and the rest are quiescent.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
No, I don't think so. I should have mentioned that the test I did above required me to disable transaction cleanup on rollbacks through a hack. In reality, transactions clean up after themselves, so the only real way to get abandoned intents is for nodes to crash. Here's how those two cases actually behave without the hack:

And if the async cleanup was given a second to run:
It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.
I'm not sure exactly what you mean here. Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?
I tested 30ms and it seemed just as effective for this workload. 20ms was less so IIRC.
It wouldn't be a ton of work. I was thinking we would save it until we start performing memory accounting across all lockTables on a store, though, because that's when we'll need to start sharing state between different concurrency managers. |
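A store-level version of the cache discussed here might look something like the following sketch. This is hypothetical: the real finalizedTxnCache from cockroachdb#49218 is per-range with a fixed size of 8, and the type name, FIFO eviction policy, and string txn IDs below are all illustrative choices, not the actual implementation:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// finalizedTxnCache sketches a bounded, store-wide cache of finalized
// (committed or aborted) transaction IDs. Shared across all ranges on
// a store, it would let every lockTable skip the liveness push delay
// once any one of them has discovered that a transaction is finalized.
type finalizedTxnCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // FIFO eviction order, newest at front
	ids   map[string]*list.Element // txn ID -> entry in order
}

func newFinalizedTxnCache(limit int) *finalizedTxnCache {
	return &finalizedTxnCache{
		limit: limit,
		order: list.New(),
		ids:   make(map[string]*list.Element),
	}
}

// Add records a finalized transaction, evicting the oldest entry once
// the cache is full.
func (c *finalizedTxnCache) Add(txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.ids[txnID]; ok {
		return
	}
	if c.order.Len() >= c.limit {
		oldest := c.order.Back()
		delete(c.ids, oldest.Value.(string))
		c.order.Remove(oldest)
	}
	c.ids[txnID] = c.order.PushFront(txnID)
}

// IsFinalized reports whether a waiter can skip the liveness push
// delay because the lock holder is already known to be finalized.
func (c *finalizedTxnCache) IsFinalized(txnID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.ids[txnID]
	return ok
}

func main() {
	c := newFinalizedTxnCache(2)
	c.Add("txn-a")
	c.Add("txn-b")
	c.Add("txn-c") // evicts txn-a
	fmt.Println(c.IsFinalized("txn-a"), c.IsFinalized("txn-c"))
}
```

The mutex reflects the point made above about sharing state: a store-level cache is touched by many concurrency managers at once, so it needs its own synchronization (and eventually memory accounting) in a way the per-range cache does not.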
sumeerbhola
left a comment
Thanks -- the numbers without disabling cleanup put this into better context.
> It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.
Just for my curiosity, did the 10000 rows with an integer each result in multiple ranges (I thought the size limit was 512MB) or did you also reduce the range max size?
> Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?
Yes, that helps my mental model.
> I was thinking we would save it until we start performing memory accounting across all lockTables on a store though because that's when we'll need to start sharing state between different concurrency managers.
Makes sense.
Reviewable status:
complete! 1 of 0 LGTMs obtained
TFTR!
No, it doesn't appear to be splitting. There is some variability in the test though, which might explain the 2 second difference. FWIW, if the delay was paid per intent, we'd expect the total time to go up by

bors r+
Build failed (retrying...)
Build succeeded |