kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms #50161
Conversation
This change increases the default value for `kv.lock_table.coordinator_liveness_push_delay`. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases.

This has a big effect on high contention / high concurrency workloads. As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down, and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5-15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

<todo throughput graph>
<todo pushes graph>
(10ms delay on left, 50ms delay on right)

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node.
> This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in cockroachdb#49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. by lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in cockroachdb#49218:

```
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;
  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);
INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms.
This allows high contention workloads to sustain high throughput up to much larger concurrency levels.
sumeerbhola
left a comment
The improvement looks great!
I have some questions to get a better understanding of the broader context:
- Do we have other existing workloads with abandoned intents that we can run to check if there is a regression? Especially one with Puts, since PutRequests can't batch resolution.
- What is the typical user expectation for small transaction latency (I am trying to understand how that relates to the original 10ms and the new 50ms)?
- Did you try other settings > 10ms and < 50ms, and were they less effective?
- How much work is it to move the finalizedTxnCache to be a per-store cache? I am slightly worried that the per-range size of 8 will be too small if there are only a few active or hot ranges and the rest are quiescent.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
No, I don't think so. I should have mentioned that the test I did above required me to disable transaction cleanup on rollbacks through a hack. In reality, transactions clean up after themselves, so the only real way to get abandoned intents is for nodes to crash. Here's how those two cases actually behave without the hack:

And if the async cleanup was given a second to run:
It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.
I'm not sure exactly what you mean here. Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?
I tested 30ms and it seemed just as effective for this workload. 20ms was less so IIRC.
It wouldn't be a ton of work. I was thinking we would save it until we start performing memory accounting across all lockTables on a store, though, because that's when we'll need to start sharing state between different concurrency managers. |
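A store-level version of the cache discussed here might look something like the following sketch. This is hypothetical: the real finalizedTxnCache from cockroachdb#49218 is per-range with a fixed size of 8, and the type name, FIFO eviction policy, and string txn IDs below are all illustrative choices, not the actual implementation:

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// finalizedTxnCache sketches a bounded, store-wide cache of finalized
// (committed or aborted) transaction IDs. Shared across all ranges on
// a store, it would let every lockTable skip the liveness push delay
// once any one of them has discovered that a transaction is finalized.
type finalizedTxnCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // FIFO eviction order, newest at front
	ids   map[string]*list.Element // txn ID -> entry in order
}

func newFinalizedTxnCache(limit int) *finalizedTxnCache {
	return &finalizedTxnCache{
		limit: limit,
		order: list.New(),
		ids:   make(map[string]*list.Element),
	}
}

// Add records a finalized transaction, evicting the oldest entry once
// the cache is full.
func (c *finalizedTxnCache) Add(txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.ids[txnID]; ok {
		return
	}
	if c.order.Len() >= c.limit {
		oldest := c.order.Back()
		delete(c.ids, oldest.Value.(string))
		c.order.Remove(oldest)
	}
	c.ids[txnID] = c.order.PushFront(txnID)
}

// IsFinalized reports whether a waiter can skip the liveness push
// delay because the lock holder is already known to be finalized.
func (c *finalizedTxnCache) IsFinalized(txnID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.ids[txnID]
	return ok
}

func main() {
	c := newFinalizedTxnCache(2)
	c.Add("txn-a")
	c.Add("txn-b")
	c.Add("txn-c") // evicts txn-a
	fmt.Println(c.IsFinalized("txn-a"), c.IsFinalized("txn-c"))
}
```

The mutex reflects the point made above about sharing state: a store-level cache is touched by many concurrency managers at once, so it needs its own synchronization (and eventually memory accounting) in a way the per-range cache does not.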
sumeerbhola
left a comment
Thanks -- the numbers without disabling cleanup put this into better context.
> It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.
Just for my curiosity, did the 10000 rows with an integer each result in multiple ranges (I thought the size limit was 512MB) or did you also reduce the range max size?
> Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?
Yes, that helps my mental model.
> I was thinking we would save it until we start performing memory accounting across all lockTables on a store though because that's when we'll need to start sharing state between different concurrency managers.
Makes sense.
Reviewable status:
complete! 1 of 0 LGTMs obtained
TFTR!
No, it doesn't appear to be splitting. There is some variability in the test though, which might explain the 2 second difference. FWIW, if the delay was paid per intent, we'd expect the total time to go up by

bors r+
Build failed (retrying...)
Build succeeded |