
kv/concurrency: increase kv.lock_table.coordinator_liveness_push_delay to 50ms#50161

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/livenessPushDelay
Jun 16, 2020

Conversation

@nvb
Contributor

@nvb nvb commented Jun 12, 2020

This change increases the default value for kv.lock_table.coordinator_liveness_push_delay. Increasing this delay is a big win for contended workloads because it allows transactions to queue locally to conflicting locks (in the lockTable's lockWaitQueue) instead of remotely next to the conflicting lock holders' txn records (in the txnWaitQueue) in more cases.

This has a big effect on high contention / high concurrency workloads. As we'll see in the following experiment, YCSB workload A would change regimes somewhere between 200 and 256 concurrent threads. Operations would slow down just enough that waiting txns would start performing liveness pushes and become much less effective at queuing and responding to transaction commits. This would cause the rest of the operations to slow down and suddenly everyone was pushing and the front of every lockWaitQueue was waiting in the txnWaitQueue.

The first group of selects and updates on the left (50/50 ratio, so they overlap) shows YCSB workload A run with 256 threads and a 10ms liveness push delay. We expect ~36k qps, but we're only sustaining between 5k and 15k qps. When we bump this delay to 50ms on the right, we see much higher throughput, stabilizing at ~36k qps. The workload is able to stay in the efficient regime (no liveness pushes, listening directly to lock state transitions) for far longer. I've tested up to 512 concurrent workers on the same workload and never seen us enter the slow regime.

Screen Shot 2020-06-12 at 5 08 30 PM

Screen Shot 2020-06-12 at 5 18 27 PM

10ms delay on left, 50ms delay on right

This partially addresses an existing TODO:

> We could increase this default to somewhere on the order of the
> transaction heartbeat timeout (5s) if we had a better way to avoid paying
> the cost on each of a transaction's abandoned locks and instead only pay
> it once per abandoned transaction per range or per node. This could come
> in a few different forms, including:
> - a per-store cache of recently detected abandoned transaction IDs
> - a per-range reverse index from transaction ID to locked keys
>
> TODO(nvanbenschoten): increasing this default value.

The finalizedTxnCache (introduced in #49218) gets us part of the way here. It allows us to pay the liveness push delay cost once per abandoned transaction per range instead of once per each of an abandoned transaction's locks. This helped us to feel comfortable increasing the default delay from the original 10ms to the current 50ms. Still, to feel comfortable increasing this further, we'd want to improve this cache (e.g. lifting it to the store level) to reduce the cost to once per abandoned transaction per store.

To confirm that we're not regressing performance noticeably here, we run the same tests that we ran in #49218:

```
--- BEFORE
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '10ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.574853s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 33.155231s

--- AFTER
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 5.547465s

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 35.429984s
```

Release note (performance improvement): Transaction liveness pushes are now delayed by 50ms, up from 10ms. This allows high contention workloads to sustain high throughput up to much larger concurrency levels.

@nvb nvb requested a review from sumeerbhola June 12, 2020 22:15
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@sumeerbhola sumeerbhola left a comment


The improvement looks great!
I have some questions to get a better understanding of the broader context:

  • Do we have other existing workloads with abandoned intents that we can run to check if there is a regression? Especially one with Puts, since PutRequests can't batch resolution.
  • What is the typical user expectation for small transaction latency (I am trying to understand how that relates to the original 10ms and the new 50ms)?
  • Did you try other settings > 10ms and < 50ms, and were they less effective?
  • How much work is it to move the finalizedTxnCache to be a per store cache? I am slightly worried that the per-range size of 8 will be too small if there are only a few active or hot ranges and the rest are quiescent.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

@nvb
Contributor Author

nvb commented Jun 15, 2020

> Do we have other existing workloads with abandoned intents that we can run to check if there is a regression? Especially one with Puts, since PutRequests can't batch resolution.

No, I don't think so. I should have mentioned that the test I did above required me to disable transaction cleanup on rollbacks through a hack. In reality, transactions clean up after themselves, so the only real way to get abandoned intents is for nodes to crash. Here's how those two cases actually behave without the hack:

```
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 847.812ms

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 787.806ms
```

And if the async cleanup was given a second to run:

```
SET CLUSTER SETTING kv.lock_table.coordinator_liveness_push_delay = '50ms';

CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000); ROLLBACK;
SELECT pg_sleep(1);
SELECT * FROM keys;

  k
-----
(0 rows)

Time: 2.835ms

CREATE TABLE keys2 (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys2 SELECT generate_series(1, 10000); ROLLBACK;
SELECT pg_sleep(1);
INSERT INTO keys2 SELECT generate_series(1, 10000);

INSERT 10000

Time: 309.1ms
```

> since PutRequests can't batch resolution

It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.

> What is the typical user expectation for small transaction latency

I'm not sure exactly what you mean here. Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?

> Did you try other settings > 10ms and < 50ms, and were they less effective?

I tested 30ms and it seemed just as effective for this workload. 20ms was less so IIRC.

> How much work is it to move the finalizedTxnCache to be a per store cache? I am slightly worried that the per-range size of 8 will be too small if there are only a few active or hot ranges and the rest are quiescent.

It wouldn't be a ton of work. I was thinking we would save it until we start performing memory accounting across all lockTables on a store though because that's when we'll need to start sharing state between different concurrency managers.

Collaborator

@sumeerbhola sumeerbhola left a comment


Thanks -- the numbers without disabling cleanup put this into better context.

> It's worth noting that while PutRequests can't batch resolution, they still do use the finalizedTxnCache, so the delay will still only be paid once per abandoned transaction per range. That's why the INSERT INTO cleanup case (with the abandoned intents) only gets about 6% slower and not ~5x slower.

Just for my curiosity, did the 10000 rows with an integer each result in multiple ranges (I thought the size limit was 512MB) or did you also reduce the range max size?

> Users typically want to get into single-digit ms latencies for single-statement txns. Does that help?

Yes, that helps my mental model.

> I was thinking we would save it until we start performing memory accounting across all lockTables on a store though because that's when we'll need to start sharing state between different concurrency managers.

Makes sense.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@nvb
Contributor Author

nvb commented Jun 15, 2020

TFTR!

> Just for my curiosity, did the 10000 rows with an integer each result in multiple ranges (I thought the size limit was 512MB) or did you also reduce the range max size?

No, it doesn't appear to be splitting. There is some variability in the test, though, which might explain the 2-second difference. FWIW, if the delay were paid per intent, we'd expect the total time to go up by 10,000 * (50ms - 10ms) = 400s.

bors r+

@craig
Contributor

craig bot commented Jun 15, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Jun 16, 2020

Build succeeded

@craig craig bot merged commit 96feb13 into cockroachdb:master Jun 16, 2020
@nvb nvb deleted the nvanbenschoten/livenessPushDelay branch June 17, 2020 16:44