
kv/concurrency: batch intent resolution of pushed intents from same txn #104784

Merged
craig[bot] merged 4 commits into cockroachdb:master from nvb:nvanbenschoten/batchPushResolution
Jun 20, 2023

Conversation

@nvb
Contributor

@nvb nvb commented Jun 13, 2023

Fixes #103126.

This commit extends the infrastructure introduced in #49218 for transaction timestamp pushes. It avoids redundant txn pushes of PENDING transactions and batches the resolution of PENDING intents. This breaks the O(num_intents) work performed by high-priority scans (e.g. backups) over intent-heavy keyspaces into something closer to O(num_ranges) work.

The commit accomplishes its goals by adding a second per-Range LRU cache of transactions that are PENDING and are known to have been pushed to higher timestamps. We use this cache for two purposes:

  1. when we are performing a non-locking read and we see a lock at a conflicting timestamp that is held by a pushed txn above our read timestamp, we neither wait out the kv.lock_table.coordinator_liveness_push_delay (50 ms) nor push the transaction's record (RPC to leaseholder of pushee's txn record range).
  2. we use the existence of a transaction in the cache as an indication that it may have written multiple intents, so we begin deferring intent resolution to enable batching.
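To make the cache's role concrete, here is a minimal, hypothetical sketch of a per-Range LRU cache of pushed PENDING transactions. All names (`pushedTxnCache`, `canSkipPush`) and the int64 timestamps are simplifications for illustration; the real txnStatusCache in the concurrency package differs.

```go
package main

import (
	"container/list"
	"fmt"
)

// pushedTxn records a transaction known to be PENDING but pushed to a
// higher timestamp. Timestamps are simplified to int64 for illustration.
type pushedTxn struct {
	id       string
	pushedTS int64
}

// pushedTxnCache is a hypothetical, simplified stand-in for the per-Range
// LRU cache described above.
type pushedTxnCache struct {
	maxSize int
	ll      *list.List               // front = most recently used
	m       map[string]*list.Element // txn id -> list element
}

func newPushedTxnCache(maxSize int) *pushedTxnCache {
	return &pushedTxnCache{
		maxSize: maxSize,
		ll:      list.New(),
		m:       make(map[string]*list.Element),
	}
}

// add records a pushed PENDING txn, evicting the least recently used
// entry once the cache exceeds its size limit.
func (c *pushedTxnCache) add(t pushedTxn) {
	if e, ok := c.m[t.id]; ok {
		e.Value = t
		c.ll.MoveToFront(e)
		return
	}
	c.m[t.id] = c.ll.PushFront(t)
	if c.ll.Len() > c.maxSize {
		old := c.ll.Back()
		c.ll.Remove(old)
		delete(c.m, old.Value.(pushedTxn).id)
	}
}

// canSkipPush reports whether a non-locking read at readTS can proceed
// past an intent from the given txn without a push RPC: true if the txn
// is already known to have been pushed above readTS.
func (c *pushedTxnCache) canSkipPush(id string, readTS int64) bool {
	e, ok := c.m[id]
	if !ok {
		return false
	}
	c.ll.MoveToFront(e)
	return e.Value.(pushedTxn).pushedTS > readTS
}

func main() {
	c := newPushedTxnCache(2)
	c.add(pushedTxn{id: "txn1", pushedTS: 100})
	fmt.Println(c.canSkipPush("txn1", 90))  // true: pushed above read ts
	fmt.Println(c.canSkipPush("txn1", 150)) // false: read ts above push
	c.add(pushedTxn{id: "txn2", pushedTS: 100})
	c.add(pushedTxn{id: "txn3", pushedTS: 100})
	fmt.Println(c.canSkipPush("txn1", 90)) // false: evicted (LRU)
}
```

The LRU bound keeps the cache's memory usage constant per Range while still capturing the common case of one large transaction holding many intents.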

Together, these two changes make us much more effective at pushing transactions with a large number of intents. The following example (from #103126) demonstrates this:

```sql
-- SETUP: run in a 3-node GCP roachprod cluster

--- session 1 - write 100k intents
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 100000);

--- session 2 - push intents with high-priority txn without uncertainty interval
BEGIN PRIORITY HIGH AS OF SYSTEM TIME '-1ms';
SELECT count(*) FROM keys;

--- BEFORE this PR and before #103265 (i.e. v23.1.2): takes ~7.1ms per intent
Time: 714.441s total

--- BEFORE this PR: takes ~1.5ms per intent
Time: 151.880s total

--- AFTER this PR: takes ~24μs per intent
Time: 2.405s
```

The change does have an unfortunate limitation. Deferred intent resolution is only currently enabled for non-locking readers without uncertainty intervals. Readers with uncertainty intervals must contend with the possibility of pushing a conflicting intent up into their uncertainty interval and causing more work for themselves, which is avoided with care by the lockTableWaiter but difficult to coordinate through the txnStatusCache. This limitation is acceptable because the most important case here is optimizing the Export requests issued by backup.

This limitation also hints at the long-term plan for this interaction, which is that non-locking readers can ignore known pending intents without the need to even resolve those intents (see #94730). This will require a request-scoped cache of pending, pushed transactions, which does not have the same problems with uncertainty intervals.

Release note (performance improvement): Backups no longer perform work proportional to the number of pending intents that they encounter, so they are over 100x faster when encountering long-running, bulk writing transactions.

@nvb nvb requested a review from arulajmani June 13, 2023 17:28
@nvb nvb marked this pull request as ready for review June 13, 2023 17:28
@nvb nvb requested a review from a team as a code owner June 13, 2023 17:28
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@arulajmani arulajmani left a comment


:lgtm: modulo comments + our discussion from yesterday to not call into tryUpdateLockLocked from resumeScan for lock updates that correspond to pushed transactions.

Quite excited to see this perf improvement land! Thanks for doing it 💯

Reviewed 2 of 2 files at r1, 14 of 14 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/concurrency/concurrency_control.go line 729 at r2 (raw file):

	// This is used by the lock table in a best-effort manner to avoid waiting on
	// locks of finalized or pushed transactions and telling the caller via
	// lockTableGuard.ResolveBeforeEvaluation to resolve a batch of intents.

This comment seems stale -- we've since renamed the method to be ResolveBeforeScanning. Should we update it while we're here?


pkg/kv/kvserver/concurrency/lock_table.go line 1710 at r2 (raw file):

		// opportunities.
		//
		// This fast-path is only enabled for readers without uncertainty intervals,

For readers that do have uncertainty intervals, would it be correct to use g.ts.Forward(g.txn.GlobalUncertaintyLimit) instead?

I realize it would not be as precise as taking local limits into consideration, but is there benefit to doing so instead of completely skipping this optimization?


pkg/kv/kvserver/concurrency/lock_table.go line 1722 at r2 (raw file):

				up := roachpb.MakeLockUpdate(pushedTxn, roachpb.Span{Key: l.key})
				if l.holder.holder[lock.Replicated].txn == nil {
					// Only held unreplicated. Update lock directly in case other

Noting that once non-locking readers stop blocking on Exclusive locks, we should be able to guarantee that there is an intent on the key if we ever hit this fast path. At that point, we should be able to assert that the lock is not only held as unreplicated.

Maybe worth capturing in a TODO, while we have all this paged in?


pkg/kv/kvserver/concurrency/lock_table.go line 1724 at r2 (raw file):

					// Only held unreplicated. Update lock directly in case other
					// waiting readers can benefit from the pushed timestamp.
					_, _ = l.tryUpdateLockLocked(up)

Following on from our conversation yesterday, where we decided we wouldn't call tryUpdateLockLocked from resumeScan for pushed transactions (we'll only do so for finalized transactions).

I don't think we want the same behaviour for unreplicated locks -- that's because they don't trigger resolve intent requests, so they won't be calling into UpdateLocks as a result. Mostly a note to myself that this should compose well with #104782.


pkg/kv/kvserver/concurrency/lock_table.go line 2858 at r2 (raw file):

		}

		if str == lock.None && !g.hasUncertaintyInterval() {

Consider adding a comment that explains this g.hasUncertaintyInterval check (or points to the comment in tryActiveWait).


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/resolve_pushed_intents line 95 at r2 (raw file):

[4] sequence req1: lock wait-queue event: done waiting
[4] sequence req1: conflicted with ‹00000002-0000-0000-0000-000000000000› on ‹"a"› for 0.000s
[4] sequence req1: resolving a batch of 9 intent(s)

This would have made a nice before and after 💯


pkg/kv/kvserver/concurrency/testdata/concurrency_manager/resolve_pushed_intents_without_adding_to_lock_table line 224 at r2 (raw file):


sequence req=req2
----

Guess this is the before and after I was asking for 😂


pkg/kv/kvserver/concurrency/testdata/lock_table/resolve_pushed_txn_locks line 125 at r2 (raw file):


print
----

Once we implement the changes to not call into tryUpdateLock from resumeScan for lock updates that correspond to unfinalized transactions, we should see the lock on c back here as well, right?

arulajmani and others added 3 commits June 19, 2023 10:56
We'll make use of these when constructing lock modes.

Epic: none

Release note: None
@nvb nvb force-pushed the nvanbenschoten/batchPushResolution branch from a89775b to f7d83cd on June 19, 2023 18:57
Contributor Author

@nvb nvb left a comment


TFTR!

I've also added a cluster setting (in a new commit) so that we can disable this behavior online if we find that it's a problem. Doing so required me to plumb in cluster settings into the lock table, so I cherry-picked the third commit from #104620 into this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/concurrency/concurrency_control.go line 729 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This comment seems stale -- we've since renamed the method to be ResolveBeforeScanning. Should we update it while we're here?

Done.


pkg/kv/kvserver/concurrency/lock_table.go line 1710 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

For readers that do have uncertainty intervals, would it be correct to use g.ts.Forward(g.txn.GlobalUncertaintyLimit) instead?

I realize it would not be as precise as taking local limits into consideration, but is there benefit to doing so instead of completely skipping this optimization?

This used to be enough, though we had to fight to keep the pushed timestamp below hlc.Now() to avoid pushing the time into the future, needing to mark it as synthetic, and causing the pushee to commit-wait. See the logic and commentary in pushHeader.

Ever since 186e9a8, we also need to include a ClockWhilePending observation to advance the pushed intent's local_timestamp. If we don't advance the local_timestamp then it will still be seen as uncertain even if its timestamp is above a reader's local uncertainty limit.

So there is a solution here along the lines that you're mentioning, but it comes with trade-offs and likely some risk due to the new interactions. Given that this is primarily for BACKUP and we'd like to rework all of this soon (in #94730), I decided to keep this simple but limited instead of more complete, but complex, error-prone, and slower to backport.


pkg/kv/kvserver/concurrency/lock_table.go line 1722 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Noting that once non-locking readers stop blocking on Exclusive locks, we should be able to guarantee that there is an intent on the key if we ever hit this fast path. At that point, we should be able to assert that the lock is not only held as unreplicated.

Maybe worth capturing in a TODO, while we have all this paged in?

Good point, done.


pkg/kv/kvserver/concurrency/lock_table.go line 1724 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Following on from our conversation yesterday, where we decided we wouldn't call tryUpdateLockLocked from resumeScan for pushed transactions (we'll only do so for finalized transactions).

I don't think we want the same behaviour for unreplicated locks -- that's because they don't trigger resolve intent requests, so they won't be calling into UpdateLocks as a result. Mostly a note to myself that this should compose well with #104782.

Done.


pkg/kv/kvserver/concurrency/lock_table.go line 2858 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Consider adding a comment that explains this g.hasUncertaintyInterval check (or points to the comment in tryActiveWait).

Done.


pkg/kv/kvserver/concurrency/testdata/lock_table/resolve_pushed_txn_locks line 125 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Once we implement the changes to not call into tryUpdateLock from resumeScan for lock updates that correspond to unfinalized transactions, we should see the lock on c back here as well, right?

Right, this has now changed.

This commit adds a new `kv.lock_table.batch_pushed_lock_resolution.enabled` cluster
setting which controls whether the lock table should allow non-locking readers
to defer and batch the resolution of conflicting locks whose holder is known to
be pending and have been pushed above the reader's timestamp.

This is a safeguard against bugs or behavior changes as we quickly backport a
fix for cockroachdb#103126.

Epic: None
Release note: None
@nvb nvb force-pushed the nvanbenschoten/batchPushResolution branch from f7d83cd to 71d6e46 on June 19, 2023 19:37
@nvb
Contributor Author

nvb commented Jun 20, 2023

bors r=arulajmani

@craig
Contributor

craig bot commented Jun 20, 2023

Build succeeded.


Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: intents from transactions that have been successfully PUSH_TIMESTAMP-ed is O(num intents)

3 participants