Skip to content

kv: add LockTimeout option to BatchRequest#68026

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/lockTimeout
Aug 10, 2021
Merged

kv: add LockTimeout option to BatchRequest#68026
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/lockTimeout

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Jul 24, 2021

Informs #67513.

This commit introduces a new lock_timeout field on the BatchRequest Header struct.

lock_timeout specifies the maximum amount of time that the batch request will wait while attempting to acquire a lock on a key or while blocking on an existing lock in order to perform a non-locking read on a key. The time limit applies separately to each lock acquisition attempt. If the timeout elapses when waiting for a lock, a WriteIntentError will be returned.

Unlike in some other systems like PostgreSQL, where non-locking reads do not wait on conflicting locks, in CockroachDB, non-locking reads do wait for conflicting locks to be released. Because of this, the lock_timeout configuration applies to non-locking reads in read-write and read-only transactions as well.

Only the (default) Block wait policy will allow a request to wait on conflicting locks, so the timeout only makes sense in conjunction with the Block wait policy. The Error wait policy will throw an error immediately if a conflicting lock held by an active transaction is encountered, so this timeout can never be hit with an Error wait policy.

A value of zero disables the timeout.

@nvb nvb requested review from andreimatei and sumeerbhola July 24, 2021 02:07
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/lockTimeout branch 3 times, most recently from b40ca3a to 38f2d63 Compare July 25, 2021 04:19
nvb added a commit to nvb/cockroach that referenced this pull request Jul 26, 2021
Fixes cockroachdb#67513.

In cockroachdb#68026, we added support for a new `lock_timeout` field on the
`BatchRequest.Header` struct.

A batch's lock_timeout specifies the maximum amount of time that it will
wait while attempting to acquire a lock on a key or while blocking on an
existing lock in order to perform a non-locking read on a key. The time
limit applies separately to each lock acquisition attempt. If the
timeout elapses when waiting for a lock, a WriteIntentError is returned.

This extension to the KV API allows us to properly support the
`lock_timeout` session variable in SQL, which matches these semantics
closely. When configured, this session variable aborts any statement
that waits longer than the specified amount of time while attempting to
acquire a single lock.

The only deviation from Postgres' behavior is that the lock_timeout
configuration applies to non-locking reads in read-write and read-only
transactions. This is because unlike in Postgres, where where
non-locking reads do not wait on conflicting locks, in CockroachDB,
non-locking reads do wait for conflicting locks to be released.

Release note (sql change): The lock_timeout session variable is now
supported. The configuration can be used to abort a query with an error
if it waits longer than the configured duration while blocking on a
single row-level lock acquisition.
@nvb nvb requested review from aayushshah15 and removed request for andreimatei July 28, 2021 19:30
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read the tests yet. Looks good, with just some small questions.

Reviewed 2 of 2 files at r1, 10 of 18 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 299 at r2 (raw file):

						ctx, req.LockTimeout,
						func(ctx context.Context) *Error { return w.pushLockTxn(ctx, req, state) },
						func(ctx context.Context) *Error { return w.pushLockTxnAfterTimeout(ctx, req, state) },

If it timed out while doing the push, I don't understand the reason for doing another push with error wait. The probability of success would be low. Is there extra information we need, like the latest state of the pushee?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 456 at r2 (raw file):

		return doWithTimeoutAndFallback(
			ctx, req.LockTimeout,
			func(ctx context.Context) *Error { return w.pushLockTxn(ctx, req, state) },

same question here


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 777 at r2 (raw file):

		// NOTE: we look at timeoutCtx.Err() and not err to avoid the
		// potential for bugs if context cancellation is not propagated
		// correctly on some error paths.

So you are worried about err being nil, or err not correctly specifying that it was due to the deadline being exceeded?


pkg/kv/kvserver/concurrency/lock_table_waiter_test.go, line 477 at r1 (raw file):

	testutils.RunTrueAndFalse(t, "lockHeld", func(t *testing.T, lockHeld bool) {
		testutils.RunTrueAndFalse(t, "pusheeActive", func(t *testing.T, pusheeActive bool) {
			if !lockHeld && !pusheeActive {

can you add a comment like
// !lockHeld means a lock reservation, so is only possible when pusheeActive is true.

@nvb nvb force-pushed the nvanbenschoten/lockTimeout branch from 38f2d63 to e976aa8 Compare August 4, 2021 17:13
@nvb nvb requested review from a team as code owners August 4, 2021 17:13
@nvb nvb requested review from stevendanna and removed request for a team August 4, 2021 17:13
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @stevendanna, and @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 299 at r2 (raw file):

Previously, sumeerbhola wrote…

If it timed out while doing the push, I don't understand the reason for doing another push with error wait. The probability of success would be low. Is there extra information we need, like the latest state of the pushee?

The idea is that on a timeout, we don't know whether the PUSH_TIMESTAMP/PUSH_ABORT actually got to the point of proving that the pushee was active and began waiting in its txnwait.Queue. The push may have timed out before this point. But just like with WaitPolicy_Error, we don't want to throw an error on abandoned locks. So on timeout, we issue a PUSH_TOUCH request (like we do for WaitPolicy_Error) that is not subject to the lock_timeout to check with certainty whether the conflict is active or not, but without blocking if it happens to be active.

If you buy all this, I'll add some more commentary to spell this out in the code.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 777 at r2 (raw file):

Previously, sumeerbhola wrote…

So you are worried about err being nil, or err not correctly specifying that it was due to the deadline being exceeded?

The latter. As it turns out, this is also what contextutil.RunWithTimeout does, presumable for the same reason.


pkg/kv/kvserver/concurrency/lock_table_waiter_test.go, line 477 at r1 (raw file):

Previously, sumeerbhola wrote…

can you add a comment like
// !lockHeld means a lock reservation, so is only possible when pusheeActive is true.

Done.

@nvb nvb force-pushed the nvanbenschoten/lockTimeout branch from e976aa8 to a0380c8 Compare August 4, 2021 19:11
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the delay
:lgtm:

Reviewed 18 of 18 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/kv/kvserver/concurrency/concurrency_manager_test.go, line 162 at r6 (raw file):

					// without a delay by the lockTableWaiter, ensuring that the lock
					// timeout logic deterministically fires.
					// See (*lockTableWaiterImpl).timeUntilDeadline.

Does this mean we are not testing the doWithTimeoutAndFallback code?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 299 at r2 (raw file):

actually got to the point of proving that the pushee was active and began waiting in its txnwait.Queue

And it may not because of slowness elsewhere like network or cpu overload?

If you buy all this, I'll add some more commentary to spell this out in the code.

sounds good


pkg/kv/kvserver/concurrency/lock_table_waiter_test.go, line 701 at r6 (raw file):

					h roachpb.Header,
					pushType roachpb.PushTxnType,
				) (*roachpb.Transaction, *Error) {

never mind my earlier question about testing of the fallback path.

nvb added 2 commits August 9, 2021 18:46
The test was not exercising the case where a pushee was still active
during the PUSH_TOUCH issued by requests with an Error wait policy.
Informs cockroachdb#67513.

This commit introduces a new `lock_timeout` field on the BatchRequest
Header struct.

lock_timeout specifies the maximum amount of time that the batch request
will wait while attempting to acquire a lock on a key or while blocking on
an existing lock in order to perform a non-locking read on a key. The time
limit applies separately to each lock acquisition attempt. If the timeout
elapses when waiting for a lock, a WriteIntentError will be returned.

Unlike in some other systems like PostgreSQL, where non-locking reads do
not wait on conflicting locks, in CockroachDB, non-locking reads do wait
for conflicting locks to be released. Because of this, the lock_timeout
configuration applies to non-locking reads in read-write and read-only
transactions as well.

Only the (default) Block wait policy will allow a request to wait on
conflicting locks, so the timeout only makes sense in conjunction with the
Block wait policy. The Error wait policy will throw an error immediately if
a conflicting lock held by an active transaction is encountered, so this
timeout can never be hit with an Error wait policy.

A value of zero disables the timeout.
@nvb nvb force-pushed the nvanbenschoten/lockTimeout branch from a0380c8 to 77405a5 Compare August 9, 2021 23:00
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the delay

Not a problem at all, you were away for a few weeks and this was a big PR. Thanks for getting through all of it!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @stevendanna, and @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 299 at r2 (raw file):

And it may not because of slowness elsewhere like network or cpu overload?

Exactly. I added some commentary along these lines.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 10, 2021

Build succeeded:

@craig craig bot merged commit 847514d into cockroachdb:master Aug 10, 2021
@nvb nvb deleted the nvanbenschoten/lockTimeout branch August 10, 2021 02:43
nvb added a commit to nvb/cockroach that referenced this pull request Aug 10, 2021
Fixes cockroachdb#67513.

In cockroachdb#68026, we added support for a new `lock_timeout` field on the
`BatchRequest.Header` struct.

A batch's lock_timeout specifies the maximum amount of time that it will
wait while attempting to acquire a lock on a key or while blocking on an
existing lock in order to perform a non-locking read on a key. The time
limit applies separately to each lock acquisition attempt. If the
timeout elapses when waiting for a lock, a WriteIntentError is returned.

This extension to the KV API allows us to properly support the
`lock_timeout` session variable in SQL, which matches these semantics
closely. When configured, this session variable aborts any statement
that waits longer than the specified amount of time while attempting to
acquire a single lock.

The only deviation from Postgres' behavior is that the lock_timeout
configuration applies to non-locking reads in read-write and read-only
transactions. This is because unlike in Postgres, where where
non-locking reads do not wait on conflicting locks, in CockroachDB,
non-locking reads do wait for conflicting locks to be released.

Release note (sql change): The lock_timeout session variable is now
supported. The configuration can be used to abort a query with an error
if it waits longer than the configured duration while blocking on a
single row-level lock acquisition.
craig bot pushed a commit that referenced this pull request Aug 10, 2021
68042: sql: support lock_timeout session variable r=nvanbenschoten a=nvanbenschoten

Fixes #67513.

Please ignore the first two commits, which are from #68026.

In #68026, we added support for a new `lock_timeout` field on the `BatchRequest.Header` struct. 

A batch's lock_timeout specifies the maximum amount of time that it will wait while attempting to acquire a lock on a key or while blocking on an existing lock in order to perform a non-locking read on a key. The time limit applies separately to each lock acquisition attempt. If the timeout elapses when waiting for a lock, a WriteIntentError is returned.

This extension to the KV API allows us to properly support the `lock_timeout` session variable in SQL, which matches these semantics closely. When configured, this session variable aborts any statement that waits longer than the specified amount of time while attempting to acquire a single lock.

The only deviation from Postgres' behavior is that the lock_timeout configuration applies to non-locking reads in read-write and read-only transactions. This is because unlike in Postgres, where where non-locking reads do not wait on conflicting locks, in CockroachDB, non-locking reads do wait for conflicting locks to be released.

Release note (sql change): The lock_timeout session variable is now supported. The configuration can be used to abort a query with an error if it waits longer than the configured duration while blocking on a single row-level lock acquisition.

@jordanlewis I'm adding you for the review, but feel free to re-assign.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this pull request Aug 10, 2021
68042: sql: support lock_timeout session variable r=nvanbenschoten a=nvanbenschoten

Fixes #67513.

Please ignore the first two commits, which are from #68026.

In #68026, we added support for a new `lock_timeout` field on the `BatchRequest.Header` struct. 

A batch's lock_timeout specifies the maximum amount of time that it will wait while attempting to acquire a lock on a key or while blocking on an existing lock in order to perform a non-locking read on a key. The time limit applies separately to each lock acquisition attempt. If the timeout elapses when waiting for a lock, a WriteIntentError is returned.

This extension to the KV API allows us to properly support the `lock_timeout` session variable in SQL, which matches these semantics closely. When configured, this session variable aborts any statement that waits longer than the specified amount of time while attempting to acquire a single lock.

The only deviation from Postgres' behavior is that the lock_timeout configuration applies to non-locking reads in read-write and read-only transactions. This is because unlike in Postgres, where where non-locking reads do not wait on conflicting locks, in CockroachDB, non-locking reads do wait for conflicting locks to be released.

Release note (sql change): The lock_timeout session variable is now supported. The configuration can be used to abort a query with an error if it waits longer than the configured duration while blocking on a single row-level lock acquisition.

@jordanlewis I'm adding you for the review, but feel free to re-assign.

68313: kv: avoid regressions in closed timestamp sidetransportAccess r=nvanbenschoten a=nvanbenschoten

This commit strengthens the guarantees of the `sidetransportAccess`, which now promises that the returned timestamp from its get method will never regress across sequential calls. This is made possible by retaining two closed timestamps in the access - a `cur` closed timestamp that is known to be applied and an optional `next` closed timestamp that is not yet applied. Using two timestamps ensures that we never lose information about an applied closed timestamp, so we can make the stronger guarantee to users of the abstraction that returned timestamp will always increase monotonically across consecutive calls.

This was causing flakiness in tests added in #68194. The property-based tests in that PR were asserting (in one form or another) that after a `QueryResolvedTimestamp` request returns a resolved timestamp, a subsequent read-only request at that timestamp will never block on replication (i.e. the closed timestamp) or on conflicting transactions when executed. These tests were very rarely finding this not to be the case. The reason for this was that it was possible for the `sidetransportAccess` to regress across two sequential calls.

cc. @andreimatei, but not assigning because you're away.

68553: util/log: start garbage collection routines for old log files r=knz a=rauchenstein

This fixes a regression.  The call the routine that deletes old log
files when max-group-size is configured was incorrectly removed in a
refactor, so that functionality was disabled.  This change reinstates
it.

Release note (bug fix): File logs respect max-group-size configuration.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Jay Rauchenstein <rauchenstein@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants