Skip to content

kv: async resolve unreplicated locks on other ranges after one-phase commit#98044

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/sfuExternal
Mar 9, 2023
Merged

kv: async resolve unreplicated locks on other ranges after one-phase commit#98044
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/sfuExternal

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Mar 6, 2023

Fixes #94400.

This commit fixes async lock resolution on external ranges after a transaction exercises the one-phase commit fast-path. Before the change, the 1PC fast-path was failing to handle the case where a transaction had acquired locks on ranges other than the range with the transaction record. As a result, these locks were not asynchronously but eagerly cleaned up. Conflicting transactions that encountered these locks would then wait 50ms before pushing the lock holder, finding it to be committed, and removing the locks themselves.

It's fairly rare for a transaction to acquire locks on multiple ranges but then still be able to perform a 1PC commit. Regardless, we should handle such cases.

Release note (bug fix): Fixed a bug where transactions that performed a SELECT FOR UPDATE across multiple ranges but never performed writes could fail to eagerly clean up their locks after commit. Future transactions that encountered these abandoned locks could be delayed for 50ms before unlocking them.

@KaiSun314 I'm adding you as an optional reviewer here because this PR extends some of the new testing that you added for intent resolution.

@nvb nvb requested a review from arulajmani March 6, 2023 06:26
@nvb nvb requested a review from a team as a code owner March 6, 2023 06:26
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/sfuExternal branch from bd0f7a1 to 966f81d Compare March 6, 2023 15:54
Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 89 at r1 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

For this test and the one below, I know it's expected that we hit 1PC, but is there an easy way to assert that in the test?


pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 114 at r1 (raw file):

// split into two ranges. It then asserts that all intents and locks across the
// table's ranges are resolved.
func testAsyncIntentResolution(t *testing.T, fn func(*gosql.DB)) {

nit: Could we have a single TestAsyncIntentResolution function with individual subtests using t.Run instead? Then, this thing can just be a local function.

…commit

Fixes cockroachdb#94400.

This commit fixes async lock resolution on external ranges after a transaction
exercises the one-phase commit fast-path. Before the change, the 1PC fast-path
was failing to handle the case where a transaction had acquired locks on ranges
other than the range with the transaction record. As a result, these locks were
not asynchronously but eagerly cleaned up. Conflicting transactions that
encountered these locks would then wait 50ms before pushing the lock holder,
finding it to be committed, and removing the locks themselves.

It's fairly rare for a transaction to acquire locks on multiple ranges but then
still be able to perform a 1PC commit. Regardless, we should handle such cases.

Release note (bug fix): Fixed a bug where transactions that performed a SELECT
FOR UPDATE across multiple ranges but never performed writes could fail to
eagerly clean up their locks after commit. Future transactions that encountered
these abandoned locks could be delayed for 50ms before unlocking them.
@nvb nvb force-pushed the nvanbenschoten/sfuExternal branch from 966f81d to 86a5852 Compare March 9, 2023 17:24
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 89 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

For this test and the one below, I know it's expected that we hit 1PC, but is there an easy way to assert that in the test?

Done. I was hopeful that I could use a query like SELECT (metrics->>'txn.commits1PC')::int FROM crdb_internal.kv_node_status from SQL, but these metrics are stale, which makes it difficult to assert that a specific operation had an effect. For now, I'm just reaching into the TxnCoordSenderFactory directly.


pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 114 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: Could we have a single TestAsyncIntentResolution function with individual subtests using t.Run instead? Then, this thing can just be a local function.

Yeah, there were a few possible approaches here that could have worked. I'm avoiding the local function to avoid the majority of this code being indented.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Mar 9, 2023

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 9, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 9, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 9, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 9, 2023

Build succeeded:

@craig craig bot merged commit eb0ba11 into cockroachdb:master Mar 9, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 9, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 86a5852 to blathers/backport-release-22.1-98044: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 86a5852 to blathers/backport-release-22.2-98044: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

nvb added a commit to nvb/cockroach that referenced this pull request Mar 14, 2023
Fixes cockroachdb#98571.

This commit fixes the regression detected in cockroachdb#98571. In that issue, we saw that
the bug fix in 86a5852 (cockroachdb#98044) caused a regression in kv0 (and other benchmarks).
This was due to a bug in `kvserverbase.IntersectSpan`, which was considering
local point spans to be external to the provided range span.

This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock
spans.

The commit also makes the utility panic instead of silently returning incorrect results.
There's an existing TODO on the utility to generalize it. For now, we just make it harder
to misuse.

Finally, we add a test that asserts against the presence of async intent resolution after
one-phase commits when external intent resolution is not needed.

```
name                            old time/op    new time/op    delta
KV/Insert/Native/rows=1-10        61.2µs ± 3%    48.9µs ± 3%  -20.10%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10-10       93.3µs ±15%    76.2µs ± 3%  -18.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-10     2.84ms ±12%    2.42ms ± 4%  -14.97%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       365µs ± 5%     320µs ± 8%  -12.40%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-10    27.6ms ± 6%    24.4ms ± 3%  -11.53%  (p=0.000 n=9+9)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1000-10     4.66MB ± 1%    2.76MB ± 1%  -40.89%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       478kB ± 1%     287kB ± 1%  -39.90%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-10    54.2MB ± 2%    34.3MB ± 3%  -36.73%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10       64.2kB ± 1%    42.1kB ± 1%  -34.39%  (p=0.000 n=10+9)
KV/Insert/Native/rows=1-10        22.1kB ± 1%    17.3kB ± 1%  -21.56%  (p=0.000 n=9+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1000-10      21.5k ± 0%     14.7k ± 0%  -31.70%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10000-10      212k ± 0%      146k ± 0%  -31.31%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-10       2.34k ± 1%     1.61k ± 0%  -31.31%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10          392 ± 1%       276 ± 0%  -29.59%  (p=0.000 n=8+8)
KV/Insert/Native/rows=1-10           173 ± 1%       123 ± 0%  -29.04%  (p=0.000 n=9+8)
```

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Mar 15, 2023
Fixes cockroachdb#98571.

This commit fixes the regression detected in cockroachdb#98571. In that issue, we saw that
the bug fix in 86a5852 (cockroachdb#98044) caused a regression in kv0 (and other benchmarks).
This was due to a bug in `kvserverbase.IntersectSpan`, which was considering
local point spans to be external to the provided range span.

This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock
spans.

The commit also makes the utility panic instead of silently returning incorrect results.
There's an existing TODO on the utility to generalize it. For now, we just make it harder
to misuse.

Finally, we add a test that asserts against the presence of async intent resolution after
one-phase commits when external intent resolution is not needed.

```
name                            old time/op    new time/op    delta
KV/Insert/Native/rows=1-10        61.2µs ± 3%    48.9µs ± 3%  -20.10%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10-10       93.3µs ±15%    76.2µs ± 3%  -18.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-10     2.84ms ±12%    2.42ms ± 4%  -14.97%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       365µs ± 5%     320µs ± 8%  -12.40%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-10    27.6ms ± 6%    24.4ms ± 3%  -11.53%  (p=0.000 n=9+9)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1000-10     4.66MB ± 1%    2.76MB ± 1%  -40.89%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       478kB ± 1%     287kB ± 1%  -39.90%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-10    54.2MB ± 2%    34.3MB ± 3%  -36.73%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10       64.2kB ± 1%    42.1kB ± 1%  -34.39%  (p=0.000 n=10+9)
KV/Insert/Native/rows=1-10        22.1kB ± 1%    17.3kB ± 1%  -21.56%  (p=0.000 n=9+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1000-10      21.5k ± 0%     14.7k ± 0%  -31.70%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10000-10      212k ± 0%      146k ± 0%  -31.31%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-10       2.34k ± 1%     1.61k ± 0%  -31.31%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10          392 ± 1%       276 ± 0%  -29.59%  (p=0.000 n=8+8)
KV/Insert/Native/rows=1-10           173 ± 1%       123 ± 0%  -29.04%  (p=0.000 n=9+8)
```

Release note: None
craig bot pushed a commit that referenced this pull request Mar 15, 2023
98544: colmem: allow Allocator max batch size to be customized r=cucaroach a=cucaroach

Previously this was hardcoded to coldata.BatchSize or 1024, now it
can be increased or decreased.

Epic: CRDB-18892
Informs: #91831
Release note: None


98630: kv: don't perform async intent resolution on 1PC with point lock spans r=arulajmani a=nvanbenschoten

Fixes #98571.

This commit fixes the regression detected in #98571. In that issue, we saw that the bug fix in 86a5852 (#98044) caused a regression in kv0 (and other benchmarks). This was due to a bug in `kvserverbase.IntersectSpan`, which was considering local point spans to be external to the provided range span.

This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock spans.

The commit also makes the utility panic instead of silently returning incorrect results. There's an existing TODO on the utility to generalize it. For now, we just make it harder to misuse.

Finally, we add a test that asserts against the presence of async intent resolution after one-phase commits when external intent resolution is not needed.

```
name                            old time/op    new time/op    delta
KV/Insert/Native/rows=1-10        61.2µs ± 3%    48.9µs ± 3%  -20.10%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10-10       93.3µs ±15%    76.2µs ± 3%  -18.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-10     2.84ms ±12%    2.42ms ± 4%  -14.97%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       365µs ± 5%     320µs ± 8%  -12.40%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-10    27.6ms ± 6%    24.4ms ± 3%  -11.53%  (p=0.000 n=9+9)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1000-10     4.66MB ± 1%    2.76MB ± 1%  -40.89%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       478kB ± 1%     287kB ± 1%  -39.90%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-10    54.2MB ± 2%    34.3MB ± 3%  -36.73%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10       64.2kB ± 1%    42.1kB ± 1%  -34.39%  (p=0.000 n=10+9)
KV/Insert/Native/rows=1-10        22.1kB ± 1%    17.3kB ± 1%  -21.56%  (p=0.000 n=9+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1000-10      21.5k ± 0%     14.7k ± 0%  -31.70%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10000-10      212k ± 0%      146k ± 0%  -31.31%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-10       2.34k ± 1%     1.61k ± 0%  -31.31%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10          392 ± 1%       276 ± 0%  -29.59%  (p=0.000 n=8+8)
KV/Insert/Native/rows=1-10           173 ± 1%       123 ± 0%  -29.04%  (p=0.000 n=9+8)
```

Release note: None

Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvb nvb deleted the nvanbenschoten/sfuExternal branch April 20, 2023 17:43
nvb added a commit to nvb/cockroach that referenced this pull request Apr 20, 2023
Fixes cockroachdb#98404.

The test had begun flaking after cockroachdb#98044 because we now perform more async
intent resolution operations when starting a cluster. Specifically, we
perform additional async intent resolution operations in service of jobs
updates. These updates perform SELECT FOR UPDATE queries over the new
`system.job_info` table, but then perform a 1-phase commit.

To deflake the test, we clear the intent resolution metrics after server
startup.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 20, 2023
100731: c2c: clean up ReplicationFeed error handling r=lidorcarmel a=msbutler

Previously, the replicationFeed test helper had methods that would swallow errors, making it impossible to debug certain test failures. This patch cleans up the internals of this test helper and prevents error swallowing.

Fixes #100414

Release note: None

101860: util/parquet: add support for arrays r=miretskiy a=jayshrivastava

This change extends and refactors the util/parquet library to be able to read and write arrays.

Release note: None

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071

101936: kv: deflake and unskip TestStoreResolveMetrics r=arulajmani a=nvanbenschoten

Fixes #98404.

The test had begun flaking after #98044 because we now perform more async intent resolution operations when starting a cluster. Specifically, we perform additional async intent resolution operations in service of jobs updates. These updates perform SELECT FOR UPDATE queries over the new `system.job_info` table, but then perform a 1-phase commit.

To deflake the test, we clear the intent resolution metrics after server startup.

Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql, kv: SELECT FOR UPDATE appears to delay the lock cleanup on RootTxn in some cases

3 participants