kv: async resolve unreplicated locks on other ranges after one-phase commit#98044
Conversation
bd0f7a1 to
966f81d
Compare
arulajmani
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 89 at r1 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t)
For this test and the one below, I know it's expected that we hit 1PC, but is there an easy way to assert that in the test?
pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 114 at r1 (raw file):
// split into two ranges. It then asserts that all intents and locks across the // table's ranges are resolved. func testAsyncIntentResolution(t *testing.T, fn func(*gosql.DB)) {
nit: Could we have a single TestAsyncIntentResolution function with individual subtests using t.Run instead? Then, this thing can just be a local function.
…commit Fixes cockroachdb#94400. This commit fixes async lock resolution on external ranges after a transaction exercises the one-phase commit fast-path. Before the change, the 1PC fast-path was failing to handle the case where a transaction had acquired locks on ranges other than the range with the transaction record. As a result, these locks were not asynchronously but eagerly cleaned up. Conflicting transactions that encountered these locks would then wait 50ms before pushing the lock holder, finding it to be committed, and removing the locks themselves. It's fairly rare for a transaction to acquire locks on multiple ranges but then still be able to perform a 1PC commit. Regardless, we should handle such cases. Release note (bug fix): Fixed a bug where transactions that performed a SELECT FOR UPDATE across multiple ranges but never performed writes could fail to eagerly clean up their locks after commit. Future transactions that encountered these abandoned locks could be delayed for 50ms before unlocking them.
966f81d to
86a5852
Compare
nvb
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 89 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
For this test and the one below, I know it's expected that we hit 1PC, but is there an easy way to assert that in the test?
Done. I was hopeful that I could use a query like SELECT (metrics->>'txn.commits1PC')::int FROM crdb_internal.kv_node_status from SQL, but these metrics are stale, which makes it difficult to assert that a specific operation had an effect. For now, I'm just reaching into the TxnCoordSenderFactory directly.
pkg/kv/kvserver/intentresolver/intent_resolver_integration_test.go line 114 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: Could we have a single
TestAsyncIntentResolutionfunction with individual subtests usingt.Runinstead? Then, this thing can just be a local function.
Yeah, there were a few possible approaches here that could have worked. I'm avoiding the local function to avoid the majority of this code being indented.
|
bors r+ |
|
This PR was included in a batch that was canceled, it will be automatically retried |
|
Build failed (retrying...): |
|
Build failed (retrying...): |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 86a5852 to blathers/backport-release-22.1-98044: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. error creating merge commit from 86a5852 to blathers/backport-release-22.2-98044: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Fixes cockroachdb#98571. This commit fixes the regression detected in cockroachdb#98571. In that issue, we saw that the bug fix in 86a5852 (cockroachdb#98044) caused a regression in kv0 (and other benchmarks). This was due to a bug in `kvserverbase.IntersectSpan`, which was considering local point spans to be external to the provided range span. This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock spans. The commit also makes the utility panic instead of silently returning incorrect results. There's an existing TODO on the utility to generalize it. For now, we just make it harder to misuse. Finally, we add a test that asserts against the presence of async intent resolution after one-phase commits when external intent resolution is not needed. ``` name old time/op new time/op delta KV/Insert/Native/rows=1-10 61.2µs ± 3% 48.9µs ± 3% -20.10% (p=0.000 n=8+9) KV/Insert/Native/rows=10-10 93.3µs ±15% 76.2µs ± 3% -18.34% (p=0.000 n=9+9) KV/Insert/Native/rows=1000-10 2.84ms ±12% 2.42ms ± 4% -14.97% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 365µs ± 5% 320µs ± 8% -12.40% (p=0.000 n=10+9) KV/Insert/Native/rows=10000-10 27.6ms ± 6% 24.4ms ± 3% -11.53% (p=0.000 n=9+9) name old alloc/op new alloc/op delta KV/Insert/Native/rows=1000-10 4.66MB ± 1% 2.76MB ± 1% -40.89% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 478kB ± 1% 287kB ± 1% -39.90% (p=0.000 n=10+10) KV/Insert/Native/rows=10000-10 54.2MB ± 2% 34.3MB ± 3% -36.73% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 64.2kB ± 1% 42.1kB ± 1% -34.39% (p=0.000 n=10+9) KV/Insert/Native/rows=1-10 22.1kB ± 1% 17.3kB ± 1% -21.56% (p=0.000 n=9+10) name old allocs/op new allocs/op delta KV/Insert/Native/rows=1000-10 21.5k ± 0% 14.7k ± 0% -31.70% (p=0.000 n=8+9) KV/Insert/Native/rows=10000-10 212k ± 0% 146k ± 0% -31.31% (p=0.000 n=9+10) KV/Insert/Native/rows=100-10 2.34k ± 1% 1.61k ± 0% -31.31% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 392 ± 1% 276 ± 0% -29.59% (p=0.000 n=8+8) KV/Insert/Native/rows=1-10 173 ± 1% 123 ± 0% -29.04% (p=0.000 n=9+8) ``` Release note: None
Fixes cockroachdb#98571. This commit fixes the regression detected in cockroachdb#98571. In that issue, we saw that the bug fix in 86a5852 (cockroachdb#98044) caused a regression in kv0 (and other benchmarks). This was due to a bug in `kvserverbase.IntersectSpan`, which was considering local point spans to be external to the provided range span. This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock spans. The commit also makes the utility panic instead of silently returning incorrect results. There's an existing TODO on the utility to generalize it. For now, we just make it harder to misuse. Finally, we add a test that asserts against the presence of async intent resolution after one-phase commits when external intent resolution is not needed. ``` name old time/op new time/op delta KV/Insert/Native/rows=1-10 61.2µs ± 3% 48.9µs ± 3% -20.10% (p=0.000 n=8+9) KV/Insert/Native/rows=10-10 93.3µs ±15% 76.2µs ± 3% -18.34% (p=0.000 n=9+9) KV/Insert/Native/rows=1000-10 2.84ms ±12% 2.42ms ± 4% -14.97% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 365µs ± 5% 320µs ± 8% -12.40% (p=0.000 n=10+9) KV/Insert/Native/rows=10000-10 27.6ms ± 6% 24.4ms ± 3% -11.53% (p=0.000 n=9+9) name old alloc/op new alloc/op delta KV/Insert/Native/rows=1000-10 4.66MB ± 1% 2.76MB ± 1% -40.89% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 478kB ± 1% 287kB ± 1% -39.90% (p=0.000 n=10+10) KV/Insert/Native/rows=10000-10 54.2MB ± 2% 34.3MB ± 3% -36.73% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 64.2kB ± 1% 42.1kB ± 1% -34.39% (p=0.000 n=10+9) KV/Insert/Native/rows=1-10 22.1kB ± 1% 17.3kB ± 1% -21.56% (p=0.000 n=9+10) name old allocs/op new allocs/op delta KV/Insert/Native/rows=1000-10 21.5k ± 0% 14.7k ± 0% -31.70% (p=0.000 n=8+9) KV/Insert/Native/rows=10000-10 212k ± 0% 146k ± 0% -31.31% (p=0.000 n=9+10) KV/Insert/Native/rows=100-10 2.34k ± 1% 1.61k ± 0% -31.31% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 392 ± 1% 276 ± 0% -29.59% (p=0.000 n=8+8) KV/Insert/Native/rows=1-10 173 ± 1% 123 ± 0% -29.04% (p=0.000 n=9+8) ``` Release note: None
98544: colmem: allow Allocator max batch size to be customized r=cucaroach a=cucaroach Previously this was hardcoded to coldata.BatchSize or 1024, now it can be increased or decreased. Epic: CRDB-18892 Informs: #91831 Release note: None 98630: kv: don't perform async intent resolution on 1PC with point lock spans r=arulajmani a=nvanbenschoten Fixes #98571. This commit fixes the regression detected in #98571. In that issue, we saw that the bug fix in 86a5852 (#98044) caused a regression in kv0 (and other benchmarks). This was due to a bug in `kvserverbase.IntersectSpan`, which was considering local point spans to be external to the provided range span. This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock spans. The commit also makes the utility panic instead of silently returning incorrect results. There's an existing TODO on the utility to generalize it. For now, we just make it harder to misuse. Finally, we add a test that asserts against the presence of async intent resolution after one-phase commits when external intent resolution is not needed. ``` name old time/op new time/op delta KV/Insert/Native/rows=1-10 61.2µs ± 3% 48.9µs ± 3% -20.10% (p=0.000 n=8+9) KV/Insert/Native/rows=10-10 93.3µs ±15% 76.2µs ± 3% -18.34% (p=0.000 n=9+9) KV/Insert/Native/rows=1000-10 2.84ms ±12% 2.42ms ± 4% -14.97% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 365µs ± 5% 320µs ± 8% -12.40% (p=0.000 n=10+9) KV/Insert/Native/rows=10000-10 27.6ms ± 6% 24.4ms ± 3% -11.53% (p=0.000 n=9+9) name old alloc/op new alloc/op delta KV/Insert/Native/rows=1000-10 4.66MB ± 1% 2.76MB ± 1% -40.89% (p=0.000 n=9+9) KV/Insert/Native/rows=100-10 478kB ± 1% 287kB ± 1% -39.90% (p=0.000 n=10+10) KV/Insert/Native/rows=10000-10 54.2MB ± 2% 34.3MB ± 3% -36.73% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 64.2kB ± 1% 42.1kB ± 1% -34.39% (p=0.000 n=10+9) KV/Insert/Native/rows=1-10 22.1kB ± 1% 17.3kB ± 1% -21.56% (p=0.000 n=9+10) name old allocs/op new allocs/op delta KV/Insert/Native/rows=1000-10 21.5k ± 0% 14.7k ± 0% -31.70% (p=0.000 n=8+9) KV/Insert/Native/rows=10000-10 212k ± 0% 146k ± 0% -31.31% (p=0.000 n=9+10) KV/Insert/Native/rows=100-10 2.34k ± 1% 1.61k ± 0% -31.31% (p=0.000 n=10+10) KV/Insert/Native/rows=10-10 392 ± 1% 276 ± 0% -29.59% (p=0.000 n=8+8) KV/Insert/Native/rows=1-10 173 ± 1% 123 ± 0% -29.04% (p=0.000 n=9+8) ``` Release note: None Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Fixes cockroachdb#98404. The test had begun flaking after cockroachdb#98044 because we now perform more async intent resolution operations when starting a cluster. Specifically, we perform additional async intent resolution operations in service of jobs updates. These updates perform SELECT FOR UPDATE queries over the new `system.job_info` table, but then perform a 1-phase commit. To deflake the test, we clear the intent resolution metrics after server startup. Release note: None
100731: c2c: clean up ReplicationFeed error handling r=lidorcarmel a=msbutler Previously, the replicationFeed test helper had methods that would swallow errors, making it impossible to debug certain test failures. This patch cleans up the internals of this test helper and prevents error swallowing. Fixes #100414 Release note: None 101860: util/parquet: add support for arrays r=miretskiy a=jayshrivastava This change extends and refactors the util/parquet library to be able to read and write arrays. Release note: None Informs: #99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 101936: kv: deflake and unskip TestStoreResolveMetrics r=arulajmani a=nvanbenschoten Fixes #98404. The test had begun flaking after #98044 because we now perform more async intent resolution operations when starting a cluster. Specifically, we perform additional async intent resolution operations in service of jobs updates. These updates perform SELECT FOR UPDATE queries over the new `system.job_info` table, but then perform a 1-phase commit. To deflake the test, we clear the intent resolution metrics after server startup. Release note: None Co-authored-by: Michael Butler <butler@cockroachlabs.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Fixes #94400.
This commit fixes async lock resolution on external ranges after a transaction exercises the one-phase commit fast-path. Before the change, the 1PC fast-path was failing to handle the case where a transaction had acquired locks on ranges other than the range with the transaction record. As a result, these locks were not asynchronously but eagerly cleaned up. Conflicting transactions that encountered these locks would then wait 50ms before pushing the lock holder, finding it to be committed, and removing the locks themselves.
It's fairly rare for a transaction to acquire locks on multiple ranges but then still be able to perform a 1PC commit. Regardless, we should handle such cases.
Release note (bug fix): Fixed a bug where transactions that performed a SELECT FOR UPDATE across multiple ranges but never performed writes could fail to eagerly clean up their locks after commit. Future transactions that encountered these abandoned locks could be delayed for 50ms before unlocking them.
@KaiSun314 I'm adding you as an optional reviewer here because this PR extends some of the new testing that you added for intent resolution.