
kvserver: QueryResolvedTimestamp should look for intents in LockTable #70852

Merged
craig[bot] merged 1 commit into cockroachdb:master from shralex:shralex_efficient_QueryResolvedTimestamp3
Oct 3, 2021

Conversation

@shralex
Contributor

@shralex shralex commented Sep 29, 2021

Previously, a request called “QueryResolvedTimestamp” that is used to determine the “resolved timestamp” of a key range on a given replica cost O(num_keys_in_span), because it needed to find all intents in the span and intents were interleaved with key-value versions. A multi-release project to separate out intents from MVCC data is now far enough along that we can make this request more efficient by scanning only the lock-table keyspace, reducing its cost to O(num_locks_in_span).

Release note: None

Release justification: This is going into 22.1 where the migration to separated intents is complete

Fixes: #69717
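The complexity claim above can be illustrated with a small sketch (illustrative Go only, not the actual CockroachDB code or types): once intents are mirrored into a dedicated lock-table keyspace, computing the minimum intent timestamp for a span only visits lock entries, so the cost scales with the number of locks in the span rather than the number of keys.

```go
package main

import (
	"fmt"
	"sort"
)

// minIntentTimestamp scans only the lock table for a span. The map
// stands in for the lock-table keyspace; in the real engine this is
// a dedicated key range, not a Go map.
func minIntentTimestamp(lockTable map[string]int64, start, end string) (int64, bool) {
	keys := make([]string, 0, len(lockTable))
	for k := range lockTable {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var min int64
	found := false
	for _, k := range keys {
		if k < start || k >= end {
			continue // lock outside the requested span
		}
		if ts := lockTable[k]; !found || ts < min {
			min, found = ts, true
		}
	}
	return min, found
}

func main() {
	// Three locks in the store; "z" falls outside the span [a, e).
	locks := map[string]int64{"b": 30, "d": 10, "z": 5}
	ts, ok := minIntentTimestamp(locks, "a", "e")
	fmt.Println(ts, ok) // 10 true
}
```

The work done is proportional to the lock-table entries visited, regardless of how many MVCC key versions the span contains.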

@shralex shralex requested review from a team as code owners September 29, 2021 00:39
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):

	// because QueryResolvedTimestamp requests are often run without acquiring
	// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
	// so latches won't help them to synchronize with writes.

unrelated to this PR, I'm worried by this comment @nvanbenschoten

  • We have discussed that we would like a Reader that returns true for ConsistentIterators to contain a version of the engine from when the Reader was created. I think that would not be acceptable in this latchless situation where we want to see the engine state after the closed timestamp has been retrieved. Is my understanding correct?
  • Even with the current lazy creation of the first cloneable iterator, we could have multiple requests in a batch, and a request in the batch that preceded this QueryResolvedTimestampRequest could have created that iterator that we will subsequently clone. So the iterator we create in createMinIntentTimestamp could reflect the engine state prior to this closed timestamp. Is there something preventing this?

pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 111 at r1 (raw file):

	// version for each key. We're only looking for MVCCMetadata versions, which
	// will always be the first version of a key if it exists, so its fine that
	// we skip over all other versions of keys.

This comment is stale.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 136 at r1 (raw file):

		// If this is an intent, account for it.
		if meta.Txn != nil {

the case of meta.Txn == nil should now be an error.
We have two uses of MVCCMetadata that used to both be written to timestamp 0:

  • interleaved intents, which are now moved to the lock table. They always have a non-nil txn.
  • inline meta, for storing timeseries (these are not MVCC). These are still at timestamp 0 (not in the lock table), and have a nil txn.
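The invariant described in the two bullets above can be sketched as a validation step (the types below are simplified stand-ins for the real protos, not the actual CockroachDB definitions): anything decoded from the lock-table keyspace must carry a transaction, because inline metadata never lands there.

```go
package main

import (
	"errors"
	"fmt"
)

// TxnMeta and MVCCMetadata are minimal stand-ins for the real protos.
type TxnMeta struct{ ID string }

type MVCCMetadata struct {
	Txn *TxnMeta // set for intents, nil for inline (non-MVCC) values
}

// checkLockTableMeta validates metadata decoded from a lock-table key.
// A nil Txn there indicates corruption: inline metadata stays at
// timestamp 0 in the MVCC keyspace and never enters the lock table.
func checkLockTableMeta(meta MVCCMetadata) error {
	if meta.Txn == nil {
		return errors.New("nil transaction in lock table")
	}
	return nil
}

func main() {
	fmt.Println(checkLockTableMeta(MVCCMetadata{Txn: &TxnMeta{ID: "t1"}}))
	fmt.Println(checkLockTableMeta(MVCCMetadata{}))
}
```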

pkg/storage/in_mem.go, line 47 at r1 (raw file):

// NewDefaultInMemForTesting allocates and returns a new, opened in-memory engine with
// the default configuration. The caller must call the engine's Close method
// when the engine is no longer needed.

Can you add to the comment here that this randomizes whether separated intents are written.


pkg/storage/in_mem.go, line 58 at r1 (raw file):

// NewDefaultInMemForStickyEngineTesting is just like NewDefaultInMemForTesting but uses
// ForStickyEngineTesting to always separate intents from MVCC data, instead of randomizing this setting.
func NewDefaultInMemForStickyEngineTesting(opts ...ConfigOption) Engine {

Let's call this something like NewInMemForTesting(enableSeparatedIntents bool, opts ...ConfigOption) and not use the term sticky-engine here since this is not specifically to do with sticky engines, and neither is the test that calls it.
In hindsight, I should not have introduced ForStickyEngineTesting given that we already have func SetSeparatedIntents(disable bool) ConfigOption. So the implementation here should avoid using ForStickyEngineTesting.

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):

Previously, sumeerbhola wrote…

unrelated to this PR, I'm worried by this comment @nvanbenschoten

  • We have discussed that we would like a Reader that returns true for ConsistentIterators to contain a version of the engine from when the Reader was created. I think that would not be acceptable in this latchless situation where we want to see the engine state after the closed timestamp has been retrieved. Is my understanding correct?
  • Even with the current lazy creation of the first cloneable iterator, we could have multiple requests in a batch, and a request in the batch that preceded this QueryResolvedTimestampRequest could have created that iterator that we will subsequently clone. So the iterator we create in createMinIntentTimestamp could reflect the engine state prior to this closed timestamp. Is there something preventing this?

I'll wait for @nvanbenschoten 's input for this.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 111 at r1 (raw file):

Previously, sumeerbhola wrote…

This comment is stale.

Done.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 136 at r1 (raw file):

Previously, sumeerbhola wrote…

the case of meta.Txn == nil should now be an error.
We have two uses of MVCCMetadata that used to both be written to timestamp 0:

  • interleaved intents, which are now moved to the lock table. They always have a non-nil txn.
  • inline meta, for storing timeseries (these are not MVCC). These are still at timestamp 0 (not in the lock table), and have a nil txn.

Done.


pkg/storage/in_mem.go, line 47 at r1 (raw file):

Previously, sumeerbhola wrote…

Can you add to the comment here that this randomizes whether separated intents are written.

Done.


pkg/storage/in_mem.go, line 58 at r1 (raw file):

Previously, sumeerbhola wrote…

Let's call this something like NewInMemForTesting(enableSeparatedIntents bool, opts ...ConfigOption) and not use the term sticky-engine here since this is not specifically to do with sticky engines, and neither is the test that calls it.
In hindsight, I should not have introduced ForStickyEngineTesting given that we already have func SetSeparatedIntents(disable bool) ConfigOption. So the implementation here should avoid using ForStickyEngineTesting.

Done.

Collaborator

@sumeerbhola sumeerbhola left a comment

looks good. Couple of minor things.

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 50 at r2 (raw file):

	}

	// Setup:

can you add a comment here that with separated intents the actual key layout in the store is not what is listed below.

And if you are feeling up to it, you could add a test which manually writes a MVCCMetadata that is missing the Txn field to a lock table key, to test the error path you added. The key can be constructed by first constructing a LockTableKey then calling ToEngineKey and then putting by doing db.PutEngineKey.


pkg/storage/in_mem.go, line 57 at r2 (raw file):

// NewInMemForTesting is just like NewDefaultInMemForTesting but allows to deterministically define whether it
// separates intents from MVCC data.

We wrap comments to 80 columns and code to 100 columns, iirc. This comment line looks longer than that.
In case you are using goland, there is a goland plugin to do the comment wrapping, mentioned in our internal wiki here https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#GolandTipsandTricks-SettingupWrapToColumn

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 50 at r2 (raw file):

Previously, sumeerbhola wrote…

can you add a comment here that with separated intents the actual key layout in the store is not what is listed below.

And if you are feeling up to it, you could add a test which manually writes a MVCCMetadata that is missing the Txn field to a lock table key, to test the error path you added. The key can be constructed by first constructing a LockTableKey then calling ToEngineKey and then putting by doing db.PutEngineKey.

Done.


pkg/storage/in_mem.go, line 57 at r2 (raw file):

Previously, sumeerbhola wrote…

We wrap comments to 80 columns and code to 100 columns, iirc. This comment line looks longer than that.
In case you are using goland, there is a goland plugin to do the comment wrapping, mentioned in our internal wiki here https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#GolandTipsandTricks-SettingupWrapToColumn

Done.

@shralex shralex force-pushed the shralex_efficient_QueryResolvedTimestamp3 branch from dc5b14a to 3d81a6f Compare September 30, 2021 19:48
Collaborator

@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

Contributor

@nvb nvb left a comment

Nice job on this.

Reviewed 1 of 3 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @shralex and @sumeerbhola)


-- commits, line 12 at r4:
Before merging this PR, we'll want to squash these three commits down into one. See https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1411744698/Decomposing+changes+and+updating+PRs+during+reviews.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):

Previously, shralex wrote…

I'll wait for @nvanbenschoten 's input for this.

This is a very good catch. You are correct that this is improperly relying on the Reader's LSM snapshot being lazily captured at the time of the iterator creation. This will be completely broken when we begin capturing Reader snapshots eagerly and is broken today in cases where we have multiple QueryResolvedTimestampRequests in the same batch.

What's even more interesting is that this isn't actually causing issues today, for a very subtle reason. We do issue multiple QueryResolvedTimestampRequest in the same batch to the same range here, so we should hit the second problem. This means that for all but the first request in the batch, we could observe a closed timestamp that was carried in a later log entry than the last entry applied to the observed version of the state machine in computeMinIntentTimestamp. However, because of the way that these QueryResolvedTimestamp responses are combined, we happen to avoid any problems. Each response carries min(closed_ts, min_intent_ts), and we combine the responses by taking the min of all responses (min(min(closed_ts_1, min_intent_ts_1), min(closed_ts_2, min_intent_ts_2), ...)). So because the closed timestamp increases monotonically, we effectively ignore all closed timestamps except the first one and we end up with the correct resolved timestamp.

But regardless, we should fix this. My thinking about how to do that in a way that composes with #55461 is a little hazy. We've discussed the need to snapshot additional replica state for read-only operations that may evaluate after dropping latches. We could capture the pre-snapshot closed timestamp in this bundle to ensure that it is properly sequenced with the engine snapshot. But the closed timestamp isn't negligibly cheap to grab, so we probably wouldn't want to grab it if none of the requests needed it.

Since this isn't causing issues, I'm tempted to say that we should try to solve this as part of #55461, but if you have thoughts, I'd be happy to dive deeper into this now. I opened #70974 to continue the discussion.
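The combination argument above can be checked numerically (an illustrative sketch, not the actual response-merging code): because the closed timestamps increase monotonically across requests in the batch, `min(min(closed_ts_1, min_intent_ts_1), ..., min(closed_ts_n, min_intent_ts_n))` collapses to `min(closed_ts_1, min_intent_ts_1, ..., min_intent_ts_n)`, so the later closed timestamps never affect the result.

```go
package main

import "fmt"

// combine merges per-request QueryResolvedTimestamp responses the way
// the text describes: each response carries min(closed_ts,
// min_intent_ts), and responses are merged by taking the overall min.
func combine(closed, minIntent []int64) int64 {
	res := int64(1) << 62
	for i := range closed {
		per := closed[i]
		if minIntent[i] < per {
			per = minIntent[i]
		}
		if per < res {
			res = per
		}
	}
	return res
}

func main() {
	// Closed timestamps increase monotonically across the batch, so
	// only the first one can ever win the min; the later, larger
	// values (12, 15) are effectively ignored.
	closed := []int64{10, 12, 15}
	minIntent := []int64{20, 11, 30}
	fmt.Println(combine(closed, minIntent)) // 10
}
```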


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 124 at r4 (raw file):

		lockedKey, err := keys.DecodeLockTableSingleKey(engineKey.Key)
		if err != nil {
			continue

@sumeerbhola can you think of any valid reasons to hit this branch? Should we be wrapping and returning this error instead?


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 131 at r4 (raw file):

		}
		if meta.Txn == nil {
			return hlc.Timestamp{}, nil, errors.Errorf("nil transaction in LockTable. Key: %v, mvcc meta: %v",

s/Errorf/AssertionFailedf/


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 216 at r4 (raw file):

}

func TestSeparatedIntentErrors(t *testing.T) {

nit: rename this to TestQueryResolvedTimestampErrors or something like that. This test is part of the pkg/kv/kvserver/batcheval package, so it should be sufficiently qualified.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 257 at r4 (raw file):

		_, err := QueryResolvedTimestamp(ctx, db, cArgs, &resp)
		require.Error(t, err)
		require.Regexp(t, regexp.MustCompile(`unmarshaling mvcc meta`), err.Error())

I don't think you need the QueryResolvedTimestamp. You should be able to pass a string directly.


pkg/storage/in_mem.go, line 47 at r1 (raw file):

Previously, shralex wrote…

Done.

nit: Wrap to 80 cols here as well.


pkg/storage/in_mem.go, line 58 at r1 (raw file):

Previously, shralex wrote…

Done.

Do we need this function? Couldn't a caller that wants a deterministically defined value for DisableSeparatedIntents pass a SetSeparatedIntents option to NewDefaultInMemForTesting, which will overwrite whatever value is set by ForTesting?

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a very good catch. You are correct that this is improperly relying on the Reader's LSM snapshot being lazily captured at the time of the iterator creation. This will be completely broken when we begin capturing Reader snapshots eagerly and is broken today in cases where we have multiple QueryResolvedTimestampRequests in the same batch.

What's even more interesting is that this isn't actually causing issues today, for a very subtle reason. We do issue multiple QueryResolvedTimestampRequest in the same batch to the same range here, so we should hit the second problem. This means that for all but the first request in the batch, we could observe a closed timestamp that was carried in a later log entry than the last entry applied to the observed version of the state machine in computeMinIntentTimestamp. However, because of the way that these QueryResolvedTimestamp responses are combined, we happen to avoid any problems. Each response carries min(closed_ts, min_intent_ts), and we combine the responses by taking the min of all responses (min(min(closed_ts_1, min_intent_ts_1), min(closed_ts_2, min_intent_ts_2), ...)). So because the closed timestamp increases monotonically, we effectively ignore all closed timestamps except the first one and we end up with the correct resolved timestamp.

But regardless, we should fix this. My thinking about how to do that in a way that composes with #55461 is a little hazy. We've discussed the need to snapshot additional replica state for read-only operations that may evaluate after dropping latches. We could capture the pre-snapshot closed timestamp in this bundle to ensure that it is properly sequenced with the engine snapshot. But the closed timestamp isn't negligibly cheap to grab, so we probably wouldn't want to grab it if none of the requests needed it.

Since this isn't causing issues, I'm tempted to say that we should try to solve this as part of #55461, but if you have thoughts, I'd be happy to dive deeper into this now. I opened #70974 to continue the discussion.

Deferring sounds fine.
Can we add a code comment here referring to that issue, since our current correctness is fragile.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 124 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@sumeerbhola can you think of any valid reasons to hit this branch? Should we be wrapping and returning this error instead?

I missed this. There is no valid reason, and we should return the error.

Contributor Author

@shralex shralex left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)


-- commits, line 12 at r4:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Before merging this PR, we'll want to squash these three commits down into one. See https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1411744698/Decomposing+changes+and+updating+PRs+during+reviews.

ack


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 257 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think you need the QueryResolvedTimestamp. You should be able to pass a string directly.

I think you meant MustCompile? Fixed.


pkg/storage/in_mem.go, line 47 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Wrap to 80 cols here as well.

Done.


pkg/storage/in_mem.go, line 58 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need this function? Couldn't a caller that wants a deterministically defined value for DisableSeparatedIntents pass a SetSeparatedIntents option to NewDefaultInMemForTesting, which will overwrite whatever value is set by ForTesting?

My understanding is that the randomization will go away, and this would be easier to use. Let me know if you feel strongly about it and I'll remove.

Contributor

@nvb nvb left a comment

:lgtm:

Reviewed 2 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):

Previously, sumeerbhola wrote…

Deferring sounds fine.
Can we add a code comment here referring to that issue, since our current correctness is fragile.

@shralex do you mind adding the following to this comment:

//
// NOTE: This interaction is complicated by multi-request batches, where
// the iterator snapshot is captured during this evaluation of the first
// request, but this does not currently lead to incorrectness. See
// https://github.com/cockroachdb/cockroach/issues/70974.

pkg/storage/in_mem.go, line 58 at r1 (raw file):

Previously, shralex wrote…

My understanding is that the randomization will go away, and this would be easier to use. Let me know if you feel strongly about it and I'll remove.

I don't feel strongly, so I'll defer to you and Sumeer.

@shralex shralex force-pushed the shralex_efficient_QueryResolvedTimestamp3 branch from 62d34ff to dc0efa8 Compare October 2, 2021 00:34
@shralex
Contributor Author

shralex commented Oct 2, 2021

bors r+

@craig
Contributor

craig bot commented Oct 2, 2021

Build failed:

@shralex
Contributor Author

shralex commented Oct 3, 2021

bors r+

@craig
Contributor

craig bot commented Oct 3, 2021

Build succeeded:

@craig craig bot merged commit bc42b66 into cockroachdb:master Oct 3, 2021
@andreimatei
Contributor

This broke all the QueryResolvedTimestamp tests in race mode, with errors like cannot read undeclared span /Local/Lock/Intent"b".
Eg. #71047 and linked tests.

I'm not sure about the details, but I think that the QueryResolvedTimestamp request needs to declare that it'll read some lock-table keys here:

RegisterReadOnlyCommand(roachpb.QueryResolvedTimestamp, DefaultDeclareKeys, QueryResolvedTimestamp)

I'm surprised that DefaultDeclareKeys doesn't declare anything about the lock table for all the other requests, but I don't know what I'm talking about.
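A sketch of what such a fix might look like (the prefix, types, and helper below are hypothetical stand-ins, not the real keys package or the merged fix): a request that reads lock-table keys would declare the lock-table span derived from its MVCC span, in addition to the span DefaultDeclareKeys covers.

```go
package main

import "fmt"

// lockTablePrefix is a stand-in for the real local lock-table key
// prefix; the actual encoding goes through keys.LockTableSingleKey.
const lockTablePrefix = "/Local/Lock/Intent/"

type Span struct{ Key, EndKey string }

// declareKeysWithLocks declares both the request's MVCC span and the
// corresponding lock-table span, so the span checker allows the
// lock-table read.
func declareKeysWithLocks(req Span) []Span {
	return []Span{
		req,
		{Key: lockTablePrefix + req.Key, EndKey: lockTablePrefix + req.EndKey},
	}
}

func main() {
	for _, s := range declareKeysWithLocks(Span{Key: "a", EndKey: "c"}) {
		fmt.Println(s.Key, s.EndKey)
	}
}
```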



Development

Successfully merging this pull request may close these issues.

kv: only scan separated intents span for QueryResolvedTimestamp requests
