kvserver: QueryResolvedTimestamp should look for intents in LockTable #70852
Conversation
sumeerbhola
left a comment
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):
// because QueryResolvedTimestamp requests are often run without acquiring
// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
// so latches won't help them to synchronize with writes.
unrelated to this PR, I'm worried by this comment @nvanbenschoten
- We have discussed that we would like a `Reader` that returns true for `ConsistentIterators` to contain a version of the engine from when the `Reader` was created. I think that would not be acceptable in this latchless situation where we want to see the engine state after the closed timestamp has been retrieved. Is my understanding correct?
- Even with the current lazy creation of the first cloneable iterator, we could have multiple requests in a batch, and a request in the batch that preceded this `QueryResolvedTimestampRequest` could have created that iterator that we will subsequently clone. So the iterator we create in `createMinIntentTimestamp` could reflect the engine state prior to this closed timestamp. Is there something preventing this?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 111 at r1 (raw file):
// version for each key. We're only looking for MVCCMetadata versions, which
// will always be the first version of a key if it exists, so its fine that
// we skip over all other versions of keys.
This comment is stale.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 136 at r1 (raw file):
// If this is an intent, account for it.
if meta.Txn != nil {
the case of `meta.Txn == nil` should now be an error.
We have two uses of `MVCCMetadata` that used to both be written to timestamp 0:
- interleaved intents, which are now moved to the lock table. They always have a non-nil txn.
- inline meta, for storing timeseries (these are not MVCC). These are still at timestamp 0 (not in the lock table), and have a nil txn.
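The distinction above can be sketched with simplified stand-in types. These are not the real CockroachDB protos; `TxnMeta`, `MVCCMetadata`, and `checkLockTableMeta` here are illustrative only. The point is that any metadata read from the lock table should carry a non-nil transaction, so a nil `Txn` there indicates corruption:

```go
package main

import (
	"errors"
	"fmt"
)

// TxnMeta and MVCCMetadata are simplified stand-ins for the CockroachDB
// protos; only the fields needed for the illustration are present.
type TxnMeta struct{ ID string }

type MVCCMetadata struct {
	Txn *TxnMeta // non-nil for intents; nil only for inline (non-MVCC) values
}

// checkLockTableMeta is a hypothetical helper: metadata found under a lock
// table key must belong to a transaction, because inline metadata (e.g.
// timeseries) is never written to the lock table.
func checkLockTableMeta(meta MVCCMetadata) error {
	if meta.Txn == nil {
		return errors.New("nil transaction in lock table metadata")
	}
	return nil
}

func main() {
	fmt.Println(checkLockTableMeta(MVCCMetadata{Txn: &TxnMeta{ID: "txn1"}})) // <nil>
	fmt.Println(checkLockTableMeta(MVCCMetadata{}))                          // error
}
```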
pkg/storage/in_mem.go, line 47 at r1 (raw file):
// NewDefaultInMemForTesting allocates and returns a new, opened in-memory engine with
// the default configuration. The caller must call the engine's Close method
// when the engine is no longer needed.
Can you add to the comment here that this randomizes whether separated intents are written.
pkg/storage/in_mem.go, line 58 at r1 (raw file):
// NewDefaultInMemForStickyEngineTesting is just like NewDefaultInMemForTesting but uses
// ForStickyEngineTesting to always separate intents from MVCC data, instead of
// randomizing this setting.
func NewDefaultInMemForStickyEngineTesting(opts ...ConfigOption) Engine {
Let's call this something like `NewInMemForTesting(enableSeparatedIntents bool, opts ...ConfigOption)` and not use the term sticky-engine here, since this is not specifically to do with sticky engines, and neither is the test that calls it.
In hindsight, I should not have introduced `ForStickyEngineTesting` given that we already have `func SetSeparatedIntents(disable bool) ConfigOption`. So the implementation here should avoid using `ForStickyEngineTesting`.
shralex
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):
Previously, sumeerbhola wrote…
unrelated to this PR, I'm worried by this comment @nvanbenschoten
- We have discussed that we would like a `Reader` that returns true for `ConsistentIterators` to contain a version of the engine from when the `Reader` was created. I think that would not be acceptable in this latchless situation where we want to see the engine state after the closed timestamp has been retrieved. Is my understanding correct?
- Even with the current lazy creation of the first cloneable iterator, we could have multiple requests in a batch, and a request in the batch that preceded this `QueryResolvedTimestampRequest` could have created that iterator that we will subsequently clone. So the iterator we create in `createMinIntentTimestamp` could reflect the engine state prior to this closed timestamp. Is there something preventing this?
I'll wait for @nvanbenschoten's input on this.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 111 at r1 (raw file):
Previously, sumeerbhola wrote…
This comment is stale.
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 136 at r1 (raw file):
Previously, sumeerbhola wrote…
the case of `meta.Txn == nil` should now be an error.
We have two uses of `MVCCMetadata` that used to both be written to timestamp 0:
- interleaved intents, which are now moved to the lock table. They always have a non-nil txn.
- inline meta, for storing timeseries (these are not MVCC). These are still at timestamp 0 (not in the lock table), and have a nil txn.
Done.
pkg/storage/in_mem.go, line 47 at r1 (raw file):
Previously, sumeerbhola wrote…
Can you add to the comment here that this randomizes whether separated intents are written.
Done.
pkg/storage/in_mem.go, line 58 at r1 (raw file):
Previously, sumeerbhola wrote…
Let's call this something like `NewInMemForTesting(enableSeparatedIntents bool, opts ...ConfigOption)` and not use the term sticky-engine here, since this is not specifically to do with sticky engines, and neither is the test that calls it.
In hindsight, I should not have introduced `ForStickyEngineTesting` given that we already have `func SetSeparatedIntents(disable bool) ConfigOption`. So the implementation here should avoid using `ForStickyEngineTesting`.
Done.
sumeerbhola
left a comment
looks good. Couple of minor things.
Reviewed 3 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 50 at r2 (raw file):
}
// Setup:
Can you add a comment here that with separated intents the actual key layout in the store is not what is listed below?
And if you are feeling up to it, you could add a test which manually writes an `MVCCMetadata` that is missing the `Txn` field to a lock table key, to test the error path you added. The key can be constructed by first constructing a `LockTableKey`, then calling `ToEngineKey`, and then putting by doing `db.PutEngineKey`.
pkg/storage/in_mem.go, line 57 at r2 (raw file):
// NewInMemForTesting is just like NewDefaultInMemForTesting but allows to deterministically define whether it
// separates intents from MVCC data.
We wrap comments to 80 columns and code to 100 columns, iirc. This comment line looks longer than that.
In case you are using goland, there is a goland plugin to do the comment wrapping, mentioned in our internal wiki here https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#GolandTipsandTricks-SettingupWrapToColumn
shralex
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 50 at r2 (raw file):
Previously, sumeerbhola wrote…
can you add a comment here that with separated intents the actual key layout in the store is not what is listed below.
And if you are feeling up to it, you could add a test which manually writes an `MVCCMetadata` that is missing the Txn field to a lock table key, to test the error path you added. The key can be constructed by first constructing a `LockTableKey` then calling `ToEngineKey` and then putting by doing `db.PutEngineKey`.
Done.
pkg/storage/in_mem.go, line 57 at r2 (raw file):
Previously, sumeerbhola wrote…
We wrap comments to 80 columns and code to 100 columns, iirc. This comment line looks longer than that.
In case you are using goland, there is a goland plugin to do the comment wrapping, mentioned in our internal wiki here https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#GolandTipsandTricks-SettingupWrapToColumn
Done.
Force-pushed dc5b14a to 3d81a6f
sumeerbhola
left a comment
Reviewed 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
nvb
left a comment
Nice job on this.
Reviewed 1 of 3 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @shralex and @sumeerbhola)
-- commits, line 12 at r4:
Before merging this PR, we'll want to squash these three commits down into one. See https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1411744698/Decomposing+changes+and+updating+PRs+during+reviews.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):
Previously, shralex wrote…
I'll wait for @nvanbenschoten 's input for this.
This is a very good catch. You are correct that this is improperly relying on the Reader's LSM snapshot being lazily captured at the time of the iterator creation. This will be completely broken when we begin capturing Reader snapshots eagerly and is broken today in cases where we have multiple QueryResolvedTimestampRequests in the same batch.
What's even more interesting is that this isn't actually causing issues today, for a very subtle reason. We do issue multiple QueryResolvedTimestampRequest in the same batch to the same range here, so we should hit the second problem. This means that for all but the first request in the batch, we could observe a closed timestamp that was carried in a later log entry than the last entry applied to the observed version of the state machine in computeMinIntentTimestamp. However, because of the way that these QueryResolvedTimestamp responses are combined, we happen to avoid any problems. Each response carries min(closed_ts, min_intent_ts), and we combine the responses by taking the min of all responses (min(min(closed_ts_1, min_intent_ts_1), min(closed_ts_2, min_intent_ts_2), ...)). So because the closed timestamp increases monotonically, we effectively ignore all closed timestamps except the first one and we end up with the correct resolved timestamp.
But regardless, we should fix this. My thinking about how to do that in a way that composes with #55461 is a little hazy. We've discussed the need to snapshot additional replica state for read-only operations that may evaluate after dropping latches. We could capture the pre-snapshot closed timestamp in this bundle to ensure that it is properly sequenced with the engine snapshot. But the closed timestamp isn't negligibly cheap to grab, so we probably wouldn't want to grab it if none of the requests needed it.
Since this isn't causing issues, I'm tempted to say that we should try to solve this as part of #55461, but if you have thoughts, I'd be happy to dive deeper into this now. I opened #70974 to continue the discussion.
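The min-combination argument above can be checked with a small self-contained sketch. The timestamps here are hypothetical integers standing in for `hlc.Timestamp` values, and `minTS` is an illustrative helper, not CockroachDB code:

```go
package main

import "fmt"

// minTS stands in for hlc.Timestamp comparison; plain integers suffice here.
func minTS(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Hypothetical per-request values for three QueryResolvedTimestamp
	// requests in one batch. Closed timestamps only increase across the
	// batch; the min-intent timestamps are arbitrary.
	closed := []int64{10, 12, 15}
	intent := []int64{20, 11, 30}

	// Each response carries min(closed_ts, min_intent_ts); responses are
	// combined by taking the min across all of them.
	combined := minTS(closed[0], intent[0])
	for i := 1; i < len(closed); i++ {
		combined = minTS(combined, minTS(closed[i], intent[i]))
	}

	// Since closed[0] <= closed[i] for every i, the later (possibly
	// "too new") closed timestamps can never determine the result:
	// it equals min(closed[0], all intents).
	fmt.Println(combined) // 10
}
```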
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 124 at r4 (raw file):
lockedKey, err := keys.DecodeLockTableSingleKey(engineKey.Key)
if err != nil {
	continue
@sumeerbhola can you think of any valid reasons to hit this branch? Should we be wrapping and returning this error instead?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 131 at r4 (raw file):
}
if meta.Txn == nil {
	return hlc.Timestamp{}, nil, errors.Errorf("nil transaction in LockTable. Key: %v, mvcc meta: %v",
s/Errorf/AssertionFailedf/
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 216 at r4 (raw file):
}

func TestSeparatedIntentErrors(t *testing.T) {
nit: rename this to TestQueryResolvedTimestampErrors or something like that. This test is part of the pkg/kv/kvserver/batcheval package, so it should be sufficiently qualified.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 257 at r4 (raw file):
_, err := QueryResolvedTimestamp(ctx, db, cArgs, &resp)
require.Error(t, err)
require.Regexp(t, regexp.MustCompile(`unmarshaling mvcc meta`), err.Error())
I don't think you need the QueryResolvedTimestamp. You should be able to pass a string directly.
pkg/storage/in_mem.go, line 47 at r1 (raw file):
Previously, shralex wrote…
Done.
nit: Wrap to 80 cols here as well.
pkg/storage/in_mem.go, line 58 at r1 (raw file):
Previously, shralex wrote…
Done.
Do we need this function? Couldn't a caller that wants a deterministically defined value for `DisableSeparatedIntents` pass a `SetSeparatedIntents` option to `NewDefaultInMemForTesting`, which will overwrite whatever value is set by `ForTesting`?
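The overwrite behavior this relies on is the usual functional-options pattern, where later options win because they are applied in order. A minimal sketch, using simplified stand-ins for `ConfigOption`, `ForTesting`, `SetSeparatedIntents`, and the constructor (not the real `pkg/storage` implementations, and with the randomization stubbed out to keep the example deterministic):

```go
package main

import "fmt"

// config and ConfigOption are simplified stand-ins, not the real
// pkg/storage types.
type config struct {
	disableSeparatedIntents bool
}

type ConfigOption func(*config)

// ForTesting stands in for the option that randomizes the setting.
func ForTesting() ConfigOption {
	return func(c *config) { c.disableSeparatedIntents = pseudoRandomBool() }
}

// SetSeparatedIntents deterministically sets the field.
func SetSeparatedIntents(disable bool) ConfigOption {
	return func(c *config) { c.disableSeparatedIntents = disable }
}

func pseudoRandomBool() bool { return true } // stub for the randomization

// newEngineConfig applies the options in order, so a later option
// overwrites whatever an earlier one set.
func newEngineConfig(opts ...ConfigOption) config {
	var c config
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	c := newEngineConfig(ForTesting(), SetSeparatedIntents(false))
	fmt.Println(c.disableSeparatedIntents) // false: the explicit option wins
}
```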
sumeerbhola
left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is a very good catch. You are correct that this is improperly relying on the Reader's LSM snapshot being lazily captured at the time of the iterator creation. This will be completely broken when we begin capturing Reader snapshots eagerly and is broken today in cases where we have multiple `QueryResolvedTimestampRequest`s in the same batch.
What's even more interesting is that this isn't actually causing issues today, for a very subtle reason. We do issue multiple `QueryResolvedTimestampRequest`s in the same batch to the same range here, so we should hit the second problem. This means that for all but the first request in the batch, we could observe a closed timestamp that was carried in a later log entry than the last entry applied to the observed version of the state machine in `computeMinIntentTimestamp`. However, because of the way that these `QueryResolvedTimestamp` responses are combined, we happen to avoid any problems. Each response carries `min(closed_ts, min_intent_ts)`, and we combine the responses by taking the min of all responses (`min(min(closed_ts_1, min_intent_ts_1), min(closed_ts_2, min_intent_ts_2), ...)`). So because the closed timestamp increases monotonically, we effectively ignore all closed timestamps except the first one and we end up with the correct resolved timestamp.
But regardless, we should fix this. My thinking about how to do that in a way that composes with #55461 is a little hazy. We've discussed the need to snapshot additional replica state for read-only operations that may evaluate after dropping latches. We could capture the pre-snapshot closed timestamp in this bundle to ensure that it is properly sequenced with the engine snapshot. But the closed timestamp isn't negligibly cheap to grab, so we probably wouldn't want to grab it if none of the requests needed it.
Since this isn't causing issues, I'm tempted to say that we should try to solve this as part of #55461, but if you have thoughts, I'd be happy to dive deeper into this now. I opened #70974 to continue the discussion.
Deferring sounds fine.
Can we add a code comment here referring to that issue, since our current correctness is fragile.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 124 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@sumeerbhola can you think of any valid reasons to hit this branch? Should we be wrapping and returning this error instead?
I missed this. There is no valid reason, and we should return the error.
shralex
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @shralex, and @sumeerbhola)
-- commits, line 12 at r4:
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Before merging this PR, we'll want to squash these three commits down into one. See https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/1411744698/Decomposing+changes+and+updating+PRs+during+reviews.
ack
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp_test.go, line 257 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think you need the `QueryResolvedTimestamp`. You should be able to pass a string directly.
I think you meant `MustCompile`? Fixed.
pkg/storage/in_mem.go, line 47 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: Wrap to 80 cols here as well.
Done.
pkg/storage/in_mem.go, line 58 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need this function? Couldn't a caller that wants a deterministically defined value for `DisableSeparatedIntents` pass a `SetSeparatedIntents` option to `NewDefaultInMemForTesting`, which will overwrite whatever value is set by `ForTesting`?
My understanding is that the randomization will go away, and this would be easier to use. Let me know if you feel strongly about it and I'll remove.
nvb
left a comment
Reviewed 2 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 57 at r1 (raw file):
Previously, sumeerbhola wrote…
Deferring sounds fine.
Can we add a code comment here referring to that issue, since our current correctness is fragile.
@shralex do you mind adding the following to this comment:
//
// NOTE: This interaction is complicated by multi-request batches, where
// the iterator snapshot is captured during this evaluation of the first
// request, but this does not currently lead to incorrectness. See
// https://github.com/cockroachdb/cockroach/issues/70974.

pkg/storage/in_mem.go, line 58 at r1 (raw file):
Previously, shralex wrote…
My understanding is that the randomization will go away, and this would be easier to use. Let me know if you feel strongly about it and I'll remove.
I don't feel strongly, so I'll defer to you and Sumeer.
Previously, a request called "QueryResolvedTimestamp" that is used to determine the "resolved timestamp" of a key range on a given replica cost O(num_keys_in_span), because it needed to find all intents in the span and intents were interleaved with key-value versions. A multi-release project to separate out intents from MVCC data is now far enough along that we can make this request more efficient by scanning only the lock-table keyspace, reducing its cost to O(num_locks_in_span).

Fixes: cockroachdb#69717
Release justification: This is going into 22.1 where the migration to separated intents is complete.
Release note: none
Force-pushed 62d34ff to dc0efa8
bors r+

Build failed:

bors r+

Build succeeded:
This broke all the … I'm not sure about the details, but I think that the … I'm surprised that `DefaultDeclareKeys` doesn't declare anything about the lock table for all the other requests, but I don't know what I'm talking about.
Previously, a request called “QueryResolvedTimestamp” that is used to determine the “resolved timestamp” of a key range on a given replica cost O(num_keys_in_span), because it needed to find all intents in the span and intents were interleaved with key-value versions. A multi-release project to separate out intents from MVCC data is now far enough along that we can make this request more efficient by scanning only the lock-table keyspace, reducing its cost to O(num_locks_in_span).
Release note: None
Release justification: This is going into 22.1 where the migration to separated intents is complete
Fixes: #69717