Skip to content

kvclient: handle speculative descriptors in the range cache better#51395

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
andreimatei:small.range-cache-next-desc-speculative
Aug 10, 2020
Merged

kvclient: handle speculative descriptors in the range cache better#51395
craig[bot] merged 4 commits intocockroachdb:masterfrom
andreimatei:small.range-cache-next-desc-speculative

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

@andreimatei andreimatei commented Jul 13, 2020

kvclient: handle speculative descriptors in the range cache better …

This patch fixes an assertion in the range cache that would sometimes
trigger, thinking that it has found inconsistent descriptors - two
overlapping descriptors with the same generation, but otherwise
different. This shouldn't happen, at least in the opinion of the
assertion. But it can happen in case one of the two descriptors is
"speculative": an eviction token for the range cache can contain a
replacement descriptor which comes from an intent. That intent might
never commit, in which case the respective generation might eventually
be assigned to some other, different descriptor.

This patch improves the cache's handling of speculative descriptors, and
also speculative leases, by handling them more explicitly. Such
descriptors are now inserted into the cache with Generation=0, and our
descriptor comparison code learns to deal explicitly with them.

Fixes #50995

Release note: None

@andreimatei andreimatei requested a review from nvb July 13, 2020 21:52
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the motivation for this change. Before making the change, I'd like to better understand how things will work with and without it. Specifically, what will happen to a RangeCache that contains a descriptor coming from an intent in the following three cases:

  1. the split completes successfully and resolves the intent
  2. the split fails and rolls back the intent
  3. the split stalls while continuing to heartbeat its txn record

In all three cases, we want to minimize the disruption that the RangeCache will have on foreground traffic trying to access the data on the splitting range.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all 3 cases, the intent descriptor will be put in the cache upon eviction of the original (modulo the problem that said eviction needs to happen specifically through the EvictionToken coming from the original insertion of the pre-split descriptor; evictions resulting from other cache operations won't do).
This patch does not change whether the speculative descriptor makes it into the cache or not. It just changes how vulnerable it is once it made it into the cache. Before, it would make it into the cache with its speculative Generation. Which is bad in case 2 (intent getting rolled back) because then the future real descriptor with that Generation will have trouble getting into the cache because it won't be NewerThan the speculative one. Most likely what will happen is that the insertion of the future descriptor will panic because it will detect a non-copacetic desc with the same generation:

panic(fmt.Sprintf("overlapping descriptors with same gen but different IDs: %s vs %s",

This patch fixes this by inserting the post-split descriptor with a Generation of 0, which makes the respective cache entry look older than any would-be replacement - in particular than the replacement with the same Generation as the rolled-back intent, or even the original descriptor (with a lower generation) if it turns out that we have wrongly evicted it.
This works for the split case, but not for the merge case. For merges, let's say we had [a,b) being merged with [b,c). Once we evict [a,b) we'll try to insert the speculative [a,c). Inserting [a,c) with a generation of 0 will fail if we have [b,c) in the cache. This seems funky, but the current behavior (evicting [b,c)) also seems funky to me. It's one thing to replace empty space in the cache with a speculative descriptor, it's another thing to replace an entry like [b,c) which might be fine for all we know. But I could be convinced otherwise, and we could fix it by finding another way of tracking the "speculative" character of [a,c).

If you buy this, I can update the commit msg.
But I'm interested in discussing the possibility of getting out of this speculation game, just because it seems complex to me and I'm scared of it. In what circumstances does the current speculation actually help? It only helps nodes that read meta2 during the split/merge txn. Which should be ~nobody, right? I think a better way to minimize the impact of splits/merges is to gossip the descriptor updates (as @knz and I keep talking about doing). If we do that gossiping, then I guess we have an analogous question about whether we should gossip speculatively before the split commit, or after the commit. But I'd just start with after and take it from there.
On the other hand, I was thinking that, perhaps, the point of the current speculative descriptor is not really to help with minimizing the impact of splits on traffic; instead maybe it's required for splits to work in the first place - I kinda remember something about these transactions relying on these intents in a subtle way (perhaps for cleaning up the intent itself because, after the commit, only the intent has the right descriptor in it?). Do you know anything about this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Jul 20, 2020

On the other hand, I was thinking that, perhaps, the point of the current speculative descriptor is not really to help with minimizing the impact of splits on traffic; instead maybe it's required for splits to work in the first place - I kinda remember something about these transactions relying on these intents in a subtle way (perhaps for cleaning up the intent itself because, after the commit, only the intent has the right descriptor in it?). Do you know anything about this?

I don't recall the exact details, but it is the case that the speculative descriptors are required for splits to work correctly. I think you're right that it has to do with resolving the range-local intent on the RHS of the split after the split has committed. I wonder if any tests fail if you stop returning intents from READ_UNCOMMITTED scans.

I do support being more clear about when range descriptor information is speculative and when it's not. Being clear about that distinction seems to open up more room for being smart about caching and generally performing incremental cache updates like you're doing in #51437.

My concern is mainly that this distinction is not explicit right now and also that the semantics around generation 0 vs. generation non-0 are not spelled out anywhere, so I'm hesitant to begin relying on them.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you'd want an explicit speculative flag in the RangeCacheEntry? Can do, but then I guess that forces our hand to have a way of resetting the speculative flag too when we know that the speculation has worked out. I guess this part becomes easy with #51437 when a successful RPC either returns an updated descriptor or implicitly confirms that the caller had it right. So I guess I can piggy-back an EvictionToken.Confirm() method on that PR - which would be a no-op for non-speculative descriptors and an eviction and reinsertion as non-speculative for the rest. Sounds like a plan?

Btw, @aayushshah15 demonstrated badness caused by the issue this PR is addressing using the kvnemesis with merges enabled. And I'm now pretty sure that this is also the cause of the assertion failure seen in #50995. So I guess I should write a test.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@aayushshah15
Copy link
Copy Markdown
Contributor

aayushshah15 commented Jul 20, 2020

but it is the case that the speculative descriptors are required for splits to work correctly.

Lazy question, is this documented somewhere? If not, we should probably have a detailed section about this in the comments.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably all we have is this comment:

// or the previous value points to the correct location of the Range. It
// gets even more complicated when there are split-related intents or a txn
// record co-located with a replica involved in the split. Since we cannot
// know the correct answer, we lookup both the pre- and post- transaction
// values.

Note to self: what I've discussed with Nathan is adding some type safety around speculative range cache entries by having the cache contain an interface as elements, with two implementations and a comparison function with a type switch.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andreimatei andreimatei force-pushed the small.range-cache-next-desc-speculative branch 2 times, most recently from c5a412e to b06660d Compare July 26, 2020 22:09
@andreimatei
Copy link
Copy Markdown
Contributor Author

This is RFAL. I've given up on splitting speculative vs regular cache entries into different types because I became unconvinced it was helping everybody, but I've done other stuff to bring speculation more front and center.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest round is much better! Promoting speculative descriptors and leases to be front and center makes this all much easier to reason about.

Reviewed 5 of 5 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, 13 of 13 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvbase/range_cache.go, line 35 at r4 (raw file):

	// DescSpeculative returns true if the descriptor in the entry is "speculative"
	// - i.e. it doesn't correspond to a committed value. Such descriptors have been
	// inserted in the cache with Generation=0.

Add a sentence describing how this is possible (i.e. that they were taken from intents).


pkg/kv/kvbase/range_cache.go, line 42 at r4 (raw file):

	Leaseholder() *roachpb.ReplicaDescriptor

	// Lease returns the cached lease, if known. Returns nil if no lease is known. It's

nit: rewrap some of these lines.


pkg/kv/kvclient/kvcoord/dist_sender.go, line 1749 at r4 (raw file):

				replicas.MoveToFront(idx)
			} else {
				log.Eventf(ctx, "leaseholder %s missing from replicas", leaseholder)

Is this possible? Do we want it to be?


pkg/kv/kvclient/kvcoord/dist_sender.go, line 1809 at r4 (raw file):

			// "speculative". Even if the speculation is correct, we want the serve to
			// return an update, at which point the cached entry will no longer be
			// "speculative".

nit: reference the DescSpeculative() method here.


pkg/kv/kvclient/kvcoord/range_cache.go, line 226 at r4 (raw file):

	// the opportunity to use this nextDesc; if another actor races to evict the
	// respective cache entry and wins, nextDesc becomes useless.
	nextDesc *roachpb.RangeDescriptor

Consider renaming this here and elsewhere to "speculativeDesc" and adding a bit of a comment describing its role.


pkg/kv/kvclient/kvcoord/range_cache.go, line 314 at r4 (raw file):

//
// It's legal to pass in a lease with a zero Sequence; it will be treated as a
// speculative lease and considered newer than any existing lease.

But will be immediately replaced by any update, right?


pkg/kv/kvclient/kvcoord/range_cache.go, line 836 at r4 (raw file):

		if cachedEntry != nil && descsCompatible(cachedEntry.Desc(), entry.Desc()) {
			return false, cachedEntry
			//// cachedEntry is almost certainly newer than entry, but we still check

Did you mean to remove this?


pkg/kv/kvclient/kvcoord/range_cache.go, line 1106 at r4 (raw file):

// - i.e. it doesn't correspond to a committed lease. Such leases have been
// inserted in the cache with Sequence=0.
func (e *rangeCacheEntry) leaseSpeculative() bool {

Is it worth adding this to the interface as well? If only to show the symmetry to between DescSpeculative and LeaseSpeculative?


pkg/kv/kvclient/kvcoord/range_cache.go, line 1125 at r4 (raw file):

// can't be determined what information is newer is when at least one of the
// descriptors is "speculative" (generation=0), or when the lease information is
// speculative.

nit: might as well give this word the same treatment: "speculative (sequence=0).


pkg/roachpb/errors.go, line 424 at r1 (raw file):

		return e.rangesInternal
	}
	// Fallback for 20.1 errors. Remove in 20.3.

Giving up hope?


pkg/roachpb/errors.proto, line 41 at r3 (raw file):

  //
  // It's possible for leases returned here to represent speculative leases, not
  // actual committed leasea. In this case, the lease will not have its Sequence

s/leasea/leases/


pkg/roachpb/errors.proto, line 42 at r3 (raw file):

  // It's possible for leases returned here to represent speculative leases, not
  // actual committed leasea. In this case, the lease will not have its Sequence
  // set.

Should we codify this rule in a (Lease) Speculative method?

Also, do we need to worry about leases without sequence numbers on legacy clusters? I don't think so because it seems like any active lease will have begun incrementing the lease sequence a few releases ago, but it's worth going through the thought experiment.


pkg/sql/physicalplan/replicaoracle/oracle.go, line 71 at r1 (raw file):

	// about any of the nodes that might be tried.
	ChoosePreferredReplica(
		ctx context.Context, rng *roachpb.RangeDescriptor, leaseHolder *roachpb.ReplicaDescriptor, qState QueryState,

nit: is leaseholder one word or two? I don't have strong preferences, but let's be consistent. What do we do elsewhere?

The replica oracles only need a leaseholder, not a full lease. This
patch a more narrow interface in the RangeIterator for exposing just
this instead of the fatter lease. This works towards not exposing leases
any more outside of the range cache since the range cache internally
deals with "speculative leases" hackily and we'd rather keep these
incomplete leases for escaping to clients.

Release note: None
RangeIterator.Leaseholder() remains. Lease() had only one rando caller
left that doesn't matter much.
This patch is towards not exposing lease info from the
RangeDescriptorCache in order to not burden clients with details about
speculative vs non-speculative leases.

Release note: None
This patch fixes an assertion in the range cache that would sometimes
trigger, thinking that it has found inconsistent descriptors - two
overlapping descriptors with the same generation, but otherwise
different. This shouldn't happen, at least in the opinion of the
assertion. But it can happen in case one of the two descriptors is
"speculative": an eviction token for the range cache can contain a
replacement descriptor which comes from an intent. That intent might
never commit, in which case the respective generation might eventually
be assigned to some other, different descriptor.

The interface presented by the cache to packages outside KV
(kvbase.RangeCacheEntry) is improved; the implementation of
RangeCacheEntry is moved to the kv package, and an interface is left
behind that talks about speculative data.

This patch improves the cache's handling of speculative descriptors, and
also speculative leases, by handling them more explicitly. Such
descriptors are now inserted into the cache with Generation=0, and our
descriptor comparison code learns to deal explicitly with them.

Fixes cockroachdb#50995

Release note: None
@andreimatei andreimatei force-pushed the small.range-cache-next-desc-speculative branch from c284fe4 to 6fc9d61 Compare August 4, 2020 19:41
Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvbase/range_cache.go, line 35 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a sentence describing how this is possible (i.e. that they were taken from intents).

done


pkg/kv/kvbase/range_cache.go, line 42 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: rewrap some of these lines.

done


pkg/kv/kvclient/kvcoord/dist_sender.go, line 1749 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this possible? Do we want it to be?

I think so. Added a comment explaining how it can happen.


pkg/kv/kvclient/kvcoord/dist_sender.go, line 1809 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: reference the DescSpeculative() method here.

done


pkg/kv/kvclient/kvcoord/range_cache.go, line 226 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider renaming this here and elsewhere to "speculativeDesc" and adding a bit of a comment describing its role.

see now


pkg/kv/kvclient/kvcoord/range_cache.go, line 314 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But will be immediately replaced by any update, right?

expanded the comment


pkg/kv/kvclient/kvcoord/range_cache.go, line 836 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to remove this?

yeah, thanks
I've removed it cause I thought handling it's not worth the code (and it was giving me some other trouble too I think)


pkg/kv/kvclient/kvcoord/range_cache.go, line 1106 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it worth adding this to the interface as well? If only to show the symmetry to between DescSpeculative and LeaseSpeculative?

done

but btw I'm gonna move the range cache to its own package and get rid of the RangeCacheEntry iface in kvbase


pkg/kv/kvclient/kvcoord/range_cache.go, line 1125 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: might as well give this word the same treatment: "speculative (sequence=0).

done


pkg/roachpb/errors.go, line 424 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Giving up hope?

:)


pkg/roachpb/errors.proto, line 41 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/leasea/leases/

done


pkg/roachpb/errors.proto, line 42 at r3 (raw file):

Should we codify this rule in a (Lease) Speculative method?

done, and used the new method in the next commit

Also, do we need to worry about leases without sequence numbers on legacy clusters? I don't think so because it seems like any active lease will have begun incrementing the lease sequence a few releases ago, but it's worth going through the thought experiment.

yeah, I've thought about this and I don't think we need to worry about it. One thing that I've tested is that a new cluster doesn't produce leases with seq=0 even immediately after bootstrapping, and it doesn't.


pkg/sql/physicalplan/replicaoracle/oracle.go, line 71 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: is leaseholder one word or two? I don't have strong preferences, but let's be consistent. What do we do elsewhere?

one word is more prevalent (at least when there's not other words in the variable name). Switched to that.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 3 of 18 files at r5, 2 of 3 files at r7, 13 of 13 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/roachpb/errors.proto, line 42 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Should we codify this rule in a (Lease) Speculative method?

done, and used the new method in the next commit

Also, do we need to worry about leases without sequence numbers on legacy clusters? I don't think so because it seems like any active lease will have begun incrementing the lease sequence a few releases ago, but it's worth going through the thought experiment.

yeah, I've thought about this and I don't think we need to worry about it. One thing that I've tested is that a new cluster doesn't produce leases with seq=0 even immediately after bootstrapping, and it doesn't.

Thanks for running that test. The moment after bootstrapping was also concerning me.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 10, 2020

Build succeeded:

@craig craig bot merged commit dd88cc2 into cockroachdb:master Aug 10, 2020
@andreimatei andreimatei deleted the small.range-cache-next-desc-speculative branch August 21, 2020 14:45
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 1, 2020
I `stressrace`ed it for ~4K runs and couldn't repro.

```
4456 runs so far, 0 failures, over 28m5s
```

In all likelihood, this test was fixed by @andreimatei's changes around
the range descriptor cache. In particular, cockroachdb#51437 and/or cockroachdb#51395.

Release justification: Non production code change
Release note: None

Fixes cockroachdb#50795
craig bot pushed a commit that referenced this pull request Sep 1, 2020
53667: backupccl: fix database resolution for revision_history cluster backups r=dt a=pbardea

Previously database descriptor changes were not tracked when performing
cluster backups with revision history.

Fixes #52392.

Release justification: bug fix
Release note (bug fix): Database creation/deletion was previously not
correctly tracked by revision_history cluster backups. This is now
fixed.

53692: sql: fix enums in `pg_catalog.pg_attrdef` r=rohany a=rohany

Fixes #53687.

This commit ensures that enums are displayed with their logical
representation in the catalog table `pg_catalog.pg_attrdef`.

Release justification: bug fix
Release note: None

53777: kvserver: unskip TestStoreRangeMergeConcurrentRequests r=aayushshah15 a=aayushshah15

I `stressrace`ed it for ~4K runs and couldn't repro.

```
4456 runs so far, 0 failures, over 28m5s
```

In all likelihood, this test was fixed by @andreimatei's changes around
the range descriptor cache. In particular, #51437 and/or #51395.

Fixes #50795

Release justification: Non production code change
Release note: None

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ccl/importccl: TestImportPgDump failed

4 participants