Skip to content

descs: fix txn commit waiting on wrong lease version#87706

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
postamar:fix-87672
Sep 9, 2022
Merged

descs: fix txn commit waiting on wrong lease version#87706
craig[bot] merged 1 commit intocockroachdb:masterfrom
postamar:fix-87672

Conversation

@postamar
Copy link
Copy Markdown

@postamar postamar commented Sep 9, 2022

Recent work on the descs.Collection (PR #87067) introduced a regression
in which it would return bad lease.IDVersion versions for the uncommitted
descriptors.

Fixes #87672.

Release justification: important bug fix
Release note: None

@postamar postamar requested a review from a team September 9, 2022 14:53
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM though I don't have a totally clear understanding as to how it happens.

"expected original version of uncommitted %s %q (%d) to be non-zero",
uncommitted.DescriptorType(), uncommitted.GetName(), uncommitted.GetID())
}
// Ignore uncommitted descriptors which for whatever reason have not
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you understand the reason?

Copy link
Copy Markdown
Author

@postamar postamar Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a reason unknown to me we allow uncommitted descriptors to be added with the same version as the cluster version, in addition to clusterVersion+1. This is not new and I haven't questioned it.

Prior to #87067 these were excluded from consideration by an early return on !mut.IsUncommittedDescriptor(), which happens when the version counter hasn't actually been bumped.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should file an issue to tighten this. I can't see any instances where AddUncommittedDescriptor isn't preceded by a MaybeIncrementVersion call, unless it's obviously a new descriptor.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old world, before your rewrite, we'd add to uncommittedDescriptors or whatever it was then called (storedDescriptors) on the read path, and the caller which resolved that descriptor might decide to throw that away.

I'd still like to understand how we were modifying this set which, as you say, shouldn't have anything added to it unless it is being written. One note is that the thing which triggers the bug is the underlying call to GetAllDescriptors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this and the bug is deeper, even though the changes brought by this PR are good.

The problem is that we upsert into the storage cache when really we should be inserting. Consider this test case which you supplied. What happens in that transaction is that CREATE SCHEMA sc2 adds an uncommitted db descriptor to the collection with the new schema entry and bumps its version counter. It then Runs that batch. The subsequent crdb_internal.tables query calls a GetAllDescriptors which overwrites the storage cache with the uncommitted descriptor!

This is why the original descriptors in the uncommitted layer can't be relied on.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why the original descriptors in the uncommitted layer can't be relied on.

Yeah... this seems like the right answer. I think we should avoid touching original in AddUncommittedDescriptor when we already have an entry for the descriptor as opposed to pulling from the stored cache.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I'll revert to draft mode and push a fix in that direction and we'll see what happens.

) error {
return ud.uncommitted.IterateByID(func(entry catalog.NameEntry) error {
if o := ud.original.Get(entry.GetID()); o != nil {
return fn(lease.NewIDVersionPrev(o.GetName(), o.GetID(), o.(catalog.Descriptor).GetVersion()))
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug proper was that fn was being called a bit too liberally here. The regression was introduced in this PR, I highlighted the line with the relevant part of the old logic: https://github.com/cockroachdb/cockroach/pull/87067/files#diff-74473a29d93e28d8f048009aed817c687c38ab748f111c95fba1ec148f7c0768L664

I fixed this and pushed a more verbose version of this code down to the call site in the hopes of clarifying what it's supposed to do.

@postamar postamar marked this pull request as draft September 9, 2022 15:51
@postamar
Copy link
Copy Markdown
Author

postamar commented Sep 9, 2022

Well wouldn't you know it, we're updating view back-references in tables based on leased descriptors.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this feels like the right fix!

LGTM

@postamar
Copy link
Copy Markdown
Author

postamar commented Sep 9, 2022

Yes I feel much better too! Glad to get rid of this weirdness with the versions.

Recent work on the descs.Collection (PR cockroachdb#87067) introduced a regression
in which it would return bad lease.IDVersion versions for the uncommitted
descriptors.

In addition to fixing this, this commit corrects the lifecycle of the
`original` descriptors in the uncommitted layer, which now remain
unaffected by changes in the storage layer.

This commit tightens the constraints on what can be passed to
AddUncommittedDescriptor.

Fixes cockroachdb#87672.

Release justification: important bug fix
Release note: None
@postamar postamar marked this pull request as ready for review September 9, 2022 18:48
@postamar
Copy link
Copy Markdown
Author

postamar commented Sep 9, 2022

Thanks for the review.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 9, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: hang possible due to bad version bookkeeping

3 participants