Skip to content

sql: use rangefeeds for notifications about changes to table descriptors#48159

Merged
craig[bot] merged 6 commits intocockroachdb:masterfrom
ajwerner:ajwerner/rangefeeds-for-schema-leases
May 12, 2020
Merged

sql: use rangefeeds for notifications about changes to table descriptors#48159
craig[bot] merged 6 commits intocockroachdb:masterfrom
ajwerner:ajwerner/rangefeeds-for-schema-leases

Conversation

@ajwerner
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner commented Apr 29, 2020

This PR comes in several commits:

  1. Enable rangefeeds unconditionally on system ranges
  • this is going to need to some modification to work with tenant system spans
  1. Add a cluster version for the above functionality
  2. Add the basic (messy) implementation of rangefeed based lease manager notifications
  3. Add some testing knobs
  4. Clean up 3 in a number of ways, add some more testing, fix a synchronization issue.
  5. Updates a test to not break a new assertion that values in the descriptor table actually be descriptors

Relates to but does not fully resolve #47150. Another PR will follow-up and disable the gossip of the descriptors table and lift the enforcement of the system config transaction anchor.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 29, 2020

❌ The GitHub CI (Cockroach) build has failed on 915433e0.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner added the A-schema-descriptors Relating to SQL table/db descriptor handling. label Apr 29, 2020
@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 915433e to 72300f9 Compare April 29, 2020 15:31
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 29, 2020

❌ The GitHub CI (Cockroach) build has failed on 72300f9e.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 72300f9 to 06730b9 Compare April 30, 2020 14:14
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 30, 2020

❌ The GitHub CI (Cockroach) build has failed on 06730b90.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 06730b9 to a71ad01 Compare April 30, 2020 14:53
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 30, 2020

❌ The GitHub CI (Cockroach) build has failed on a71ad012.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from a71ad01 to a9337ae Compare April 30, 2020 18:35
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 30, 2020

❌ The GitHub CI (Cockroach) build has failed on a9337aec.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@jordanlewis
Copy link
Copy Markdown
Member

Exciting!

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from a9337ae to d9feebd Compare April 30, 2020 18:55
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 30, 2020

❌ The GitHub CI (Cockroach) build has failed on d9feebde.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 41f1a61 to ff5ac79 Compare April 30, 2020 22:58
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 30, 2020

❌ The GitHub CI (Cockroach) build has failed on ff5ac797.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from ff5ac79 to 348e4da Compare May 4, 2020 14:11
@ajwerner ajwerner requested review from spaskob and thoszhang May 4, 2020 14:29
@ajwerner
Copy link
Copy Markdown
Contributor Author

ajwerner commented May 4, 2020

@spaskob @lucy-zhang - I know there's two tests left to fix and I'm working on them but it might be worth reviewing this. If the split between commits 3 and 5 are too hard to review, let me know and I'll put some time into rebasing them in a more sane way.

@ajwerner
Copy link
Copy Markdown
Contributor Author

ajwerner commented May 4, 2020

I'm also going to add a test that ensures that it's okay to start up a lease manager with a nil gossip assuming the cluster version is appropriate. I'm deferring the work to stop gossiping the descriptors table and to remove the restriction on the SetSystemConfigTrigger() for a later PR.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 348e4da to 2df1aea Compare May 4, 2020 14:49
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 4, 2020

❌ The GitHub CI (Cockroach) build has failed on 2df1aea0.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Copy Markdown
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool stuff... Andrew, I realized half way through reviewing that I'm not sure whether you'd prefer reviewers to review in "atomic commits" or "full PR" style. Looks like you're going for the latter which I'm fine with but just checking.

Reviewed 7 of 7 files at r1, 3 of 3 files at r2, 7 of 7 files at r3, 3 of 3 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, and @spaskob)


pkg/kv/kvserver/client_replica_test.go, line 3142 at r1 (raw file):

}

func TestProposalOverhead(t *testing.T) {

Mind adding a comment on the purpose of this test?


pkg/kv/kvserver/replica_write.go, line 573 at r1 (raw file):

	batch := r.store.Engine().NewBatch()
	var opLogger *storage.OpLoggerBatch
	if r.isSystemRange() || RangefeedEnabled.Get(&r.store.cfg.Settings.SV) {

Nice, this will improve my situation for LISTEN/NOTIFY as well.


pkg/sql/lease.go, line 895 at r3 (raw file):

//
// TODO(ajwerner): This really just needs to acquire a lease on the ID newer
// than something. Not necessarily newer than

nit: this is an incomplete comment.


pkg/sql/lease.go, line 1795 at r3 (raw file):

	// we don't know if rangefeeds are enabled. Grr. I guess the deal is that when
	// it gets enabled then we flip the switch? We don't want to force the nodes
	// to restart to get the goodness. We also want to be able to stop gossipping.

I'm not sure but I think these comments might be your internal brainstorming.


pkg/sql/lease.go, line 1919 at r3 (raw file):

			// change occurred.
			const withDiff = false
			log.Infof(ctx, "starting rangefeed from %v on %v", m.initialTimestamp, span)

This might want to be a VEventf, but if not, you should probably say "starting lease rangefeed" to make it more clear.


pkg/sql/lease.go, line 1940 at r3 (raw file):

		var descriptor sqlbase.Descriptor
		if err := ev.Value.GetProto(&descriptor); err != nil {
			log.Fatalf(ctx, "%s: unable to unmarshal descriptor %v", ev.Key, ev.Value)

I don't think this should be a Fatal. A corrupted descriptor would be bad, but I'd prefer not to see an unrecoverable crashloop from it.


pkg/sql/lease.go, line 1961 at r3 (raw file):

		}
		if log.V(0) {
			log.Infof(ctx, "%s: refreshing lease table: %d (%s), version: %d, dropped: %t",

Should this be V(1)?


pkg/sql/lease.go, line 1977 at r3 (raw file):

			case e := <-eventCh:
				if e.Checkpoint != nil {
					log.Infof(ctx, "got checkpoint %v", e.Checkpoint)

V() it?


pkg/sql/lease.go, line 2106 at r5 (raw file):

) chan struct{} {
	// TODO(ajwerner): Add a callback to notify about version changes.
	// Checking is pretty cheap but really this should be a callback.

Isn't there a cluster setting that gets changed when a version is finalized? Cluster settings have callbacks attached to them, could you use that?


pkg/sql/lease_test.go, line 2161 at r3 (raw file):

				},
				Server: &server.TestingKnobs{
					// TODO(ajwerner): A

☝️


pkg/sql/lease_test.go, line 2172 at r3 (raw file):

	db := tc.ServerConn(0)
	tdb := sqlutils.MakeSQLRunner(db)
	tdb.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms';")

Why does this need to be set?


pkg/sql/lease_test.go, line 2229 at r3 (raw file):

	require.NoError(t, txA.Rollback())

	const extremelyLongTime = 100 * time.Second

This is going to be kind of annoying in CI if it starts failing, can it be less than 100 secs?

@ajwerner
Copy link
Copy Markdown
Contributor Author

ajwerner commented May 4, 2020

I realized half way through reviewing that I'm not sure whether you'd prefer reviewers to review in "atomic commits" or "full PR" style. Looks like you're going for the latter which I'm fine with but just checking.

I messed this PR up a bit and didn't really know what to do about it. commits 3 and 5 are super tightly coupled. I probably should just merge them. Then commit-by-commit would work. My bad. I knew it was sort of a mess.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 2df1aea to 930c38a Compare May 4, 2020 18:23
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 4, 2020

❌ The GitHub CI (Cockroach) build has failed on 930c38a1.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from bf5705e to 64e4730 Compare May 7, 2020 17:14
Copy link
Copy Markdown
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get ahead of the testing game here by writing a mixed logical version test? I'm not sure what parts we'd want to specifically stress, but just wanted to float the general idea.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, and @spaskob)


pkg/sql/lease.go, line 1940 at r3 (raw file):

Previously, ajwerner wrote…

Fair enough. Moved it to be a log.Error. Do we want to send a report or something when this happens? I don't know how to take a more drastic measure than log.Error that's less drastic than fatal.

You can use log.ReportOrPanic.


pkg/sql/lease.go, line 1866 at r13 (raw file):

				}

			case <-s.ShouldStop():

Should this be ShouldQuiesce? I've had some trouble with servers hanging when they dont' quit until the Stop signal. It's possible I dont' fully understand when to use which.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch 2 times, most recently from 51fce16 to 1214c0e Compare May 7, 2020 23:01
Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)


pkg/sql/lease.go, line 1940 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

You can use log.ReportOrPanic.

Done.


pkg/sql/lease.go, line 1866 at r13 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Should this be ShouldQuiesce? I've had some trouble with servers hanging when they dont' quit until the Stop signal. It's possible I dont' fully understand when to use which.

I used ShouldStop here because the old loop used ShouldStop. It seems like there are two possibilities:

  1. We have some code that wants to run unimpeded and complete a schema change cleanly while we're quiescing (seems unlikely)
  2. This uses ShouldStop for no reason (seems likely)

Given the is 2015 code I was cargo culting, I'm going to go with 2).

Done.

Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get ahead of the testing game here by writing a mixed logical version test? I'm not sure what parts we'd want to specifically stress, but just wanted to float the general idea.

The mixed-version schema change tests that @spaskob has been working on should stress this quite a bit. I'm not sure what more I would want than that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 7, 2020

❌ The GitHub CI (Cockroach) build has failed on 1214c0e9.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 1214c0e to 1efb1d1 Compare May 7, 2020 23:56
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 8, 2020

❌ The GitHub CI (Cockroach) build has failed on 1efb1d1d.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch 2 times, most recently from 15eb36b to e5d090f Compare May 8, 2020 05:22
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 8, 2020

❌ The GitHub CI (Cockroach) build has failed on e5d090f2.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from e5d090f to dad46a3 Compare May 8, 2020 13:39
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 8, 2020

❌ The GitHub CI (Cockroach) build has failed on dad46a39.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from dad46a3 to 8f08c4f Compare May 8, 2020 17:46
Copy link
Copy Markdown

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 16 files at r6, 1 of 15 files at r12, 9 of 12 files at r13, 6 of 8 files at r14, 1 of 3 files at r15, 2 of 2 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, and @spaskob)


pkg/sql/lease.go, line 732 at r15 (raw file):

		// acquisition finishes but indicate that that new lease is expired are not
		// ignored.
		acquisitionsInProgress int

This is fixing a pre-existing race condition, right?


pkg/sql/lease.go, line 1823 at r16 (raw file):

// leases for tables received in the latest system configuration via gossip or
// rangefeeds. This function must be passed a non-nil gossip if
// VersionRangefeeedLeases is not active.

nit: Rangefeed


pkg/sql/lease.go, line 1855 at r16 (raw file):

				// Try to refresh the table lease to one >= this version.
				if err := purgeOldVersions(

I might log at level 2 or 3 somewhere around here indicating we got a new version.


pkg/sql/lease.go, line 1899 at r16 (raw file):

			// Note: It's okay that the cancelation of gossip watching is
			// asynchronous. At worst we'd get duplicate updates or stale updates.
			// Both of those are handled.

Would you mind briefly commenting on why we're guaranteed to not miss any updates when we switch over? If we pass the last seen timestamp that we got from gossip to the rangefeed, are we guaranteed to get all updates since that timestamp?


pkg/sql/lease_test.go, line 220 at r16 (raw file):

			t.cfg,
		)
		mgr.PeriodicallyRefreshSomeLeases(context.Background())

Maybe add a log tag?

@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 8f08c4f to 82d3090 Compare May 12, 2020 05:13
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 12, 2020

❌ The GitHub CI (Cockroach) build has failed on 82d30907.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Andrew Werner added 6 commits May 12, 2020 10:22
…m config

Before this commit, rangefeeds were enabled on ranges only if the cluster
setting `kv.rangefeed.enabled` was set to true. That's not going to fly if
we're going to have critical system functionality rely on rangefeeds. We
still don't want users paying for the penalty of rangefeeds unless they opt in
but we don't care about the ~5% overhead for system config operations.

Release note: None
The pickle we'll want to eventually deal with is a way of plumbing into the
system the idea that we don't need this gossip dependency.

We need to make sure that we start tracking events for a table before we
successfully send off the request to read the initial value.

TODO(ajwerner): Maybe just allow it to be nil but note that if it is nil then
you're going to have a bad time if for whatever reason this is a mixed version
cluster (i.e. return an error when the cluster version hasn't been finalized
and).

Release note: None
This knob is handy for injecting and observing rangefeed behavior.

Release note: None
This commit additionally has a bunch of cleanup of the previous commit.

Release note: None
The test would hit an assertion in the leasemanager otherwise.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/rangefeeds-for-schema-leases branch from 82d3090 to 9bf0ceb Compare May 12, 2020 14:22
Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)


pkg/sql/lease.go, line 1866 at r13 (raw file):

Previously, ajwerner wrote…

I used ShouldStop here because the old loop used ShouldStop. It seems like there are two possibilities:

  1. We have some code that wants to run unimpeded and complete a schema change cleanly while we're quiescing (seems unlikely)
  2. This uses ShouldStop for no reason (seems likely)

Given the is 2015 code I was cargo culting, I'm going to go with 2).

Done.

Done.


pkg/sql/lease.go, line 732 at r15 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

This is fixing a pre-existing race condition, right?

Correct.


pkg/sql/lease.go, line 1823 at r16 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

nit: Rangefeed

Done.


pkg/sql/lease.go, line 1855 at r16 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

I might log at level 2 or 3 somewhere around here indicating we got a new version.

Done.


pkg/sql/lease.go, line 1899 at r16 (raw file):

are we guaranteed to get all updates since that timestamp?

Indeed.
Done.


pkg/sql/lease_test.go, line 220 at r16 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Maybe add a log tag?

Done.

@ajwerner
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=lucyzhang

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 12, 2020

Build succeeded

@craig craig bot merged commit 91b2880 into cockroachdb:master May 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-multitenancy Related to multi-tenancy A-schema-descriptors Relating to SQL table/db descriptor handling.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: break gossip dep for table descriptors / table leases

5 participants