sql: use rangefeeds for notifications about changes to table descriptors#48159
Conversation
|
❌ The GitHub CI (Cockroach) build has failed on 915433e0. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
915433e to
72300f9
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 72300f9e. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
72300f9 to
06730b9
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 06730b90. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
06730b9 to
a71ad01
Compare
|
❌ The GitHub CI (Cockroach) build has failed on a71ad012. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
a71ad01 to
a9337ae
Compare
|
❌ The GitHub CI (Cockroach) build has failed on a9337aec. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
|
Exciting! |
a9337ae to
d9feebd
Compare
|
❌ The GitHub CI (Cockroach) build has failed on d9feebde. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
41f1a61 to
ff5ac79
Compare
|
❌ The GitHub CI (Cockroach) build has failed on ff5ac797. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
ff5ac79 to
348e4da
Compare
|
@spaskob @lucy-zhang - I know there's two tests left to fix and I'm working on them but it might be worth reviewing this. If the split between commits 3 and 5 are too hard to review, let me know and I'll put some time into rebasing them in a more sane way. |
|
I'm also going to add a test that ensures that it's okay to start up a lease manager with a nil gossip assuming the cluster version is appropriate. I'm deferring the work to stop gossiping the descriptors table and to remove the restriction on the |
348e4da to
2df1aea
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 2df1aea0. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
jordanlewis
left a comment
There was a problem hiding this comment.
Really cool stuff... Andrew, I realized half way through reviewing that I'm not sure whether you'd prefer reviewers to review in "atomic commits" or "full PR" style. Looks like you're going for the latter which I'm fine with but just checking.
Reviewed 7 of 7 files at r1, 3 of 3 files at r2, 7 of 7 files at r3, 3 of 3 files at r4, 3 of 3 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, and @spaskob)
pkg/kv/kvserver/client_replica_test.go, line 3142 at r1 (raw file):
} func TestProposalOverhead(t *testing.T) {
Mind adding a comment on the purpose of this test?
pkg/kv/kvserver/replica_write.go, line 573 at r1 (raw file):
batch := r.store.Engine().NewBatch() var opLogger *storage.OpLoggerBatch if r.isSystemRange() || RangefeedEnabled.Get(&r.store.cfg.Settings.SV) {
Nice, this will improve my situation for LISTEN/NOTIFY as well.
pkg/sql/lease.go, line 895 at r3 (raw file):
// // TODO(ajwerner): This really just needs to acquire a lease on the ID newer // than something. Not necessarily newer than
nit: this is an incomplete comment.
pkg/sql/lease.go, line 1795 at r3 (raw file):
// we don't know if rangefeeds are enabled. Grr. I guess the deal is that when // it gets enabled then we flip the switch? We don't want to force the nodes // to restart to get the goodness. We also want to be able to stop gossipping.
I'm not sure but I think these comments might be your internal brainstorming.
pkg/sql/lease.go, line 1919 at r3 (raw file):
// change occurred. const withDiff = false log.Infof(ctx, "starting rangefeed from %v on %v", m.initialTimestamp, span)
This might want to be a VEventf, but if not, you should probably say "starting lease rangefeed" to make it more clear.
pkg/sql/lease.go, line 1940 at r3 (raw file):
var descriptor sqlbase.Descriptor if err := ev.Value.GetProto(&descriptor); err != nil { log.Fatalf(ctx, "%s: unable to unmarshal descriptor %v", ev.Key, ev.Value)
I don't think this should be a Fatal. A corrupted descriptor would be bad, but I'd prefer not to see an unrecoverable crashloop from it.
pkg/sql/lease.go, line 1961 at r3 (raw file):
} if log.V(0) { log.Infof(ctx, "%s: refreshing lease table: %d (%s), version: %d, dropped: %t",
Should this be V(1)?
pkg/sql/lease.go, line 1977 at r3 (raw file):
case e := <-eventCh: if e.Checkpoint != nil { log.Infof(ctx, "got checkpoint %v", e.Checkpoint)
V() it?
pkg/sql/lease.go, line 2106 at r5 (raw file):
) chan struct{} { // TODO(ajwerner): Add a callback to notify about version changes. // Checking is pretty cheap but really this should be a callback.
Isn't there a cluster setting that gets changed when a version is finalized? Cluster settings have callbacks attached to them, could you use that?
pkg/sql/lease_test.go, line 2161 at r3 (raw file):
}, Server: &server.TestingKnobs{ // TODO(ajwerner): A
☝️
pkg/sql/lease_test.go, line 2172 at r3 (raw file):
db := tc.ServerConn(0) tdb := sqlutils.MakeSQLRunner(db) tdb.Exec(t, "SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms';")
Why does this need to be set?
pkg/sql/lease_test.go, line 2229 at r3 (raw file):
require.NoError(t, txA.Rollback()) const extremelyLongTime = 100 * time.Second
This is going to be kind of annoying in CI if it starts failing, can it be less than 100 secs?
I messed this PR up a bit and didn't really know what to do about it. commits 3 and 5 are super tightly coupled. I probably should just merge them. Then commit-by-commit would work. My bad. I knew it was sort of a mess. |
2df1aea to
930c38a
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 930c38a1. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
bf5705e to
64e4730
Compare
jordanlewis
left a comment
There was a problem hiding this comment.
Can we get ahead of the testing game here by writing a mixed logical version test? I'm not sure what parts we'd want to specifically stress, but just wanted to float the general idea.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, and @spaskob)
pkg/sql/lease.go, line 1940 at r3 (raw file):
Previously, ajwerner wrote…
Fair enough. Moved it to be a log.Error. Do we want to send a report or something when this happens? I don't know how to take a more drastic measure than
log.Errorthat's less drastic than fatal.
You can use log.ReportOrPanic.
pkg/sql/lease.go, line 1866 at r13 (raw file):
} case <-s.ShouldStop():
Should this be ShouldQuiesce? I've had some trouble with servers hanging when they dont' quit until the Stop signal. It's possible I dont' fully understand when to use which.
51fce16 to
1214c0e
Compare
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)
pkg/sql/lease.go, line 1940 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
You can use
log.ReportOrPanic.
Done.
pkg/sql/lease.go, line 1866 at r13 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Should this be ShouldQuiesce? I've had some trouble with servers hanging when they dont' quit until the Stop signal. It's possible I dont' fully understand when to use which.
I used ShouldStop here because the old loop used ShouldStop. It seems like there are two possibilities:
- We have some code that wants to run unimpeded and complete a schema change cleanly while we're quiescing (seems unlikely)
- This uses ShouldStop for no reason (seems likely)
Given the is 2015 code I was cargo culting, I'm going to go with 2).
Done.
ajwerner
left a comment
There was a problem hiding this comment.
Can we get ahead of the testing game here by writing a mixed logical version test? I'm not sure what parts we'd want to specifically stress, but just wanted to float the general idea.
The mixed-version schema change tests that @spaskob has been working on should stress this quite a bit. I'm not sure what more I would want than that.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)
|
❌ The GitHub CI (Cockroach) build has failed on 1214c0e9. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
1214c0e to
1efb1d1
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 1efb1d1d. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
15eb36b to
e5d090f
Compare
|
❌ The GitHub CI (Cockroach) build has failed on e5d090f2. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
e5d090f to
dad46a3
Compare
|
❌ The GitHub CI (Cockroach) build has failed on dad46a39. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
dad46a3 to
8f08c4f
Compare
thoszhang
left a comment
There was a problem hiding this comment.
Reviewed 1 of 16 files at r6, 1 of 15 files at r12, 9 of 12 files at r13, 6 of 8 files at r14, 1 of 3 files at r15, 2 of 2 files at r16.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, and @spaskob)
pkg/sql/lease.go, line 732 at r15 (raw file):
// acquisition finishes but indicate that that new lease is expired are not // ignored. acquisitionsInProgress int
This is fixing a pre-existing race condition, right?
pkg/sql/lease.go, line 1823 at r16 (raw file):
// leases for tables received in the latest system configuration via gossip or // rangefeeds. This function must be passed a non-nil gossip if // VersionRangefeeedLeases is not active.
nit: Rangefeed
pkg/sql/lease.go, line 1855 at r16 (raw file):
// Try to refresh the table lease to one >= this version. if err := purgeOldVersions(
I might log at level 2 or 3 somewhere around here indicating we got a new version.
pkg/sql/lease.go, line 1899 at r16 (raw file):
// Note: It's okay that the cancelation of gossip watching is // asynchronous. At worst we'd get duplicate updates or stale updates. // Both of those are handled.
Would you mind briefly commenting on why we're guaranteed to not miss any updates when we switch over? If we pass the last seen timestamp that we got from gossip to the rangefeed, are we guaranteed to get all updates since that timestamp?
pkg/sql/lease_test.go, line 220 at r16 (raw file):
t.cfg, ) mgr.PeriodicallyRefreshSomeLeases(context.Background())
Maybe add a log tag?
8f08c4f to
82d3090
Compare
|
❌ The GitHub CI (Cockroach) build has failed on 82d30907. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Release note: None
…m config Before this commit, rangefeeds were enabled on ranges only if the cluster setting `kv.rangefeed.enabled` was set to true. That's not going to fly if we're going to have critical system functionality rely on rangefeeds. We still don't want users paying for the penalty of rangefeeds unless they opt in but we don't care about the ~5% overhead for system config operations. Release note: None
The pickle we'll want to eventually deal with is a way of plumbing into the system the idea that we don't need this gossip dependency. We need to make sure that we start tracking events for a table before we successfully send off the request to read the initial value. TODO(ajwerner): Maybe just allow it to be nil but note that if it is nil then you're going to have a bad time if for whatever reason this is a mixed version cluster (i.e. return an error when the cluster version hasn't been finalized and). Release note: None
This knob is handy for injecting and observing rangefeed behavior. Release note: None
This commit additionally has a bunch of cleanup of the previous commit. Release note: None
The test would hit an assertion in the leasemanager otherwise. Release note: None
82d3090 to
9bf0ceb
Compare
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @lucy-zhang, and @spaskob)
pkg/sql/lease.go, line 1866 at r13 (raw file):
Previously, ajwerner wrote…
I used
ShouldStophere because the old loop usedShouldStop. It seems like there are two possibilities:
- We have some code that wants to run unimpeded and complete a schema change cleanly while we're quiescing (seems unlikely)
- This uses ShouldStop for no reason (seems likely)
Given the is 2015 code I was cargo culting, I'm going to go with
2).Done.
Done.
pkg/sql/lease.go, line 732 at r15 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
This is fixing a pre-existing race condition, right?
Correct.
pkg/sql/lease.go, line 1823 at r16 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
nit:
Rangefeed
Done.
pkg/sql/lease.go, line 1855 at r16 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
I might log at level 2 or 3 somewhere around here indicating we got a new version.
Done.
pkg/sql/lease.go, line 1899 at r16 (raw file):
are we guaranteed to get all updates since that timestamp?
Indeed.
Done.
pkg/sql/lease_test.go, line 220 at r16 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Maybe add a log tag?
Done.
|
TFTR! bors r=lucyzhang |
Build succeeded |
This PR comes in several commits:
Relates to but does not fully resolve #47150. Another PR will follow-up and disable the gossip of the descriptors table and lift the enforcement of the system config transaction anchor.