migration: introduce various other bells and whistles...#57650
migration: introduce various other bells and whistles...#57650craig[bot] merged 10 commits intocockroachdb:masterfrom
Conversation
5ee7939 to
d6cad63
Compare
tbg
left a comment
There was a problem hiding this comment.
Not done yet, but flushing comments in case I won't be able to revisit for the rest of the day.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 3 of 3 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/helper_test.go, line 42 at r7 (raw file):
TestingNewCluster
Does this have to be exported?
tbg
left a comment
There was a problem hiding this comment.
Reviewed 7 of 7 files at r7, 2 of 2 files at r8, 4 of 4 files at r9.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @TheSamHuang)
pkg/migration/client_test.go, line 59 at r9 (raw file):
// TODO(irfansharif): We always seem to include a second copy of r1's // desc. Unsure why.
Hmm, let's figure this out. Is it because of meta1 somehow?
pkg/migration/helper.go, line 124 at r8 (raw file):
} client := serverpb.NewMigrationClient(conn) if err := fn(ctx, client); err != nil {
return fn(ctx, client)?
pkg/migration/helper_test.go, line 92 at r7 (raw file):
t.Run("with-node-down", func(t *testing.T) { // Down a node mid-way through execution. We expect EveryNode to error
This isn't half-way through, right? I would've expected you to down the node in the EveryNode closure.
pkg/migration/helper_test.go, line 184 at r7 (raw file):
type testClusterImpl struct { ns nodes err error
// returned from ???
pkg/migration/helper_test.go, line 236 at r7 (raw file):
} func (t *testClusterImpl) downNode(id int) {
Hmm, this is sort of clunky. Now we're not testing (clusterImpl).nodes at all but it kind of makes it seem like we did. Can we factor out the meat in that method and call into it here? Then we could mark id as dead in our fake liveness instance instead and the error would occur naturally in the real code that emits it.
pkg/migration/util.go, line 38 at r7 (raw file):
// and if not, what changed. The textual diff is only to be used for logging // purposes. func (ns nodes) identical(other nodes) (ok bool, diff string) {
Can this take a *redact.StringBuilder instead of returning a string?
pkg/migration/util.go, line 41 at r7 (raw file):
a, b := ns, other if len(a) != len(b) { if len(a) < len(b) {
This isn't quite right, it could have been two nodes added, one removed, etc.
pkg/migration/util.go, line 47 at r7 (raw file):
} return false, diff }
Could something like this be easier? (It has newline problems when there are multiple mismatches, but that could be fixed)
type ent struct { count int; n node, epochChanged bool }
m := map[NodeID]ent{}
for _, n := range ns {
m[n.id] = ent{count:1, n: n, epochChanged: false}
}
for _, n := range other {
e, ok := m[n.id]
e.count--
if ok && e.n.epoch != n.epoch {
e.epochChanged = true
}
m[n.id] = e
}
for _, e := range m {
if e.epochChanged {
buf.Fprint("n%d: epoch changed", n.id)
continue
}
if e.count > 0 {
buf.Fprint("n%d: node decommissioned", n.id)
} else if e.count < 0 {
buf.Fprint("n%d: node joined cluster", n.id)
}
}
return len(buf.Bytes()) == 0
}
tbg
left a comment
There was a problem hiding this comment.
Some more flushing. I did look at everything! Just need to figure out the double descriptor.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @TheSamHuang)
d6cad63 to
d8f862f
Compare
Release note: None
Makes for better logging. Release note: None
We can separate out the `Helper`, `Migration`, and various utilities into their own files. We'll add tests for individual components in future commits; the physical separation here sets the foundation for doing so (prototyped in cockroachdb#57445). This commit is purely code movement. Release note: None
It's clearer to talk explicitly in terms of causality. Release note: None
We re-define what the Migration type is to be able to annotate it a description. We'll later use this description when populating the `system.migrations` table (originally prototyped in cockroachdb#57445). Release note: None
We make it a bit more ergonomic (this revision was originally prototyped in cockroachdb#57445). Release note: None
d8f862f to
b3f7aa4
Compare
irfansharif
left a comment
There was a problem hiding this comment.
Addressed the blemishes, PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @tbg, and @TheSamHuang)
pkg/migration/client_test.go, line 59 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, let's figure this out. Is it because of meta1 somehow?
Yup, I cargo-culted from
Line 1685 in 8bfc745
pkg/migration/helper.go, line 124 at r8 (raw file):
Previously, tbg (Tobias Grieger) wrote…
return fn(ctx, client)?
Done.
pkg/migration/helper_test.go, line 42 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
TestingNewClusterDoes this have to be exported?
yup, we use this in the migration_test files. We needed those tests to sit outside pkg/migration to avoid circular dependencies when using test server (pkg/migration -> pkg/testutils/serverutils -> pkg/server -> pkg/migration).
pkg/migration/helper_test.go, line 92 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't half-way through, right? I would've expected you to down the node in the EveryNode closure.
Done.
pkg/migration/helper_test.go, line 184 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// returned from ???
Done.
pkg/migration/helper_test.go, line 236 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, this is sort of clunky. Now we're not testing
(clusterImpl).nodesat all but it kind of makes it seem like we did. Can we factor out the meat in that method and call into it here? Then we could markidas dead in our fake liveness instance instead and the error would occur naturally in the real code that emits it.
Figured out a way to mock the deadness at the liveness level instead.
pkg/migration/util.go, line 38 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can this take a
*redact.StringBuilderinstead of returning a string?
Yea, but I like keeping the signature as is.
pkg/migration/util.go, line 41 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't quite right, it could have been two nodes added, one removed, etc.
Done.
pkg/migration/util.go, line 47 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could something like this be easier? (It has newline problems when there are multiple mismatches, but that could be fixed)
type ent struct { count int; n node, epochChanged bool } m := map[NodeID]ent{} for _, n := range ns { m[n.id] = ent{count:1, n: n, epochChanged: false} } for _, n := range other { e, ok := m[n.id] e.count-- if ok && e.n.epoch != n.epoch { e.epochChanged = true } m[n.id] = e } for _, e := range m { if e.epochChanged { buf.Fprint("n%d: epoch changed", n.id) continue } if e.count > 0 { buf.Fprint("n%d: node decommissioned", n.id) } else if e.count < 0 { buf.Fprint("n%d: node joined cluster", n.id) } } return len(buf.Bytes()) == 0 } </blockquote></details> Ooo, like it. Done. <!-- Sent from Reviewable.io -->
When scanning the KV keyspace for all range descriptors, we want to scan from Meta2Prefix to MetaMax (see meta2RangeIter for other usages of this). In our decomm status codepaths we were erroneously starting off our scan at MetaMin, so including meta1 in our scans as well. This has the effect of iterating over the range descriptor for r1 twice (first observed in cockroachdb#57650 where we cargo-culted these bounds). Release note (bug fix): When displaying replica counts during the decommissioning process, previously we were overcounting the replicas displayed for r1. This is no longer the case.
57812: server: fix decomm status replica overcounting for r1 r=irfansharif a=irfansharif When scanning the KV keyspace for all range descriptors, we want to scan from Meta2Prefix to MetaMax (see meta2RangeIter for other usages of this). In our decomm status codepaths we were erroneously starting off our scan at MetaMin, so including meta1 in our scans as well. This has the effect of iterating over the range descriptor for r1 twice (first observed in #57650 where we cargo-culted these bounds). Release note (bug fix): When displaying replica counts during the decommissioning process, previously we were overcounting the replicas displayed for r1. This is no longer the case. Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
knz
left a comment
There was a problem hiding this comment.
I think r14 and r15 should be squashed: they look like they're moving code from one file to another, but in r14 the code is duplicated. (Or is my reading wrong?) I'd recommend trying to build at every commit individually.
Reviewed 1 of 1 files at r1, 11 of 11 files at r10, 1 of 1 files at r11, 5 of 5 files at r12, 1 of 1 files at r13, 1 of 1 files at r14, 3 of 3 files at r15, 7 of 7 files at r16, 2 of 2 files at r17, 4 of 4 files at r18.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @tbg, and @TheSamHuang)
pkg/migration/helper.go, line 121 at r17 (raw file):
id := node.id // copy out of the loop variable grp.GoCtx(func(ctx context.Context) error { conn, err := h.c.dial(ctx, id)
now that's too much concurrency. I think this needs a concurrency token pool.
pkg/migration/util.go, line 47 at r7 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Ooo, like it. Done.
I don't see it done. Or maybe you forgot to push something?
pkg/server/migration.go, line 43 at r11 (raw file):
ctx context.Context, req *serverpb.ValidateTargetClusterVersionRequest, ) (*serverpb.ValidateTargetClusterVersionResponse, error) { ctx, span := m.server.AnnotateCtxWithSpan(ctx, "validate-cv")
here and below: can you expand -cv into -cluster-version. These strings are cheap and it will make the logs easier to read.
tbg
left a comment
There was a problem hiding this comment.
I have no further comments, ready whenever @knz is.
Reviewed 11 of 11 files at r10, 1 of 1 files at r11, 5 of 5 files at r12, 1 of 1 files at r13, 1 of 1 files at r14, 3 of 3 files at r15, 7 of 7 files at r16, 2 of 2 files at r17, 4 of 4 files at r18.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @TheSamHuang)
pkg/migration/helper.go, line 121 at r17 (raw file):
Previously, knz (kena) wrote…
now that's too much concurrency. I think this needs a concurrency token pool.
🤔 good point,so I guess we need a magic constant. 50 sounds fine? We're only paying a grpc request per goroutine here.
pkg/migration/util.go, line 38 at r7 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Yea, but I like keeping the signature as is.
Works for me if it works for @knz, though I don't understand why you "like" this more than taking a redact.StringBuilder.
b3f7aa4 to
11a2e5f
Compare
irfansharif
left a comment
There was a problem hiding this comment.
r14 shows up as the intended comment-only commit for me, and r15 something else. Probably some reviewable funkiness? Also did try building+linting at each commit individually.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @tbg, and @TheSamHuang)
pkg/migration/helper.go, line 121 at r17 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🤔 good point,so I guess we need a magic constant. 50 sounds fine? We're only paying a grpc request per goroutine here.
Done. We're coming full circle using the quotapool I worked on way back when: #15802.
pkg/migration/util.go, line 47 at r7 (raw file):
Previously, knz (kena) wrote…
I don't see it done. Or maybe you forgot to push something?
It is done, it shows up between r15-r16 for me on reviewable, and here on github (commit 7).
pkg/server/migration.go, line 43 at r11 (raw file):
Previously, knz (kena) wrote…
here and below: can you expand
-cvinto-cluster-version. These strings are cheap and it will make the logs easier to read.
Done.
tbg
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r19, 4 of 4 files at r20, 2 of 2 files at r21.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @TheSamHuang)
pkg/migration/helper.go, line 124 at r19 (raw file):
id := node.id // copy out of the loop variable grp.GoCtx(func(ctx context.Context) error { alloc, err := qp.Acquire(ctx, 1)
This is still unbounded concurrency (the unbounded now sits on Acquire instead of in the RPC call). I think you want to acquire before calling GoCtx and keep releasing inside (note the usual "local in a loop" footgun). The other way to make this work is to spawn 25 goroutines upfront and have them consume from a channel, but I like about the current approach that it starts only the goroutines it needs.
knz
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r20, 1 of 2 files at r21.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @TheSamHuang)
pkg/migration/util.go, line 92 at r16 (raw file):
// SafeFormat implements redact.SafeFormatter. func (ns nodes) SafeFormat(s redact.SafePrinter, _ rune) { s.SafeString(redact.SafeString(ns.String()))
No that really does not work for me, like, at all.
The very point of the design of the API is to ensure that future changes around the code do not run the risk of introducing unsafe strings. Your usage of SafeString() here is completely at odds with that. You just break all the protections.
The desired idiom is the other way around:
- implement the pretty-printing logic in
SafeFormat - only then implement
String()asreturn redact.StringWithoutMarkers(self)
11a2e5f to
cf9254f
Compare
irfansharif
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @tbg, and @TheSamHuang)
pkg/migration/helper.go, line 124 at r19 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is still unbounded concurrency (the unbounded now sits on
Acquireinstead of in the RPC call). I think you want to acquire before callingGoCtxand keep releasing inside (note the usual "local in a loop" footgun). The other way to make this work is to spawn 25 goroutines upfront and have them consume from a channel, but I like about the current approach that it starts only the goroutines it needs.
gah, good catch. Done.
pkg/migration/util.go, line 92 at r16 (raw file):
Previously, knz (kena) wrote…
No that really does not work for me, like, at all.
The very point of the design of the API is to ensure that future changes around the code do not run the risk of introducing unsafe strings. Your usage ofSafeString()here is completely at odds with that. You just break all the protections.The desired idiom is the other way around:
- implement the pretty-printing logic in
SafeFormat- only then implement
String()asreturn redact.StringWithoutMarkers(self)
Doh, done. That makes a lot more sense. I'll admit (as is glaringly obvious), this was done entirely without thinking.
knz
left a comment
There was a problem hiding this comment.
Reviewed 1 of 7 files at r22, 4 of 4 files at r24, 2 of 2 files at r25.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/helper.go, line 150 at r24 (raw file):
if ok, diff := ns.identical(curNodes); !ok { log.Infof(ctx, "[%s], retrying", redact.Safe(diff))
log.Infof(ctx, "[%s], retrying", diff)pkg/migration/util.go, line 38 at r22 (raw file):
// and if not, what changed (in terms of cluster membership operations and epoch // changes). The textual diff is only to be used for logging purposes. func (ns nodes) identical(other nodes) (ok bool, diff string) {
diff redact.RedactableString
pkg/migration/util.go, line 62 at r22 (raw file):
for id, e := range m { if e.epochChanged { diffs = append(diffs, fmt.Sprintf("n%d's epoch changed", id))
redact.Sprintf(...) here
pkg/migration/util.go, line 65 at r22 (raw file):
} if e.count > 0 { diffs = append(diffs, fmt.Sprintf("n%d was decommissioned", id))
ditto
pkg/migration/util.go, line 68 at r22 (raw file):
} if e.count < 0 { diffs = append(diffs, fmt.Sprintf("n%d joined the cluster", id))
ditto
pkg/migration/util.go, line 72 at r22 (raw file):
} sort.Strings(diffs)
hold on for a bit and you'll have redact.SortStrings here
pkg/migration/util.go, line 82 at r22 (raw file):
// SafeFormat implements redact.SafeFormatter. func (ns nodes) SafeFormat(s redact.SafePrinter, _ rune) { var b strings.Builder
ok try this:
s.SafeString("n{")
if len(ns) > 0 {
s.Printf("%d", ns[0].id)
for _, node := range ns[1:] {
s.Printf(",%d", node.id)
}
}
s.SafeString("}") // or b.SafeRune('}')
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/util.go, line 73 at r22 (raw file):
sort.Strings(diffs) return len(diffs) == 0, strings.Join(diffs, ", ")
I'd also suggest bluntly returning the slice here, instead of joining it (and sorting it)
let the caller decide what it wants
In the particular log call that uses this today, we don't even need to sort!
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/util.go, line 72 at r22 (raw file):
Previously, knz (kena) wrote…
hold on for a bit and you'll have
redact.SortStringshere
cf9254f to
e876d9b
Compare
irfansharif
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/helper.go, line 150 at r24 (raw file):
Previously, knz (kena) wrote…
log.Infof(ctx, "[%s], retrying", diff)
Done.
pkg/migration/util.go, line 38 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Works for me if it works for @knz, though I don't understand why you "like" this more than taking a
redact.StringBuilder.
I've now changed it as originally suggested, but I'll take the podium here anyway. I think my preference came from wanting my base utility functions (like this contextual diff generator) to not need to know or care about "redactability" whatsoever. A function signature with the named return type diff string to me is ever-so-slightly more understandable than seeing our redaction package being plumbed into it. As for the point at which we're logging the diff, I'm happy to mark the string as redact.Safe by hand, as opposed to plumbing a redact.<type> all throughout up until that point.
Though I'll admit I'm only now realizing this is at odds with us wanting all logging to be safe by construction/non-redaction to be something opted out of.
pkg/migration/util.go, line 72 at r22 (raw file):
Previously, knz (kena) wrote…
I'll skip the string sorting in this PR, you're right that we don't actually need it.
pkg/migration/util.go, line 73 at r22 (raw file):
Previously, knz (kena) wrote…
I'd also suggest bluntly returning the slice here, instead of joining it (and sorting it)
let the caller decide what it wantsIn the particular
logcall that uses this today, we don't even need to sort!
Done. Thanks for walking me through this 🙌
pkg/migration/util.go, line 82 at r22 (raw file):
Previously, knz (kena) wrote…
ok try this:
s.SafeString("n{") if len(ns) > 0 { s.Printf("%d", ns[0].id) for _, node := range ns[1:] { s.Printf(",%d", node.id) } } s.SafeString("}") // or b.SafeRune('}') </blockquote></details> Done. <!-- Sent from Reviewable.io -->
knz
left a comment
There was a problem hiding this comment.
Reviewed 2 of 8 files at r26, 4 of 4 files at r28, 2 of 2 files at r29.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/migration/util.go, line 85 at r27 (raw file):
} } s.SafeString("}") // or b.SafeRune('}')
you get to choose which of the two you want, no need to keep the comment
To facilitate testing `Helper` in isolation, we introduce a `cluster` interface that we'll mock out in tests. It's through this interface that the migration infrastructure will able to dial out to a specific node, grab hold of a kv.DB instance, and retrieve the current cluster membership. Part of diff also downgrades `RequiredNodes` from being a first class primitive, instead tucking it away for internal usage only. Given retrieving the cluster membership made no guarantees about new nodes being added to the cluster, it's entirely possible for that to happen concurrently with it. Appropriate usage then entailed wrapping it under a stabilizing loop, like we do so in `EveryNode`. This tells us there's no need to expose it directly to migration authors. Release note: None
Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
Release note: None
e876d9b to
a0c5160
Compare
|
phew, TFTRs! bors r+ |
|
Build failed: |
|
Looks like bors retry |
|
Build succeeded: |
...to pave the way for real migrations
This PR cannibalizes most of our prototype in #57445 and shimmies in a few
interfaces to make it all unit-testable. While here, we:
specific metadata (it also paves the way for logging the progress of an
ongoing migration to a system table).
See individual commits for details.