rfc: mini RFC for refactoring node IDs by RaduBerinde · Pull Request #73309 · cockroachdb/cockroach

RaduBerinde · 2021-11-30T17:51:05Z

This is a proposal for representing SQL Instance and KV Node IDs in
the CockroachDB code.

We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately.

Release note: None

cockroach-teamcity · 2021-11-30T17:51:12Z

This change is

stevendanna · 2021-12-01T15:52:01Z

docs/RFCS/20211130_node_id_refactoring.md

+```go
+// KVNodeID refers to a KV node.
+// Valid IDs start with 1.
+type KVNodeID int32


Minor nit: Using a struct wrapper here (type KVNodeID struct { id int32 }) and for the other types would disallow mistakes caused by untyped literals being used as function arguments.

stevendanna · 2021-12-01T16:06:26Z

docs/RFCS/20211130_node_id_refactoring.md

+// GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be
+// used in code that is used for both cases and is generally agnostic as to what
+// the ID means (e.g. logging code).
+// The zero value is not a valid ID. The internal integer value is positive for
+// KVNodeIDs and negative for SQLNodeIDs.
+type GenericNodeID int32


Have we considered the alternative of using an interface that both KVNodeID and SQLNodeID would implement? This RFC is concerned with unsafe implicit conversions between KVNodeID and SQLNodeID but why do we also want people to have to explicitly convert into a generic ID when calling a function that takes a generic ID?

ajwerner · 2021-12-01T16:44:02Z

docs/RFCS/20211130_node_id_refactoring.md

+- Cockroach Issue: (one or more # from the issue tracker)
+
+
+# Background


There's a sad piece of complexity I want to pile onto all of this. Hopeful we can make that complexity superficial only. Today and in this RFC there's an assumption that we're content to have all KV nodes use their node ID as the one and only ID for ~everything and then in SQL we'll use the SQLInstanceID. The SQLInstanceID used in tenant pods is leased to always provide small (<16 bit) values. The problem with node IDs is that they can get big. This breaks unique_rowid. The fix I'm interested in is having the sql nodes also grab an instance ID using the same protocol we added and use that for unique_rowid.

Anyway, if we do that, do we want a concept of a leased id as something different from either of these for use just in unique_rowid?

#47470

RaduBerinde · 2021-12-01T20:23:58Z

docs/RFCS/20211130_node_id_refactoring.md, line 39 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Minor nit: Using a struct wrapper here (type KVNodeID struct { id int32 }) and for the other types would disallow mistakes caused by untyped literals being used as function arguments.

It's true, I considered that. The problem is the existing usage of NodeID in protobufs, which we want to keep wire-compatible. I think it can be done with some custom marshalling function.

RaduBerinde · 2021-12-01T20:38:37Z

docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file):

Previously, ajwerner wrote…

There's a sad piece of complexity I want to pile onto all of this. Hopeful we can make that complexity superficial only. Today and in this RFC there's an assumption that we're content to have all KV nodes use their node ID as the one and only ID for ~everything and then in SQL we'll use the SQLInstanceID. The SQLInstanceID used in tenant pods is leased to always provide small (<16 bit) values. The problem with node IDs is that they can get big. This breaks unique_rowid. The fix I'm interested in is having the sql nodes also grab an instance ID using the same protocol we added and use that for unique_rowid.

Anyway, if we do that, do we want a concept of a leased id as something different from either of these for use just in unique_rowid?

#47470

I think I'd say that we should tolerate arbitrary SQLInstanceIDs and use the same mechanism to lease them, but tweaked so that we try to get the SQL Instance ID that equals the node ID (when it is small enough). That way in practical and testing scenarios the mapping stays trivial.

I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)

docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file):

I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries.

After the initial write to the instances table, the only thing you have to worry about losing access to shutting down queries is the sqlliveness table. That already would prevent jobs from working. It will also in the not too distant future also affect all sql queries if/when we adopt it for descriptor leases. As it stands, that's already the case for system.leases. The key difference is that system.leases defaults to 5m timeout and sqlliveness defaults to 1m. If that alone causes anxiety, I have no concerns about changing the sqlliveness timeout to also be 5m, at which point, the number of points of failure and the allowed failure duration would be the same.

FWIW, node liveness failures affect availability of legitimately all of KV and it has a timeout of 9s IIRC. I don't think the fault tolerance table of node liveness differs from that of sqlliveness in the host tenant.

RaduBerinde · 2021-12-05T14:55:38Z

docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Have we considered the alternative of using an interface that both KVNodeID and SQLNodeID would implement? This RFC is concerned with unsafe implicit conversions between KVNodeID and SQLNodeID but why do we also want people to have to explicitly convert into a generic ID when calling a function that takes a generic ID?

I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface.

knz

LGTM modulo the specific naming of the types which I still consider an open question (but we'll process the team's input on that this week)

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)

docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):

in the context of a KV RPC or a SQL (eg DistSQL) RPC. With multi-tenant support,
a new type `base.SQLInstanceID` was added. Various pieces of the code were
refactored to use this ID instead. However, there is a lot of remaining code

I wouldn't call it "a lot". I think the text of the RFC can be more specific here:

through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)
in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.

I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.

docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):

func (id KVNodeID) String() string {
  return fmt.Sprintf("kv%d", id)

I recommend we keep the n prefix - we have a lot of support tooling that expects it.

docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):

type SQLNodeID int32

func (id KVNodeID) String() string {

you mean SQLNodeID here?

docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):

// GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be

Based on our prior convo, I'd be open to "ServerID" too.

docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, RaduBerinde wrote…

I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface.

We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.

NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.

docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):

    return SQLNodeID(-id).String()
  default:
    return "0" // or "uninitialized"

We use "?" here.

docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):

## Implementation proposal

Implementing the proposed state involves changing a lot of existing code. We

I'm mildly hopeful this won't be "a lot". We'll give it a try in December.

This is a proposal for representing SQL Instance and KV Node IDs in the CockroachDB code. Release note: None

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @rimadeodhar, @stevendanna, and @tbg)

docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):

Previously, knz (kena) wrote…

I wouldn't call it "a lot". I think the text of the RFC can be more specific here:

through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)

in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.

I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.

Done.

docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):

Previously, knz (kena) wrote…

I recommend we keep the n prefix - we have a lot of support tooling that expects it.

Done.

docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):

Previously, knz (kena) wrote…

you mean SQLNodeID here?

Done.

docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):

Previously, knz (kena) wrote…

Based on our prior convo, I'd be open to "ServerID" too.

Updated the naming section proposing specific candidates: ServerID / SQLInstanceID / KVNodeID

docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, knz (kena) wrote…

We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.

NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.

Made a note of this in an Alternatives section.

docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):

Previously, knz (kena) wrote…

We use "?" here.

Done.

docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):

Previously, knz (kena) wrote…

I'm mildly hopeful this won't be "a lot". We'll give it a try in December.

Rephrased a bit.

tbg · 2021-12-07T14:28:03Z

Haven't scrutinized in depth but the approach looks good to me.

rimadeodhar

LGTM as well.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

AlexTalks

Two questions:

Is the plan to keep this in pkg/base, or in pkg/roachpb, where NodeID is currently defined?
What is the plan for refactoring wrappers, such as base.SQLIDContainer? (this may be out of scope here, just wanted to ask)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

RaduBerinde

Is the plan to keep this in pkg/base, or in pkg/roachpb, where NodeID is currently defined?

Not sure.. Probably base

What is the plan for refactoring wrappers, such as base.SQLIDContainer? (this may be out of scope here, just wanted to ask)

I think the details are TBD, but I'm thinking there is one container implementation which uses GenericNodeID and the KVNodeContainer and SQLNodeContainer add convenience methods on top of that.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

AlexTalks

My main feedback about the location (which, again, may be out of scope of this mini RFC), is that putting it anywhere other than pkg/roachpb will limit its use in some places, since we can't use it as a replacement in a place where we might otherwise use NodeID.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

knz · 2021-12-09T21:16:55Z

To me it was completely going without saying the new type would be in base/node_id.go, alongside (in replacement to) the other types currently.

knz · 2021-12-09T21:20:11Z

Sorry I meant the base type in roachpb, the container in base, as currently.

RaduBerinde

Ah my mistake, I didn't realize base is actually on top of roachpb 😅

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

AlexTalks

And storage/enginepb is below even that 😕 despite the fact that we are adding what perhaps could be a GenericNodeID there in #73500, I'm not sure it makes sense to refactor these types down to that level in the dependency chain, especially for what is essentially an opaque int32 to lower levels of the stack.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

knz

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, and @stevendanna)

docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):

       which sets up the `SQLNodeID` to equal the `KVNodeID`.
     * the DistSQL planner which knows that we need to deploy a flow to the
       `SQLNodeID` with the same numeric value as the leasehoder's `KVNodeID`.

nit: add a qualifier to this sentence "in non-multitenant deployments"

(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):

Previously, knz (kena) wrote…

nit: add a qualifier to this sentence "in non-multitenant deployments"

(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)

Isn't the sentence below enough ("this code does not run for non-system tenants")?

There are no definite plans yet for tenants. One possibility is to pick sql instances on different AZs and split ranges according to the leaseholder AZs. In any case, there will be one piece of code which will use the leaseholder information (KVNodeID) as input and produce SQLNodeIDs as output.

knz · 2021-12-15T13:41:48Z

ok yes that works

RaduBerinde · 2021-12-15T17:09:07Z

Should I merge this or keep it open while implementation is underway? If merge, should we finalize the names? KVNodeID, SQLInstanceID, ServerID?

RaduBerinde · 2021-12-16T19:00:43Z

We discussed in the team meeting that we can merge and can always change it after the fact if we find better names. TFTRs!

bors r+

craig · 2021-12-16T19:30:34Z

Build failed (retrying...):

GitHub CI (Cockroach)

73309: rfc: mini RFC for refactoring node IDs r=RaduBerinde a=RaduBerinde This is a proposal for representing SQL Instance and KV Node IDs in the CockroachDB code. We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately. Release note: None Co-authored-by: Radu Berinde <radu@cockroachlabs.com>

craig · 2021-12-16T21:04:45Z

Build failed:

GitHub CI (Cockroach)

AlexTalks

Perhaps not directly described in this RFC, but wanted to comment here that if at all possible, KV would like to request that we refactor these ID types to be in a separate package that has no dependencies (e.g. pkg/util/nodeid) so that they can be used in any package, similar to how HLC Timestamps are defined in pkg/util/hlc/timestamp.proto. This way, they can be used in packages that cannot depend on roachpb (such as storage/enginepb).

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

RaduBerinde · 2021-12-17T17:30:34Z

^ That makes sense to me.

bors r+

craig · 2021-12-17T18:32:29Z

Build succeeded:

GitHub CI (Cockroach)

RaduBerinde requested review from ajwerner, knz, nvb, stevendanna and tbg November 30, 2021 17:51

RaduBerinde requested a review from a team as a code owner November 30, 2021 17:51

RaduBerinde force-pushed the node-id-rfc branch from a10fcec to 4e82767 Compare November 30, 2021 18:54

RaduBerinde requested a review from rimadeodhar November 30, 2021 23:35

stevendanna reviewed Dec 1, 2021

View reviewed changes

ajwerner reviewed Dec 1, 2021

View reviewed changes

knz approved these changes Dec 6, 2021

View reviewed changes

rfc: mini RFC for refactoring node IDs

3aa6473

This is a proposal for representing SQL Instance and KV Node IDs in the CockroachDB code. Release note: None

RaduBerinde force-pushed the node-id-rfc branch from 4e82767 to 3aa6473 Compare December 6, 2021 21:16

RaduBerinde commented Dec 6, 2021

View reviewed changes

tbg removed their request for review December 7, 2021 14:27

rimadeodhar reviewed Dec 7, 2021

View reviewed changes

AlexTalks reviewed Dec 8, 2021

View reviewed changes

RaduBerinde commented Dec 8, 2021

View reviewed changes

AlexTalks reviewed Dec 9, 2021

View reviewed changes

RaduBerinde commented Dec 9, 2021

View reviewed changes

AlexTalks reviewed Dec 10, 2021

View reviewed changes

knz reviewed Dec 14, 2021

View reviewed changes

RaduBerinde commented Dec 14, 2021

View reviewed changes

AlexTalks reviewed Dec 17, 2021

View reviewed changes

craig bot merged commit a7ae55e into cockroachdb:master Dec 17, 2021

RaduBerinde mentioned this pull request Dec 17, 2021

roachpb: make TenantID a proto message #73677

Merged

RaduBerinde deleted the node-id-rfc branch December 21, 2021 21:02

		- Cockroach Issue: (one or more # from the issue tracker)


		# Background

Conversation

RaduBerinde commented Nov 30, 2021

Uh oh!

cockroach-teamcity commented Nov 30, 2021

Uh oh!

stevendanna Dec 1, 2021

Choose a reason for hiding this comment

Uh oh!

stevendanna Dec 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajwerner Dec 1, 2021

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Dec 1, 2021

Uh oh!

RaduBerinde commented Dec 1, 2021

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Dec 5, 2021

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

tbg commented Dec 7, 2021

Uh oh!

rimadeodhar left a comment

Choose a reason for hiding this comment

Uh oh!

AlexTalks left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

AlexTalks left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Dec 9, 2021

Uh oh!

knz commented Dec 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

AlexTalks left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Dec 15, 2021

Uh oh!

RaduBerinde commented Dec 15, 2021

Uh oh!

RaduBerinde commented Dec 16, 2021

Uh oh!

craig bot commented Dec 16, 2021

Uh oh!

craig bot commented Dec 16, 2021

Uh oh!

AlexTalks left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Dec 17, 2021

Uh oh!

craig bot commented Dec 17, 2021

Uh oh!

Reviewers

stevendanna Dec 1, 2021 •

edited

Loading

knz commented Dec 9, 2021 •

edited

Loading

AlexTalks left a comment •

edited

Loading