Skip to content

rfc: mini RFC for refactoring node IDs#73309

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:node-id-rfc
Dec 17, 2021
Merged

rfc: mini RFC for refactoring node IDs#73309
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:node-id-rfc

Conversation

@RaduBerinde
Copy link
Copy Markdown
Member

This is a proposal for representing SQL Instance and KV Node IDs in
the CockroachDB code.

We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately.

Release note: None

@RaduBerinde RaduBerinde requested a review from a team as a code owner November 30, 2021 17:51
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

```go
// KVNodeID refers to a KV node.
// Valid IDs start with 1.
type KVNodeID int32
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: Using a struct wrapper here (type KVNodeID struct { id int32 }) and for the other types would disallow mistakes caused by untyped literals being used as function arguments.

Comment on lines +56 to +70
// GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be
// used in code that is used for both cases and is generally agnostic as to what
// the ID means (e.g. logging code).
// The zero value is not a valid ID. The internal integer value is positive for
// KVNodeIDs and negative for SQLNodeIDs.
type GenericNodeID int32
Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered the alternative of using an interface that both KVNodeID and SQLNodeID would implement? This RFC is concerned with unsafe implicit conversions between KVNodeID and SQLNodeID but why do we also want people to have to explicitly convert into a generic ID when calling a function that takes a generic ID?

- Cockroach Issue: (one or more # from the issue tracker)


# Background
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a sad piece of complexity I want to pile onto all of this. Hopeful we can make that complexity superficial only. Today and in this RFC there's an assumption that we're content to have all KV nodes use their node ID as the one and only ID for ~everything and then in SQL we'll use the SQLInstanceID. The SQLInstanceID used in tenant pods is leased to always provide small (<16 bit) values. The problem with node IDs is that they can get big. This breaks unique_rowid. The fix I'm interested in is having the sql nodes also grab an instance ID using the same protocol we added and use that for unique_rowid.

Anyway, if we do that, do we want a concept of a leased id as something different from either of these for use just in unique_rowid?

#47470

@RaduBerinde
Copy link
Copy Markdown
Member Author


docs/RFCS/20211130_node_id_refactoring.md, line 39 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Minor nit: Using a struct wrapper here (type KVNodeID struct { id int32 }) and for the other types would disallow mistakes caused by untyped literals being used as function arguments.

It's true, I considered that. The problem is the existing usage of NodeID in protobufs, which we want to keep wire-compatible. I think it can be done with some custom marshalling function.

@RaduBerinde
Copy link
Copy Markdown
Member Author


docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file):

Previously, ajwerner wrote…

There's a sad piece of complexity I want to pile onto all of this. Hopeful we can make that complexity superficial only. Today and in this RFC there's an assumption that we're content to have all KV nodes use their node ID as the one and only ID for ~everything and then in SQL we'll use the SQLInstanceID. The SQLInstanceID used in tenant pods is leased to always provide small (<16 bit) values. The problem with node IDs is that they can get big. This breaks unique_rowid. The fix I'm interested in is having the sql nodes also grab an instance ID using the same protocol we added and use that for unique_rowid.

Anyway, if we do that, do we want a concept of a leased id as something different from either of these for use just in unique_rowid?

#47470

I think I'd say that we should tolerate arbitrary SQLInstanceIDs and use the same mechanism to lease them, but tweaked so that we try to get the SQL Instance ID that equals the node ID (when it is small enough). That way in practical and testing scenarios the mapping stays trivial.

I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)


docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file):

I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries.

After the initial write to the instances table, the only thing you have to worry about losing access to shutting down queries is the sqlliveness table. That already would prevent jobs from working. It will also in the not too distant future also affect all sql queries if/when we adopt it for descriptor leases. As it stands, that's already the case for system.leases. The key difference is that system.leases defaults to 5m timeout and sqlliveness defaults to 1m. If that alone causes anxiety, I have no concerns about changing the sqlliveness timeout to also be 5m, at which point, the number of points of failure and the allowed failure duration would be the same.

FWIW, node liveness failures affect availability of legitimately all of KV and it has a timeout of 9s IIRC. I don't think the fault tolerance table of node liveness differs from that of sqlliveness in the host tenant.

@RaduBerinde
Copy link
Copy Markdown
Member Author


docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Have we considered the alternative of using an interface that both KVNodeID and SQLNodeID would implement? This RFC is concerned with unsafe implicit conversions between KVNodeID and SQLNodeID but why do we also want people to have to explicitly convert into a generic ID when calling a function that takes a generic ID?

I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the specific naming of the types which I still consider an open question (but we'll process the team's input on that this week)

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)


docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):

in the context of a KV RPC or a SQL (eg DistSQL) RPC. With multi-tenant support,
a new type `base.SQLInstanceID` was added. Various pieces of the code were
refactored to use this ID instead. However, there is a lot of remaining code

I wouldn't call it "a lot". I think the text of the RFC can be more specific here:

  • through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)
  • in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.

I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.


docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):

func (id KVNodeID) String() string {
  return fmt.Sprintf("kv%d", id)

I recommend we keep the n prefix - we have a lot of support tooling that expects it.


docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):

type SQLNodeID int32

func (id KVNodeID) String() string {

you mean SQLNodeID here?


docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):

// GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be

Based on our prior convo, I'd be open to "ServerID" too.


docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, RaduBerinde wrote…

I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface.

We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.

NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.


docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):

    return SQLNodeID(-id).String()
  default:
    return "0" // or "uninitialized"

We use "?" here.


docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):

## Implementation proposal

Implementing the proposed state involves changing a lot of existing code. We

I'm mildly hopeful this won't be "a lot". We'll give it a try in December.

This is a proposal for representing SQL Instance and KV Node IDs in
the CockroachDB code.

Release note: None
Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @rimadeodhar, @stevendanna, and @tbg)


docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):

Previously, knz (kena) wrote…

I wouldn't call it "a lot". I think the text of the RFC can be more specific here:

  • through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)
  • in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.

I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.

Done.


docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):

Previously, knz (kena) wrote…

I recommend we keep the n prefix - we have a lot of support tooling that expects it.

Done.


docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):

Previously, knz (kena) wrote…

you mean SQLNodeID here?

Done.


docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):

Previously, knz (kena) wrote…

Based on our prior convo, I'd be open to "ServerID" too.

Updated the naming section proposing specific candidates: ServerID / SQLInstanceID / KVNodeID


docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):

Previously, knz (kena) wrote…

We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.

NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.

Made a note of this in an Alternatives section.


docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):

Previously, knz (kena) wrote…

We use "?" here.

Done.


docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):

Previously, knz (kena) wrote…

I'm mildly hopeful this won't be "a lot". We'll give it a try in December.

Rephrased a bit.

@tbg tbg removed their request for review December 7, 2021 14:27
@tbg
Copy link
Copy Markdown
Member

tbg commented Dec 7, 2021

Haven't scrutinized in depth but the approach looks good to me.

Copy link
Copy Markdown
Collaborator

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

Copy link
Copy Markdown
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions:

  1. Is the plan to keep this in pkg/base, or in pkg/roachpb, where NodeID is currently defined?
  2. What is the plan for refactoring wrappers, such as base.SQLIDContainer? (this may be out of scope here, just wanted to ask)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan to keep this in pkg/base, or in pkg/roachpb, where NodeID is currently defined?

Not sure.. Probably base

  1. What is the plan for refactoring wrappers, such as base.SQLIDContainer? (this may be out of scope here, just wanted to ask)

I think the details are TBD, but I'm thinking there is one container implementation which uses GenericNodeID and the KVNodeContainer and SQLNodeContainer add convenience methods on top of that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

Copy link
Copy Markdown
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main feedback about the location (which, again, may be out of scope of this mini RFC), is that putting it anywhere other than pkg/roachpb will limit its use in some places, since we can't use it as a replacement in a place where we might otherwise use NodeID.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 9, 2021

To me it was completely going without saying the new type would be in base/node_id.go, alongside (in replacement to) the other types currently.

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 9, 2021

Sorry I meant the base type in roachpb, the container in base, as currently.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my mistake, I didn't realize base is actually on top of roachpb 😅

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

Copy link
Copy Markdown
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And storage/enginepb is below even that 😕 despite the fact that we are adding what perhaps could be a GenericNodeID there in #73500, I'm not sure it makes sense to refactor these types down to that level in the dependency chain, especially for what is essentially an opaque int32 to lower levels of the stack.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, and @stevendanna)


docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):

       which sets up the `SQLNodeID` to equal the `KVNodeID`.
     * the DistSQL planner which knows that we need to deploy a flow to the
       `SQLNodeID` with the same numeric value as the leasehoder's `KVNodeID`.

nit: add a qualifier to this sentence "in non-multitenant deployments"

(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)


docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):

Previously, knz (kena) wrote…

nit: add a qualifier to this sentence "in non-multitenant deployments"

(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)

Isn't the sentence below enough ("this code does not run for non-system tenants")?

There are no definite plans yet for tenants. One possibility is to pick sql instances on different AZs and split ranges according to the leaseholder AZs. In any case, there will be one piece of code which will use the leaseholder information (KVNodeID) as input and produce SQLNodeIDs as output.

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 15, 2021

ok yes that works

@RaduBerinde
Copy link
Copy Markdown
Member Author

Should I merge this or keep it open while implementation is underway? If merge, should we finalize the names? KVNodeID, SQLInstanceID, ServerID?

@RaduBerinde
Copy link
Copy Markdown
Member Author

We discussed in the team meeting that we can merge and can always change it after the fact if we find better names. TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 16, 2021

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Dec 16, 2021
73309: rfc: mini RFC for refactoring node IDs r=RaduBerinde a=RaduBerinde

This is a proposal for representing SQL Instance and KV Node IDs in
the CockroachDB code.

We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately.

Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 16, 2021

Build failed:

Copy link
Copy Markdown
Contributor

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not directly described in this RFC, but wanted to comment here that if at all possible, KV would like to request that we refactor these ID types to be in a separate package that has no dependencies (e.g. pkg/util/nodeid) so that they can be used in any package, similar to how HLC Timestamps are defined in pkg/util/hlc/timestamp.proto. This way, they can be used in packages that cannot depend on roachpb (such as storage/enginepb).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)

@RaduBerinde
Copy link
Copy Markdown
Member Author

^ That makes sense to me.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 17, 2021

Build succeeded:

@craig craig bot merged commit a7ae55e into cockroachdb:master Dec 17, 2021
@RaduBerinde RaduBerinde deleted the node-id-rfc branch December 21, 2021 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants