rfc: mini RFC for refactoring node IDs#73309
Conversation
a10fcec to
4e82767
Compare
| ```go | ||
| // KVNodeID refers to a KV node. | ||
| // Valid IDs start with 1. | ||
| type KVNodeID int32 |
There was a problem hiding this comment.
Minor nit: Using a struct wrapper here (type KVNodeID struct { id int32 }) and for the other types would disallow mistakes caused by untyped literals being used as function arguments.
| // GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be | ||
| // used in code that is used for both cases and is generally agnostic as to what | ||
| // the ID means (e.g. logging code). | ||
| // The zero value is not a valid ID. The internal integer value is positive for | ||
| // KVNodeIDs and negative for SQLNodeIDs. | ||
| type GenericNodeID int32 |
There was a problem hiding this comment.
Have we considered the alternative of using an interface that both KVNodeID and SQLNodeID would implement? This RFC is concerned with unsafe implicit conversions between KVNodeID and SQLNodeID but why do we also want people to have to explicitly convert into a generic ID when calling a function that takes a generic ID?
| - Cockroach Issue: (one or more # from the issue tracker) | ||
|
|
||
|
|
||
| # Background |
There was a problem hiding this comment.
There's a sad piece of complexity I want to pile onto all of this. Hopeful we can make that complexity superficial only. Today and in this RFC there's an assumption that we're content to have all KV nodes use their node ID as the one and only ID for ~everything and then in SQL we'll use the SQLInstanceID. The SQLInstanceID used in tenant pods is leased to always provide small (<16 bit) values. The problem with node IDs is that they can get big. This breaks unique_rowid. The fix I'm interested in is having the sql nodes also grab an instance ID using the same protocol we added and use that for unique_rowid.
Anyway, if we do that, do we want a concept of a leased id as something different from either of these for use just in unique_rowid?
|
docs/RFCS/20211130_node_id_refactoring.md, line 39 at r1 (raw file): Previously, stevendanna (Steven Danna) wrote…
It's true, I considered that. The problem is the existing usage of |
|
docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file): Previously, ajwerner wrote…
I think I'd say that we should tolerate arbitrary SQLInstanceIDs and use the same mechanism to lease them, but tweaked so that we try to get the SQL Instance ID that equals the node ID (when it is small enough). That way in practical and testing scenarios the mapping stays trivial. I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries. |
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)
docs/RFCS/20211130_node_id_refactoring.md, line 9 at r1 (raw file):
I am a bit worried in general that losing access to the sql instances table means that SQL nodes have to effectively shut down for all queries.
After the initial write to the instances table, the only thing you have to worry about losing access to shutting down queries is the sqlliveness table. That already would prevent jobs from working. It will also in the not too distant future also affect all sql queries if/when we adopt it for descriptor leases. As it stands, that's already the case for system.leases. The key difference is that system.leases defaults to 5m timeout and sqlliveness defaults to 1m. If that alone causes anxiety, I have no concerns about changing the sqlliveness timeout to also be 5m, at which point, the number of points of failure and the allowed failure duration would be the same.
FWIW, node liveness failures affect availability of legitimately all of KV and it has a timeout of 9s IIRC. I don't think the fault tolerance table of node liveness differs from that of sqlliveness in the host tenant.
|
docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file): Previously, stevendanna (Steven Danna) wrote…
I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface. |
knz
left a comment
There was a problem hiding this comment.
LGTM modulo the specific naming of the types which I still consider an open question (but we'll process the team's input on that this week)
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @RaduBerinde, @rimadeodhar, @stevendanna, and @tbg)
docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):
in the context of a KV RPC or a SQL (eg DistSQL) RPC. With multi-tenant support, a new type `base.SQLInstanceID` was added. Various pieces of the code were refactored to use this ID instead. However, there is a lot of remaining code
I wouldn't call it "a lot". I think the text of the RFC can be more specific here:
- through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)
- in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.
I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.
docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):
func (id KVNodeID) String() string { return fmt.Sprintf("kv%d", id)
I recommend we keep the n prefix - we have a lot of support tooling that expects it.
docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):
type SQLNodeID int32 func (id KVNodeID) String() string {
you mean SQLNodeID here?
docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):
// GenericNodeID contains either a KVNodeID or a SQLNodeID. It is meant to be
Based on our prior convo, I'd be open to "ServerID" too.
docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):
Previously, RaduBerinde wrote…
I am open to that. I'd love to hear @knz's take who has a better idea of all the code that would use the generic interface.
We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.
NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.
docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):
return SQLNodeID(-id).String() default: return "0" // or "uninitialized"
We use "?" here.
docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):
## Implementation proposal Implementing the proposed state involves changing a lot of existing code. We
I'm mildly hopeful this won't be "a lot". We'll give it a try in December.
This is a proposal for representing SQL Instance and KV Node IDs in the CockroachDB code. Release note: None
4e82767 to
3aa6473
Compare
RaduBerinde
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @rimadeodhar, @stevendanna, and @tbg)
docs/RFCS/20211130_node_id_refactoring.md, line 18 at r1 (raw file):
Previously, knz (kena) wrote…
I wouldn't call it "a lot". I think the text of the RFC can be more specific here:
- through the RPC code and the node-to-node (or SQL-to-SQL) external APIs, the code does not need to know the conceptual difference between them, and sometimes uses the NodeID type for both out of convenience (we want to avoid code duplication)
- in DistSQL we need to introduce new concepts to replace NodeID to plan distributed queries across multiple SQL instances.
I don't know of other areas that mixes and matches the two types. We were pretty thorough the first time around when we introduced SQLInstanceID.
Done.
docs/RFCS/20211130_node_id_refactoring.md, line 42 at r1 (raw file):
Previously, knz (kena) wrote…
I recommend we keep the
nprefix - we have a lot of support tooling that expects it.
Done.
docs/RFCS/20211130_node_id_refactoring.md, line 51 at r1 (raw file):
Previously, knz (kena) wrote…
you mean
SQLNodeIDhere?
Done.
docs/RFCS/20211130_node_id_refactoring.md, line 56 at r1 (raw file):
Previously, knz (kena) wrote…
Based on our prior convo, I'd be open to "ServerID" too.
Updated the naming section proposing specific candidates: ServerID / SQLInstanceID / KVNodeID
docs/RFCS/20211130_node_id_refactoring.md, line 61 at r1 (raw file):
Previously, knz (kena) wrote…
We'd need to do a performance analysis. The RPC code mostly uses the generic type, and I'm afraid of indirecting through interfaces on a performance hotpath.
NB: we can introduce explicit conv functions now, and flatten to an interface later. This choices does not paint us into a corner.
Made a note of this in an Alternatives section.
docs/RFCS/20211130_node_id_refactoring.md, line 70 at r1 (raw file):
Previously, knz (kena) wrote…
We use "?" here.
Done.
docs/RFCS/20211130_node_id_refactoring.md, line 102 at r1 (raw file):
Previously, knz (kena) wrote…
I'm mildly hopeful this won't be "a lot". We'll give it a try in December.
Rephrased a bit.
|
Haven't scrutinized in depth but the approach looks good to me. |
rimadeodhar
left a comment
There was a problem hiding this comment.
LGTM as well.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
AlexTalks
left a comment
There was a problem hiding this comment.
Two questions:
- Is the plan to keep this in
pkg/base, or inpkg/roachpb, whereNodeIDis currently defined? - What is the plan for refactoring wrappers, such as
base.SQLIDContainer? (this may be out of scope here, just wanted to ask)
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
RaduBerinde
left a comment
There was a problem hiding this comment.
Is the plan to keep this in pkg/base, or in pkg/roachpb, where NodeID is currently defined?
Not sure.. Probably base
- What is the plan for refactoring wrappers, such as base.SQLIDContainer? (this may be out of scope here, just wanted to ask)
I think the details are TBD, but I'm thinking there is one container implementation which uses GenericNodeID and the KVNodeContainer and SQLNodeContainer add convenience methods on top of that.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
AlexTalks
left a comment
There was a problem hiding this comment.
My main feedback about the location (which, again, may be out of scope of this mini RFC), is that putting it anywhere other than pkg/roachpb will limit its use in some places, since we can't use it as a replacement in a place where we might otherwise use NodeID.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
|
To me it was completely going without saying the new type would be in |
|
Sorry I meant the base type in |
RaduBerinde
left a comment
There was a problem hiding this comment.
Ah my mistake, I didn't realize base is actually on top of roachpb 😅
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
There was a problem hiding this comment.
And storage/enginepb is below even that 😕 despite the fact that we are adding what perhaps could be a GenericNodeID there in #73500, I'm not sure it makes sense to refactor these types down to that level in the dependency chain, especially for what is essentially an opaque int32 to lower levels of the stack.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
knz
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, @RaduBerinde, and @stevendanna)
docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):
which sets up the `SQLNodeID` to equal the `KVNodeID`. * the DistSQL planner which knows that we need to deploy a flow to the `SQLNodeID` with the same numeric value as the leasehoder's `KVNodeID`.
nit: add a qualifier to this sentence "in non-multitenant deployments"
(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)
RaduBerinde
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
docs/RFCS/20211130_node_id_refactoring.md, line 104 at r2 (raw file):
Previously, knz (kena) wrote…
nit: add a qualifier to this sentence "in non-multitenant deployments"
(otherwise, I am curious as to what the plan is wrt planning distsql flows on multitenant clusters)
Isn't the sentence below enough ("this code does not run for non-system tenants")?
There are no definite plans yet for tenants. One possibility is to pick sql instances on different AZs and split ranges according to the leaseholder AZs. In any case, there will be one piece of code which will use the leaseholder information (KVNodeID) as input and produce SQLNodeIDs as output.
|
ok yes that works |
|
Should I merge this or keep it open while implementation is underway? If merge, should we finalize the names? |
|
We discussed in the team meeting that we can merge and can always change it after the fact if we find better names. TFTRs! bors r+ |
|
Build failed (retrying...): |
73309: rfc: mini RFC for refactoring node IDs r=RaduBerinde a=RaduBerinde This is a proposal for representing SQL Instance and KV Node IDs in the CockroachDB code. We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately. Release note: None Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
|
Build failed: |
AlexTalks
left a comment
There was a problem hiding this comment.
Perhaps not directly described in this RFC, but wanted to comment here that if at all possible, KV would like to request that we refactor these ID types to be in a separate package that has no dependencies (e.g. pkg/util/nodeid) so that they can be used in any package, similar to how HLC Timestamps are defined in pkg/util/hlc/timestamp.proto. This way, they can be used in packages that cannot depend on roachpb (such as storage/enginepb).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @stevendanna)
|
^ That makes sense to me. bors r+ |
|
Build succeeded: |
This is a proposal for representing SQL Instance and KV Node IDs in
the CockroachDB code.
We've had discussions around this as part of various PRs, but it would be useful to agree on the big picture separately.
Release note: None