kvserver,ui: remove usage of NodeLivenessProto #50707
Description
Let us consider the NodeLivenessProto. It enumerates across two distinct sets of things: the "membership" status of the corresponding node as it pertains to the cluster (live/active, decommissioning, decommissioned), and the "process" state of that node as seen by the rest of the cluster (live/active, unavailable, dead). "Unavailable" is the state we use to represent a node that has only recently started misbehaving, one that we're unable to contact. We use this intermediate state for a fixed deadThreshold time.Duration before marking the node as "dead". It's possible for two of these "states" to hold simultaneously: when we decide to decommission a dead node, per this enum it can be considered both "unavailable" and "decommissioning". This makes for confusing semantics, and the code attempting to disambiguate across these possible states (in kvserver.LivenessStatus) is wholly arbitrary (it takes in this deadThreshold time.Duration to determine when a node transitions from being simply unavailable to being considered dead).
There's no good reason for the "deadness" of a node to be something that's proto-backed. For callers interested in whether a node is considered "dead", "unavailable", or "live", it should suffice to capture the given node's liveness record (which includes the expiration time), compare it against the current time, and use just that to determine "deadness" or "unavailability". That aside, a node we mark as "dead" may not actually be dead. According to this API, we consider a node "dead" when it hasn't been able to heartbeat its liveness record for deadThreshold (which defaults to 5m) after it was set to expire. It could very well be that the node sits behind a persistent network failure, and is still up and running.
Now, let's consider all the current consumers of this NodeLivenessProto enum, and how we'd go about changing them. Use of this enum is primarily exposed through our StatusServer interface:
(a) cockroach {gen haproxy,debug zip}: Uses the StatusServer API to poll the targeted node for membership status, in order to skip over decommissioning nodes
(b) cockroach node {recommission,decommission}: Uses the StatusServer API to poll the targeted nodes for membership status, to generate warnings for users (when re-decommissioning a node, etc.)
(c) kvserver.Allocator: Working through storeDetail.status and kvserver.LivenessStatus, it uses both the "membership" status and the "process" state to inform placement of replicas.
(d) Admin UI: It uses the StatusServer API to poll the "process" state of given nodes, which powers the "unavailable"/"dead" view in the Admin UI.
Now, for our proposal: let's get rid of NodeLivenessProto. We're separately working (#50329) to introduce a decommissioned bit to CRDB, which does not make use of this existing infrastructure. For (a) and (b), since they depend on having a view of the decommissioning/decommissioned status of a given node, it should suffice to simply consult this new API. For (c), as suggested above, we can simply fetch the liveness record of interest (which captures its expiration time), and use the current time to determine whether or not a given store is a viable placement target for replicas. This can be well insulated behind storeDetail.status. For (d), we can expose a view of the liveness record (which after #50329 will contain the "membership" status as well) and use the expiration time in the Admin UI code to determine "dead"-ness. In fact, digging through why we introduced this proto-backed "dead"-ness in the first place, we did so in #20086 precisely to capture node "process" state in the Admin UI.
Jira issue: CRDB-4099