
gossip: Liveness not gossiped after single node restarts #27731

@bdarnell

Description


In a nine-node cluster running on GCE (which contained node IDs higher than 9 due to past decommissions), the following sequence of events occurred:

  1. Node 12 was restarted (to upgrade from 2.0.3 to 2.0.4)
  2. Gossip-based metrics (live nodes, underreplicated ranges) started to indicate problems. It's unclear whether there were any real problems. The web UI showed 3x as many ranges as there should be, 2/3 of which were "underreplicated", and most nodes showed 6 of the 9 nodes as being down.
  3. Node 12 was restarted again, to revert to 2.0.3. Nothing really changed; the gossip metric problems remained the same.
  4. Node 4 was restarted (on a hunch because this node had been linked to another superficially-similar incident). This triggered something that brought everything back to normal, about an hour after the first restart.

This appears to be a gossip problem. By the liveness.livenodes metric, we can see that three nodes (12, 4, and 9) saw all 9 nodes as alive. The other six nodes only considered one node (themselves) to be alive most of the time, although this number fluctuated a bit.

[Graph: liveness.livenodes per node during the outage]

The KV portions of NodeLiveness were working correctly, as shown by the nodes that believed all 9 nodes were alive and confirmed by the liveness.heartbeatsuccesses metric. Only the gossip portion was (mostly) missing.

Complete information from the outage is not available, but the gossip graph shows no abnormalities while the cluster is healthy (for example, there are no single points of failure that could partition the graph). The image in #27652 is taken from this cluster after the outage.

At the time of the outage, the cluster was experiencing unusual clock offsets on several nodes (4, 5, and 12). The clock offset graph shows a sawtooth pattern with offsets on these nodes steadily increasing to about 200ms, then snapping back to 0. This pattern occurred for roughly the 24-hour period surrounding the outage but has not been seen before or since. It is suspected that this is due to a widely-reported GCP outage on the same day.

Gossip relies on the monotonicity of the system clock (with an in-process ratchet). If the clock jumps backwards while the process is being restarted, some messages could get dropped (for a duration after startup equal to the clock jump). This would include the startup gossip messages in which a new node announces its presence. However, the liveness messages are re-gossiped every ~5s, so as far as I can tell the system should recover from any dropped messages within a few seconds.

Metadata
Labels

A-kv-gossip
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
