
gossip: Liveness not gossiped after single node restarts #27731

@bdarnell

Description


In a nine-node cluster running on GCE (which contained node IDs higher than 9 due to past decommissions), the following sequence of events occurred:

  1. Node 12 was restarted (to upgrade from 2.0.3 to 2.0.4)
  2. Gossip-based metrics (live nodes, underreplicated ranges) started to indicate problems. It's unclear whether there were any real problems. The web UI showed 3x as many ranges as there should be, 2/3 of which were "underreplicated", and most nodes showed 6 of the 9 nodes as being down.
  3. Node 12 was restarted again, to revert to 2.0.3. Nothing really changed; the gossip metric problems remained the same.
  4. Node 4 was restarted (on a hunch because this node had been linked to another superficially-similar incident). This triggered something that brought everything back to normal, about an hour after the first restart.

This appears to be a gossip problem. By the liveness.livenodes metric, we can see that three nodes (12, 4, and 9) saw all 9 nodes as alive. The other six nodes only considered one node (themselves) to be alive most of the time, although this number fluctuated a bit.

[Graph: liveness.livenodes per node during the outage]

The KV portions of NodeLiveness were working correctly, as shown by the nodes that believed all 9 nodes were alive and confirmed by the liveness.heartbeatsuccesses metric. Only the gossip portion was (mostly) missing.

Complete information from the outage is not available, but the gossip graph shows no abnormalities while the cluster is healthy (for example, there are no single points of failure that could partition the graph). The image in #27652 is taken from this cluster after the outage.

At the time of the outage, the cluster was experiencing unusual clock offsets on several nodes (4, 5, and 12). The clock offset graph shows a sawtooth pattern with offsets on these nodes steadily increasing to about 200ms, then snapping back to 0. This pattern occurred for roughly the 24-hour period surrounding the outage but has not been seen before or since. It is suspected that this is due to a widely-reported GCP outage on the same day.

Gossip relies on the monotonicity of the system clock (with an in-process ratchet). If the clock jumps backwards while the process is being restarted, some messages could get dropped (for a duration after startup equal to the clock jump). This would include the startup gossip messages in which a new node announces its presence. However, the liveness messages are re-gossiped every ~5s, so as far as I can tell the system should recover from any dropped messages within a few seconds.

Metadata
Labels

A-kv-gossip
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
