-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Nodes in process of shutting down should not respond to discovery pings #27328
Description
It looks like a node that is being shutdown will still reply to an discovery ping.
Master node-108 is gracefully shutdown, but still listed in the current nodes list. I guess this is because its just some milliseconds after the shutdown started.
[2017-11-08T15:01:00,779][INFO ][o.e.d.z.ZenDiscovery ] [node-106] master_left [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], reason [shut_down]
[2017-11-08T15:01:00,781][WARN ][o.e.d.z.ZenDiscovery ] [node-106] master left (reason = shut_down), current nodes: nodes:
...
{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}, master
Then 3 seconds later the ping responses coming back and node-108 is still listed. I checked the log on node-108, but there is nothing later than 2017-11-08T15:01:00,843, which makes me wonder why it is still listed in the ping list.
[2017-11-08T15:01:03,785][TRACE][o.e.d.z.ZenDiscovery ] [node-106] full ping responses:
...
--> ping_response{node [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], id[44], master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}],cluster_state_version [420], cluster_name[my-cluster]}
...
[2017-11-08T15:01:03,789][WARN ][o.e.d.z.ZenDiscovery ] [node-106] failed to connect to master [{node-108}{SnsBg56PRuintH0DINacJQ}{Q0yTDGUpQNag1dKS56yoHw}{node108.home.lan}{10.10.10.15:9300}], retrying...
Only explanation I have is that ping was replied before 2017-11-08T15:01:00,843 already, which would match up with the ping starting : :00,843 vs :00,782
[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery ] [node-106] starting to ping
[2017-11-08T15:01:00,782][TRACE][o.e.d.z.ZenDiscovery ] [node-107] starting to ping
Then another ping is sent and 3 seconds later the reply does not list 108 anymore and a new master is elected.
[2017-11-08T15:01:03,798][TRACE][o.e.d.z.ZenDiscovery ] [node-106] starting to ping
[2017-11-08T15:01:06,799][TRACE][o.e.d.z.ZenDiscovery ] [node-106] full ping responses:
...
--> ping_response{node [{node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}], id[39], master [null],cluster_state_version [420], cluster_name[my-cluster]}
[2017-11-08T15:01:06,800][TRACE][o.e.d.z.ZenDiscovery ] [node-106] candidate Candidate{node={node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, clusterStateVersion=420} won election
[2017-11-08T15:01:06,801][DEBUG][o.e.d.z.ZenDiscovery ] [node-106] elected as master, waiting for incoming joins ([1] needed)
[2017-11-08T15:01:06,822][INFO ][o.e.c.s.ClusterService ] [node-106] new_master {node-106}{668jPO1YSweDaqRyQxMRwA}{LwYDV2eBQAC3ialrV4qEbQ}{node106.home.lan}{10.10.10.13:9300}, reason: zen-disco-elected-as-master ([5] nodes joined)[{node-113}{1QzSsxO8ToqgUtfYycgmlw}{NUnBCQ_eSeSZjQV5PE1Jtg}{node113.home.lan}{10.10.10.20:9300}, {node-104}{59gFtGGvQGGy0LLVMIAgig}{Eg5gYe77SYuXB6vZfKKicg}{node104.home.lan}{10.10.10.214:9300}, {node-107}{_-P9koqwQAaBDpPyoflXbw}{j6BbufsTQq6kU-TFbAZW5Q}{node107.home.lan}{10.10.10.14:9300}, {node-105}{kt2uxSxNTDe9JhRuuWORYA}{VFf4A4bfTw6WoagiH6lpTw}{node105.home.lan}{10.10.10.215:9300}, {node-103}{TPSttSagTZq2VCDUhX5cGw}{puBEOLAARtmigK1y0lH5yw}{node103.home.lan}{10.10.10.213:9300}]
Had a quick chat with @ywelsch and it looks like a node in shutdown would still reply to a discovery ping request if the shutdown is not yet finished.
Since such a node will be down shortly, it should not reply to a discovery ping.
It could also speed up master election. I.e. in the above example an extra cycle of 3 seconds was added because the old master was still listed.