Fixing remote master stability request when there has never been an elected master by masseyke · Pull Request #89214 · elastic/elasticsearch

masseyke · 2022-08-09T15:33:39Z

This fixes an edge case in the master stability polling code from #89014. If there has not been an elected master node for the entire life of a non-master-eligible node, then clusterChanged() will never have been called on that node, so beginPollingRemoteMasterStabilityDiagnostic() will never have been called either. Even though the node might know of some master-eligible nodes, it will never have requested diagnostic information from them. This PR adds a call to beginPollingRemoteMasterStabilityDiagnostic() in CoordinationDiagnosticsService's constructor to cover this edge case. In almost all cases, clusterChanged() will be called within 10 seconds, so the polling will never occur. However, if there is no master node, there will be no cluster changed events, clusterChanged() will not be called, and the results of the polling will likely be useful.
This PR has several possibly controversial pieces of code. I'm listing them here with some discussion:

1. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the ~~constructor~~ object's initialization code, beginPollingRemoteMasterStabilityDiagnostic() is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.
2. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the ~~constructor~~ object's initialization code, we have to explicitly switch to the system context so that the various transport requests work in secure mode.
3. When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off beginPollingRemoteMasterStabilityDiagnostic() for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains why clusterChanged() now cancels polling even if we are on a master eligible node.
4. It is now possible that we use clusterService.state() before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws an AssertionError. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).
5. It is now possible that we call transportService.sendRequest() before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete the Node constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catching AssertionError and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storing Exception, which I think I should have been doing before anyway.

Note: Points 3, 4, and 5 are no longer relevant because I moved the call to beginPollingRemoteMasterStabilityDiagnostic() out of the constructor, and am now calling it after the transport service and cluster state have been initialized.
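
The timing described above (a poll scheduled with a delay, which a cluster-changed event normally cancels before it runs) can be illustrated with a minimal sketch. This is not the code in this PR: `DelayedPollSketch` and its methods are hypothetical names, and a plain `ScheduledExecutorService` is assumed as a stand-in for Elasticsearch's thread pool.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the polling part of a diagnostics service. */
class DelayedPollSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger pollsExecuted = new AtomicInteger();
    private ScheduledFuture<?> pendingPoll; // guarded by "this"

    /** Schedules one poll after the given delay, replacing any poll that is still pending. */
    synchronized void beginPolling(long delayMillis) {
        cancelPolling();
        pendingPoll = scheduler.schedule(() -> { pollsExecuted.incrementAndGet(); }, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Would be called when a cluster-changed event reports an elected master. */
    synchronized void cancelPolling() {
        if (pendingPoll != null) {
            pendingPoll.cancel(false); // do not interrupt a poll that is already running
            pendingPoll = null;
        }
    }

    synchronized boolean isPollPending() {
        return pendingPoll != null && pendingPoll.isDone() == false;
    }

    int pollsExecuted() {
        return pollsExecuted.get();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

If `cancelPolling()` is never called because no cluster-changed event arrives, the scheduled poll simply runs after the delay, which is the case this PR is concerned with.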

elasticsearchmachine · 2022-08-09T15:52:39Z

Pinging @elastic/es-data-management (Team:Data Management)

andreidan

Thanks for the great description Keith and for working on fixing these tricky cases

This generally looks good, left a few suggestions

andreidan · 2022-08-10T13:46:52Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+                } catch (AssertionError e) {
+                    /*
+                     * This handles a fairly rare edge case. If transportService.sendRequest throws a non-remote exception and if
+                     * assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole


Suggested change

-                     * assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole
+                     * assertions are enabled, then an AssertionError is thrown. In this case we don't want to kill the whole

andreidan · 2022-08-10T14:31:23Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+        final ThreadContext threadContext = transportService.getThreadPool().getThreadContext();
+        try (ThreadContext.StoredContext ignored = threadContext.stashContext()) {
+            threadContext.markAsSystemContext();
+            beginPollingRemoteMasterStabilityDiagnostic();
+        }


Doing business logic in the constructor is an anti-pattern as you're racing against the initializer thread (e.g. an incomplete `this` could escape in the async code)

I think the clusterService.addListener(this); calls should be done outside the constructor too (as it publishes this)

Should we have an init or start method where we do these things and call this new method from the outside after we called the constructor?

Yeah I can do that.

Actually by doing that I can call init() much later in Node's startup, and avoid several of the other problems in this PR.
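
A minimal sketch of the two-phase initialization discussed here, under stated assumptions: `ListenerRegistry`, `ClusterListenerSketch`, and `DiagnosticsServiceSketch` are hypothetical stand-ins, not Elasticsearch classes. The constructor only assigns fields, and `start()` is the single place where `this` is published to the listener registry.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical listener interface standing in for a cluster state listener. */
interface ClusterListenerSketch {
    void clusterChanged(String electedMaster);
}

/** Hypothetical registry standing in for the cluster service. */
class ListenerRegistry {
    private final List<ClusterListenerSketch> listeners = new CopyOnWriteArrayList<>();

    void addListener(ClusterListenerSketch listener) {
        listeners.add(listener);
    }

    void publish(String electedMaster) {
        for (ClusterListenerSketch listener : listeners) {
            listener.clusterChanged(electedMaster);
        }
    }

    int listenerCount() {
        return listeners.size();
    }
}

class DiagnosticsServiceSketch implements ClusterListenerSketch {
    private final ListenerRegistry registry;
    private volatile String lastSeenMaster;
    private boolean started;

    DiagnosticsServiceSketch(ListenerRegistry registry) {
        // Only assign fields here: "this" is not handed to any other object or thread.
        this.registry = registry;
    }

    /** Called by the owner once its dependencies are ready; publishes "this" exactly once. */
    synchronized void start() {
        if (started) {
            throw new IllegalStateException("already started");
        }
        started = true;
        registry.addListener(this);
    }

    @Override
    public void clusterChanged(String electedMaster) {
        lastSeenMaster = electedMaster;
    }

    String lastSeenMaster() {
        return lastSeenMaster;
    }
}
```

Because registration is deferred to `start()`, the caller decides when in its own startup sequence the service begins receiving events.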

andreidan · 2022-08-10T14:35:53Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

-            } else {
-                cancelPollingRemoteMasterStabilityDiagnostic();
-            }
+        if (currentMaster == null && clusterService.localNode().isMasterNode() == false) {


So both cancel... methods are called every time there's a non-null master right?

Can we combine and simplify the if statements to reflect this? I.e.:

if master == null {
    if isMasterEligible { pollABC } else { pollEFG }
} else {
    cancelABC
    cancelEFG
}

What do you think?

Since I moved the call to beginPollingRemoteMasterStabilityDiagnostic() out of the constructor, I was able to check whether a node was master-eligible before calling beginPollingRemoteMasterStabilityDiagnostic(). So now we only need to cancel it for non-master-eligible nodes. So I put it back the way it used to be.
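
For reference, the branch structure the reviewer proposed can be written as a small pure function. This is only a sketch of the suggestion, not the merged code (the reply above says the original structure was restored). `PollingDecisionSketch` and the `Action` names are hypothetical, mirroring the `pollABC`/`pollEFG` placeholders in the review comment.

```java
import java.util.EnumSet;
import java.util.Set;

class PollingDecisionSketch {
    /** Hypothetical actions mirroring the placeholders used in the review comment. */
    enum Action { POLL_ABC, POLL_EFG, CANCEL_ABC, CANCEL_EFG }

    /**
     * Decides what to do on a cluster-changed event.
     *
     * @param currentMaster    the elected master's name, or null if there is none
     * @param isMasterEligible whether the local node is master-eligible
     */
    static Set<Action> onClusterChanged(String currentMaster, boolean isMasterEligible) {
        if (currentMaster == null) {
            // No master: start the kind of polling that matches the local node's role.
            return isMasterEligible ? EnumSet.of(Action.POLL_ABC) : EnumSet.of(Action.POLL_EFG);
        } else {
            // A master exists: both kinds of polling are cancelled.
            return EnumSet.of(Action.CANCEL_ABC, Action.CANCEL_EFG);
        }
    }
}
```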

andreidan · 2022-08-10T14:36:43Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java


    void beginPollingRemoteMasterStabilityDiagnostic() {
-        assert ThreadPool.assertCurrentThreadPool(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME);
+        // Note that this method must be called from the system context because it calls internal transport actions


Could this be coded as an assertion on the thread context, i.e. ThreadContext#isSystemContext?

It's a little more complicated than that because our test code doesn't run in the system context. I'll add a method to ThreadPool to work with unit tests like we do for assertCurrentThreadPool.
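
One way such an assertion helper could look is sketched below. Everything here is an assumption for illustration: `SystemContextAssertSketch` is a hypothetical class, a `ThreadLocal` flag stands in for `ThreadContext#isSystemContext`, and the `"TEST-"` thread-name prefix is an assumed convention for recognizing unit-test threads, not something confirmed by this PR.

```java
class SystemContextAssertSketch {
    // Hypothetical per-thread flag standing in for ThreadContext#isSystemContext.
    private static final ThreadLocal<Boolean> SYSTEM_CONTEXT = ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Assumption: unit-test threads can be recognized by a name prefix.
    static final String TEST_THREAD_PREFIX = "TEST-";

    static void markAsSystemContext() {
        SYSTEM_CONTEXT.set(Boolean.TRUE);
    }

    static void clearContext() {
        SYSTEM_CONTEXT.remove();
    }

    static boolean isSystemContext() {
        return SYSTEM_CONTEXT.get();
    }

    /** Pure check, kept separate so it can be tested without enabling JVM assertions. */
    static boolean isPermitted(boolean inSystemContext, String threadName) {
        return inSystemContext || threadName.startsWith(TEST_THREAD_PREFIX);
    }

    /** Returns true so that callers can write: assert assertInSystemContext(); */
    static boolean assertInSystemContext() {
        String threadName = Thread.currentThread().getName();
        assert isPermitted(isSystemContext(), threadName)
            : "expected the system context but was on thread [" + threadName + "]";
        return true;
    }

    /** Example caller: the check costs nothing when assertions are disabled. */
    static void beginPolling() {
        assert assertInSystemContext();
        // ... internal transport actions would be sent from here ...
    }
}
```

Wrapping the check in `assert` keeps it out of production runs while still catching misuse in tests and in development builds with `-ea`.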

andreidan

LGTM thanks for fixing this Keith

andreidan · 2022-08-11T08:40:52Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java

+                } catch (Exception e) {
+                    responseConsumer.accept(responseTransformationFunction.apply(null, e));
+                }


Would the possible exceptions not be passed to the listener we pass to sendRequest? It's fine to be defensive either way :) I'm just curious (I'd say if some exceptions escape that'd be a bug?)

We only catch a NodeNotConnectedException thrown from ConnectionManager#getConnection so in theory something else could escape from here, although in practice that's the only exception that any implementations will throw. We should probably just catch Exception here to make sure, rather than putting the onus on callers. Would you open an issue for the distrib team to fix that?

Thanks David. Opened #89274
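
The defensive pattern discussed in this thread, where an exception thrown synchronously by the send call is routed to the same consumer as an asynchronous failure, can be sketched as follows. `DefensiveSendSketch`, `Transport`, and `ResponseListener` are hypothetical stand-ins and do not reflect the real `TransportService` signatures; only the `responseConsumer`/`responseTransformationFunction` names are taken from the diff above.

```java
import java.util.function.BiFunction;
import java.util.function.Consumer;

class DefensiveSendSketch {
    /** Hypothetical callback pair standing in for a transport response handler. */
    interface ResponseListener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    /** Hypothetical transport: may call the listener, or may throw synchronously. */
    interface Transport<T> {
        void sendRequest(String node, ResponseListener<T> listener);
    }

    static <T, R> void fetch(
        Transport<T> transport,
        String node,
        BiFunction<T, Exception, R> responseTransformationFunction,
        Consumer<R> responseConsumer
    ) {
        ResponseListener<T> listener = new ResponseListener<T>() {
            @Override
            public void onResponse(T response) {
                responseConsumer.accept(responseTransformationFunction.apply(response, null));
            }

            @Override
            public void onFailure(Exception e) {
                responseConsumer.accept(responseTransformationFunction.apply(null, e));
            }
        };
        try {
            transport.sendRequest(node, listener);
        } catch (Exception e) {
            // Synchronous failure: record it the same way as an asynchronous one.
            responseConsumer.accept(responseTransformationFunction.apply(null, e));
        }
    }
}
```

With this shape the caller sees exactly one kind of result, whether the failure arrived through the listener or escaped from the send call itself.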

andreidan · 2022-08-11T09:22:37Z

server/src/main/java/org/elasticsearch/node/Node.java

        assert clusterService.localNode().equals(localNodeFactory.getNode())
            : "clusterService has a different local node than the factory provided";
        transportService.acceptIncomingRequests();
+        injector.getInstance(CoordinationDiagnosticsService.class).start();


Shall we add a comment why it's important for this service to be started here?

Fixing remote master stability request when there has never been an elected master

c53ae46

masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Aug 9, 2022

masseyke requested a review from andreidan August 9, 2022 15:52

masseyke marked this pull request as ready for review August 9, 2022 15:52

elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Aug 9, 2022

andreidan reviewed Aug 10, 2022

masseyke added 2 commits August 10, 2022 12:29

code review feedback

4b13162

removing unnecessary comment

40c38fe

masseyke requested a review from andreidan August 10, 2022 17:38

andreidan approved these changes Aug 11, 2022

code review feedback

2f893b4

masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 11, 2022

elasticsearchmachine merged commit e4a19d4 into elastic:main Aug 11, 2022

masseyke deleted the fix/master-stability-edge-case branch August 11, 2022 14:19

masseyke mentioned this pull request Jun 14, 2023

Fix CoordinationDiagnosticsServiceIT leaving broken global state #96847

Merged
