Skip to content

Fixing remote master stability request when there has never been an elected master#89214

Merged
elasticsearchmachine merged 4 commits intoelastic:mainfrom
masseyke:fix/master-stability-edge-case
Aug 11, 2022
Merged

Fixing remote master stability request when there has never been an elected master#89214
elasticsearchmachine merged 4 commits intoelastic:mainfrom
masseyke:fix/master-stability-edge-case

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Aug 9, 2022

This fixes an edge case in the master stability polling code from #89014. If there has not been an elected master node for the entire life of a non-master-eligible node, then clusterChanged() will have never been called on that node, so beginPollingRemoteMasterStabilityDiagnostic() will have never been called. And even though the node might know of some master-eligible nodes, it will never have requested diagnostic information from them. This PR adds a call to beginPollingRemoteMasterStabilityDiagnostic in CoordinationDiagnosticsService's constructor to cover this edge case. In almost all cases, clusterChanged() will be called within 10 seconds so the polling will never occur. However if there is no master node then there will be no cluster changed events, and clusterChanged() will not be called, and the results of the polling will likely be useful.
This PR has several possibly controversial pieces of code. I'm listing them here with some discussion:

  1. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the constructor object's initialization code, beginPollingRemoteMasterStabilityDiagnostic() is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.
  2. Because there is now a call to beginPollingRemoteMasterStabilityDiagnostic() in the constructor object's initialization code, we have to explicitly switch to the system context so that the various transport requests work in secure mode.
  3. When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick off beginPollingRemoteMasterStabilityDiagnostic() for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains why clusterChanged() now cancels polling even if we are on a master eligible node.
  4. It is now possible that we use clusterService.state() before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws an AssertionError. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).
  5. It is now possible that we call transportService.sendRequest() before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete the Node constructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catching AssertionError and moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storing Exception, which I think I should have been doing before anyway.

Note: Points 3, 4, and 5 are no longer relevant because I moved the call to beginPollingRemoteMasterStabilityDiagnostic() out of the constructor, and am now calling it after the transport service and cluster state have been initialized.

@masseyke masseyke added >non-issue :Distributed/Health Issues for the health report API v8.5.0 labels Aug 9, 2022
@masseyke masseyke requested a review from andreidan August 9, 2022 15:52
@masseyke masseyke marked this pull request as ready for review August 9, 2022 15:52
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Aug 9, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great description Keith and for working on fixing these tricky cases

This generally looks good, left a few suggestions

} catch (AssertionError e) {
/*
* This handles a fairly rare edge case. If transportService.sendRequest throws a non-remote exception and if
* assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole
* assertions are enabled, then an AssertionError is thrown. In this case we don't want to kill the whole

Comment on lines +182 to +186
final ThreadContext threadContext = transportService.getThreadPool().getThreadContext();
try (ThreadContext.StoredContext ignored = threadContext.stashContext()) {
threadContext.markAsSystemContext();
beginPollingRemoteMasterStabilityDiagnostic();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing business logic in the constructor is an anti-pattern as you're racing against the initializer thread (e.g. an uncompletethis could escape in the async code)

I think the clusterService.addListener(this); calls should be done outside the constructor too (as it publishes this)

Should we have an init or start method where we do these things and call this new method from the outside after we called the constructor?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can do that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually by doing that I can call init() much later in Node's startup, and avoid several of the other problems in this PR.

} else {
cancelPollingRemoteMasterStabilityDiagnostic();
}
if (currentMaster == null && clusterService.localNode().isMasterNode() == false) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So both cancel... methods are called every time there's a non-null master right?

Can we combine and simplify the if statemets to reflect this?
ie.

   if master == null { 
       if isMasterEligible { pollABC } else { pollEFG }
    } else {
       cancelABC
       cancelEFG
    }

What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I moved the call to beginPollingRemoteMasterStabilityDiagnostic() out of the constructor, I was able to check whether a node was master-eligible before calling beginPollingRemoteMasterStabilityDiagnostic(). So now we only need to cancel it for non-master-eligible nodes. So I put it back the way it used to be.


void beginPollingRemoteMasterStabilityDiagnostic() {
assert ThreadPool.assertCurrentThreadPool(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME);
// Note that this method must be called from the system context because it calls internal transport actions
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be coded as an assertion on the thread context? ie. ThreadContext#isSystemContext

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little more complicated than that because our test code doesn't run in the system context. I'll add a method to ThreadPool to work with unit tests like we do for assertCurrentThreadPool.

@masseyke masseyke requested a review from andreidan August 10, 2022 17:38
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for fixing this Keith

Comment on lines +1005 to +1007
} catch (Exception e) {
responseConsumer.accept(responseTransformationFunction.apply(null, e));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the possible exceptions not be passed to the listener we pass to sendRequest? It's fine to be defensive either way :) I'm just curious (I'd say if some exceptions escape that'd be a bug?)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only catch a NodeNotConnectedException thrown from ConnectionManager#getConnection so in theory something else could escape from here, although in practice that's the only exception that any implementations will throw. We should probably just catch Exception here to make sure, rather than putting the onus on callers. Would you open an issue for the distrib team to fix that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David. Opened #89274

assert clusterService.localNode().equals(localNodeFactory.getNode())
: "clusterService has a different local node than the factory provided";
transportService.acceptIncomingRequests();
injector.getInstance(CoordinationDiagnosticsService.class).start();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a comment why it's important for this service to be started here?

@masseyke masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 11, 2022
@elasticsearchmachine elasticsearchmachine merged commit e4a19d4 into elastic:main Aug 11, 2022
@masseyke masseyke deleted the fix/master-stability-edge-case branch August 11, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Health Issues for the health report API >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants