Fixing remote master stability request when there has never been an elected master#89214
Conversation
|
Pinging @elastic/es-data-management (Team:Data Management) |
andreidan
left a comment
There was a problem hiding this comment.
Thanks for the great description Keith and for working on fixing these tricky cases
This generally looks good, left a few suggestions
| } catch (AssertionError e) { | ||
| /* | ||
| * This handles a fairly rare edge case. If transportService.sendRequest throws a non-remote exception and if | ||
| * assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole |
There was a problem hiding this comment.
| * assesrtions are enabled in the JVM, then an AssertionError is thrown. In this case we don't want to kill the whole | |
| * assertions are enabled, then an AssertionError is thrown. In this case we don't want to kill the whole |
| final ThreadContext threadContext = transportService.getThreadPool().getThreadContext(); | ||
| try (ThreadContext.StoredContext ignored = threadContext.stashContext()) { | ||
| threadContext.markAsSystemContext(); | ||
| beginPollingRemoteMasterStabilityDiagnostic(); | ||
| } |
There was a problem hiding this comment.
Doing business logic in the constructor is an anti-pattern as you're racing against the initializer thread (e.g. an uncompletethis could escape in the async code)
I think the clusterService.addListener(this); calls should be done outside the constructor too (as it publishes this)
Should we have an init or start method where we do these things and call this new method from the outside after we called the constructor?
There was a problem hiding this comment.
Actually by doing that I can call init() much later in Node's startup, and avoid several of the other problems in this PR.
| } else { | ||
| cancelPollingRemoteMasterStabilityDiagnostic(); | ||
| } | ||
| if (currentMaster == null && clusterService.localNode().isMasterNode() == false) { |
There was a problem hiding this comment.
So both cancel... methods are called every time there's a non-null master right?
Can we combine and simplify the if statemets to reflect this?
ie.
if master == null {
if isMasterEligible { pollABC } else { pollEFG }
} else {
cancelABC
cancelEFG
}
What do you think?
There was a problem hiding this comment.
Since I moved the call to beginPollingRemoteMasterStabilityDiagnostic() out of the constructor, I was able to check whether a node was master-eligible before calling beginPollingRemoteMasterStabilityDiagnostic(). So now we only need to cancel it for non-master-eligible nodes. So I put it back the way it used to be.
|
|
||
| void beginPollingRemoteMasterStabilityDiagnostic() { | ||
| assert ThreadPool.assertCurrentThreadPool(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME); | ||
| // Note that this method must be called from the system context because it calls internal transport actions |
There was a problem hiding this comment.
could this be coded as an assertion on the thread context? ie. ThreadContext#isSystemContext
There was a problem hiding this comment.
It's a little more complicated than that because our test code doesn't run in the system context. I'll add a method to ThreadPool to work with unit tests like we do for assertCurrentThreadPool.
andreidan
left a comment
There was a problem hiding this comment.
LGTM thanks for fixing this Keith
| } catch (Exception e) { | ||
| responseConsumer.accept(responseTransformationFunction.apply(null, e)); | ||
| } |
There was a problem hiding this comment.
Would the possible exceptions not be passed to the listener we pass to sendRequest? It's fine to be defensive either way :) I'm just curious (I'd say if some exceptions escape that'd be a bug?)
There was a problem hiding this comment.
We only catch a NodeNotConnectedException thrown from ConnectionManager#getConnection so in theory something else could escape from here, although in practice that's the only exception that any implementations will throw. We should probably just catch Exception here to make sure, rather than putting the onus on callers. Would you open an issue for the distrib team to fix that?
| assert clusterService.localNode().equals(localNodeFactory.getNode()) | ||
| : "clusterService has a different local node than the factory provided"; | ||
| transportService.acceptIncomingRequests(); | ||
| injector.getInstance(CoordinationDiagnosticsService.class).start(); |
There was a problem hiding this comment.
Shall we add a comment why it's important for this service to be started here?
This fixes an edge case in the master stability polling code from #89014. If there has not been an elected master node for the entire life of a non-master-eligible node, then
clusterChanged()will have never been called on that node, sobeginPollingRemoteMasterStabilityDiagnostic()will have never been called. And even though the node might know of some master-eligible nodes, it will never have requested diagnostic information from them. This PR adds a call tobeginPollingRemoteMasterStabilityDiagnosticinCoordinationDiagnosticsService's constructor to cover this edge case. In almost all cases,clusterChanged()will be called within 10 seconds so the polling will never occur. However if there is no master node then there will be no cluster changed events, andclusterChanged()will not be called, and the results of the polling will likely be useful.This PR has several possibly controversial pieces of code. I'm listing them here with some discussion:
beginPollingRemoteMasterStabilityDiagnostic()in theconstructorobject's initialization code,beginPollingRemoteMasterStabilityDiagnostic()is no longer solely called from the cluster change thread. However, this call happens before the object is registered as a cluster service listener, so there is no new thread safety concern.beginPollingRemoteMasterStabilityDiagnostic()in theconstructorobject's initialization code, we have to explicitly switch to the system context so that the various transport requests work in secure mode.When we're in the constructor, we don't actually know yet whether we're a master eligible node or not, so we kick offbeginPollingRemoteMasterStabilityDiagnostic()for all node types, including master-eligible nodes. This will be fairly harmless for master eligible nodes though. In the worst case, they'll retrieve some information that they'll never use. This explains whyclusterChanged()now cancels polling even if we are on a master eligible node.It is now possible that we useclusterService.state()before it is ready when we're trying to get the list of master-eligible peers. In production mode this method returns null, so we can check that before using it. If assertions are enabled in the JVM, just calling that method throws anAssertionError. I'm currently catching that with the assumption that it is harmless because there does not seem to be a way around it (without even further complicating code).It is now possible that we calltransportService.sendRequest()before the transport service is ready. This happens if the server is initializing unusually slowly (i.e. it takes more than 10 seconds to complete theNodeconstructor) and if assertions are enabled. I don't see a way around this without further complicating the code, so I'm catchingAssertionErrorand moving on, with the assumption that it will work 10 seconds later when it runs again. I'm also catching and storingException, which I think I should have been doing before anyway.Note: Points 3, 4, and 5 are no longer relevant because I moved the call to
beginPollingRemoteMasterStabilityDiagnostic()out of the constructor, and am now calling it after the transport service and cluster state have been initialized.