Add stateless ReindexRelocationNodePicker impl#140267
Add stateless ReindexRelocationNodePicker impl#140267PeteGillinElastic merged 14 commits intoelastic:mainfrom
ReindexRelocationNodePicker impl#140267Conversation
4476680 to
81c4a42
Compare
This adds an implmentation of the helper which picks a node to relocate a reindex task to on shutdown to be used in stateless. This changes the mechanism for choosing the correct implementation that was added in elastic#139081. The previous mechanism assumed that we'd want *serverless* to behave differently, and so it defined a default implementation and allowed for that to be swapped out via another plugin. But we actually want *stateless* to behave differently, so we can take the simpler approach of doing everything inside the `reindex` plugin and just pick the correct implementation based on settings. The implementation added previously has been renamed, because it is now explicitly used in stateful rather than just being used by default.
81c4a42 to
d6afa2f
Compare
|
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing) |
…node-picker-in-stateless
…node-picker-in-stateless
…node-picker-in-stateless
…node-picker-in-stateless
szybia
left a comment
There was a problem hiding this comment.
few considerations, nothing to block on
fwiw if you agree with some suggestions things should be changed in both implementations of ReindexRelocationNodePicker
| public String pickNode(DiscoveryNodes nodes) { | ||
| String currentNodeId = nodes.getLocalNodeId(); | ||
| if (currentNodeId == null) { | ||
| logger.debug( |
There was a problem hiding this comment.
should this be a higher-priority log? (ik you're doing the same as in StatefulNodePicker)
but sounds like a very wrong situation to be in
and also sounds useful to get this in logs if it ever happens and we're investigating why a relocation didn't happen? 🤔
There was a problem hiding this comment.
Yeah, I think you're probably right. FWIW, DiscoveryNodes.isLocalNodeElectedMaster() handles the case where localNodeId == null and there's a comment saying we don't know yet the local node id, which implies that it is expected under some circumstances. However, I would imagine that this is something that happens early in the discovery process, and it shouldn't have got as far as running real user work (such as reindex tasks) without assigning this.
I'll bump it up to warn, in both impls.
| .toList(); | ||
| if (eligibleDedicatedCoordinatingNodes.isEmpty() == false) { | ||
| String newNodeId = randomNodeId(eligibleDedicatedCoordinatingNodes); | ||
| logger.debug("Chose dedicated coordinating node ID {} for relocating a reindex task from node {}", newNodeId, currentNodeId); |
There was a problem hiding this comment.
doesn't need to be here but starting conversation
i'm thinking we'll want to have some logging to say i'm relocating this task to this node
if so, i'm imagining we'll have it higher up in the call stack and not here?
There was a problem hiding this comment.
Yeah, anything like that would be higher up the stack. It's not really the concern of this thing to know anything much about the context it's being used in.
| new UpdateByQueryMetrics(services.telemetryProvider().getMeterRegistry()), | ||
| new DeleteByQueryMetrics(services.telemetryProvider().getMeterRegistry()), | ||
| new PluginComponentBinding<>(ReindexRelocationNodePicker.class, relocationNodePicker.get()) | ||
| new PluginComponentBinding<>( |
There was a problem hiding this comment.
i think potential for this to mess anything up is very small, but maybe a will sleep 1% better at night improvement
should we put this being returned in the list behind feature flag so "truly" none of our changes can affect release build?
There was a problem hiding this comment.
I think that would be tricky. The binding will be used in the @Inject constructor of TransportReindexAction, and injection will fail if there isn't anything bound to this key.
I guess that we could maybe mark it as @Nullable in that constructor, and bind it to null if the feature flag isn't set. However, once we remove the feature flag, we won't really want it to be nullable. And I worry that we'd forget to chase down the places we'd made it nullable (we'll be passing it into Reindexer, I think) and change that. I'd be more worried about that than I would about the fact that we're instantiating this thing in production without using it.
| private static final Set<DiscoveryNodeRole> SEARCH_NODE_ROLES = Set.of( | ||
| DiscoveryNodeRole.SEARCH_ROLE, | ||
| DiscoveryNodeRole.TRANSFORM_ROLE, | ||
| DiscoveryNodeRole.REMOTE_CLUSTER_CLIENT_ROLE | ||
| ); |
There was a problem hiding this comment.
will probably scan for it myself, but if you get here before i do
where are these defined (for learning and for double-checking)?
(might be private, therefore slack?)
There was a problem hiding this comment.
If you're asking how I know that these are the roles assigned to each node type in serverless: I just did GET /_nodes on a running instance. (As it happens, you prompted me to go poking around in the code for the controller, which is where this stuff is set up. That raised one curious thing, there's a thread in slack if you're interested.)
...es/reindex/src/main/java/org/elasticsearch/reindex/StatelessReindexRelocationNodePicker.java
Outdated
Show resolved
Hide resolved
| "Trying to pick a node to relocate a reindex task to, but there are no dedicated coordinating or indexing nodes " | ||
| + "(perhaps excluding the current node): the relocation attempt will be aborted" | ||
| ); | ||
| return null; |
There was a problem hiding this comment.
fwiw, imo would be nice for interface to be Optional<String>
not a fan of @Nullable
There was a problem hiding this comment.
Yeah, that's probably an improvement. Done.
Co-authored-by: Szymon Bialkowski <szybia@tuta.io>
…node-picker-in-stateless
Thanks! All done, except for one where I've pushed back for reasons. |
| import java.util.Optional; | ||
| import java.util.Random; | ||
|
|
||
| class StatelessReindexRelocationNodePicker implements ReindexRelocationNodePicker { |
There was a problem hiding this comment.
I wonder if we should avoid picking a node that's marked for shutdown in NodesShutdownMetadata, since relocating to a to-be-shutdown node will just lead to a second relocation? I think a node is removed from DiscoveryNodes by NodeLeftExecutor when it's already shutdown, so it seems a to-be-shutdown node can still exists in DiscoveryNodes.
The NodeShutdownAllocationDecider has something similar to avoid nodes marked for shutdown. In our case, the logic to exclude shutdown nodes could be shared by stateless and stateful?
There was a problem hiding this comment.
That's a really good idea, thanks for pointing it out.
This PR has been in CI purgatory for ages and finally has a green light (if you'll forgive the mixed metaphors) so I'm going to merge it and do this (for both cases) as a follow-up.
This adds an implmentation of the helper which picks a node to relocate a reindex task to on shutdown to be used in stateless. This changes the mechanism for choosing the correct implementation that was added in elastic#139081. The previous mechanism assumed that we'd want *serverless* to behave differently, and so it defined a default implementation and allowed for that to be swapped out via another plugin. But we actually want *stateless* to behave differently, so we can take the simpler approach of doing everything inside the `reindex` plugin and just pick the correct implementation based on settings. The implementation added previously has been renamed, because it is now explicitly used in stateful rather than just being used by default. Co-authored-by: Szymon Bialkowski <szybia@tuta.io> Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This adds an implmentation of the helper which picks a node to relocate a reindex task to on shutdown to be used in stateless.
This changes the mechanism for choosing the correct implementation that was added in #139081. The previous mechanism assumed that we'd want serverless to behave differently, and so it defined a default implementation and allowed for that to be swapped out via another plugin. But we actually want stateless to behave differently, so we can take the simpler approach of doing everything inside the
reindexplugin and just pick the correct implementation based on settings. The implementation added previously has been renamed, because it is now explicitly used in stateful rather than just being used by default.Closes elastic/elasticsearch-team#2093