Skip to content

Add stateless ReindexRelocationNodePicker impl#140267

Merged
PeteGillinElastic merged 14 commits intoelastic:mainfrom
PeteGillinElastic:reindex-relocation-node-picker-in-stateless
Jan 21, 2026
Merged

Add stateless ReindexRelocationNodePicker impl#140267
PeteGillinElastic merged 14 commits intoelastic:mainfrom
PeteGillinElastic:reindex-relocation-node-picker-in-stateless

Conversation

@PeteGillinElastic
Copy link
Copy Markdown
Member

@PeteGillinElastic PeteGillinElastic commented Jan 7, 2026

This adds an implmentation of the helper which picks a node to relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation that was added in #139081. The previous mechanism assumed that we'd want serverless to behave differently, and so it defined a default implementation and allowed for that to be swapped out via another plugin. But we actually want stateless to behave differently, so we can take the simpler approach of doing everything inside the reindex plugin and just pick the correct implementation based on settings. The implementation added previously has been renamed, because it is now explicitly used in stateful rather than just being used by default.

Closes elastic/elasticsearch-team#2093

@PeteGillinElastic PeteGillinElastic added >non-issue :Distributed/Reindex Issues relating to reindex that are not caused by issues further down labels Jan 7, 2026
@PeteGillinElastic PeteGillinElastic force-pushed the reindex-relocation-node-picker-in-stateless branch from 4476680 to 81c4a42 Compare January 7, 2026 14:04
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.
@PeteGillinElastic PeteGillinElastic force-pushed the reindex-relocation-node-picker-in-stateless branch from 81c4a42 to d6afa2f Compare January 7, 2026 15:20
@PeteGillinElastic PeteGillinElastic marked this pull request as ready for review January 7, 2026 17:25
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. label Jan 7, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Copy link
Copy Markdown
Contributor

@szybia szybia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few considerations, nothing to block on

fwiw if you agree with some suggestions things should be changed in both implementations of ReindexRelocationNodePicker

public String pickNode(DiscoveryNodes nodes) {
String currentNodeId = nodes.getLocalNodeId();
if (currentNodeId == null) {
logger.debug(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a higher-priority log? (ik you're doing the same as in StatefulNodePicker)

but sounds like a very wrong situation to be in

and also sounds useful to get this in logs if it ever happens and we're investigating why a relocation didn't happen? 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you're probably right. FWIW, DiscoveryNodes.isLocalNodeElectedMaster() handles the case where localNodeId == null and there's a comment saying we don't know yet the local node id, which implies that it is expected under some circumstances. However, I would imagine that this is something that happens early in the discovery process, and it shouldn't have got as far as running real user work (such as reindex tasks) without assigning this.

I'll bump it up to warn, in both impls.

.toList();
if (eligibleDedicatedCoordinatingNodes.isEmpty() == false) {
String newNodeId = randomNodeId(eligibleDedicatedCoordinatingNodes);
logger.debug("Chose dedicated coordinating node ID {} for relocating a reindex task from node {}", newNodeId, currentNodeId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to be here but starting conversation

i'm thinking we'll want to have some logging to say i'm relocating this task to this node

if so, i'm imagining we'll have it higher up in the call stack and not here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, anything like that would be higher up the stack. It's not really the concern of this thing to know anything much about the context it's being used in.

new UpdateByQueryMetrics(services.telemetryProvider().getMeterRegistry()),
new DeleteByQueryMetrics(services.telemetryProvider().getMeterRegistry()),
new PluginComponentBinding<>(ReindexRelocationNodePicker.class, relocationNodePicker.get())
new PluginComponentBinding<>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think potential for this to mess anything up is very small, but maybe a will sleep 1% better at night improvement

should we put this being returned in the list behind feature flag so "truly" none of our changes can affect release build?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be tricky. The binding will be used in the @Inject constructor of TransportReindexAction, and injection will fail if there isn't anything bound to this key.

I guess that we could maybe mark it as @Nullable in that constructor, and bind it to null if the feature flag isn't set. However, once we remove the feature flag, we won't really want it to be nullable. And I worry that we'd forget to chase down the places we'd made it nullable (we'll be passing it into Reindexer, I think) and change that. I'd be more worried about that than I would about the fact that we're instantiating this thing in production without using it.

Comment on lines +27 to +31
private static final Set<DiscoveryNodeRole> SEARCH_NODE_ROLES = Set.of(
DiscoveryNodeRole.SEARCH_ROLE,
DiscoveryNodeRole.TRANSFORM_ROLE,
DiscoveryNodeRole.REMOTE_CLUSTER_CLIENT_ROLE
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will probably scan for it myself, but if you get here before i do

where are these defined (for learning and for double-checking)?

(might be private, therefore slack?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're asking how I know that these are the roles assigned to each node type in serverless: I just did GET /_nodes on a running instance. (As it happens, you prompted me to go poking around in the code for the controller, which is where this stuff is set up. That raised one curious thing, there's a thread in slack if you're interested.)

"Trying to pick a node to relocate a reindex task to, but there are no dedicated coordinating or indexing nodes "
+ "(perhaps excluding the current node): the relocation attempt will be aborted"
);
return null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, imo would be nice for interface to be Optional<String>

not a fan of @Nullable

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's probably an improvement. Done.

@PeteGillinElastic
Copy link
Copy Markdown
Member Author

few considerations, nothing to block on

fwiw if you agree with some suggestions things should be changed in both implementations of ReindexRelocationNodePicker

Thanks! All done, except for one where I've pushed back for reasons.

import java.util.Optional;
import java.util.Random;

class StatelessReindexRelocationNodePicker implements ReindexRelocationNodePicker {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should avoid picking a node that's marked for shutdown in NodesShutdownMetadata, since relocating to a to-be-shutdown node will just lead to a second relocation? I think a node is removed from DiscoveryNodes by NodeLeftExecutor when it's already shutdown, so it seems a to-be-shutdown node can still exists in DiscoveryNodes.

The NodeShutdownAllocationDecider has something similar to avoid nodes marked for shutdown. In our case, the logic to exclude shutdown nodes could be shared by stateless and stateful?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really good idea, thanks for pointing it out.

This PR has been in CI purgatory for ages and finally has a green light (if you'll forgive the mixed metaphors) so I'm going to merge it and do this (for both cases) as a follow-up.

@PeteGillinElastic PeteGillinElastic merged commit 1fdea95 into elastic:main Jan 21, 2026
35 checks passed
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
This adds an implmentation of the helper which picks a node to
relocate a reindex task to on shutdown to be used in stateless.

This changes the mechanism for choosing the correct implementation
that was added in elastic#139081. The previous mechanism assumed that we'd
want *serverless* to behave differently, and so it defined a default
implementation and allowed for that to be swapped out via another
plugin. But we actually want *stateless* to behave differently, so we
can take the simpler approach of doing everything inside the `reindex`
plugin and just pick the correct implementation based on settings. The
implementation added previously has been renamed, because it is now
explicitly used in stateful rather than just being used by default.

Co-authored-by: Szymon Bialkowski <szybia@tuta.io>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Reindex Issues relating to reindex that are not caused by issues further down >non-issue Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants