
Node resolution using coordinating_only node attribute has (apparently) incorrect behaviour #139587

@PeteGillinElastic

Description


Elasticsearch Version

Current, perhaps as far back as 6.4.0 (see below)

Installed Plugins

No response

Java Version

bundled

OS Version

All

Problem Description

The method DiscoveryNodes.getCoordinatingOnlyNodes() filters nodes using the predicate n.canContainData() == false && n.isMasterNode() == false && n.isIngestNode() == false. It can therefore return nodes that have other roles, for example the ML or transform role. This doesn't match the documented definition of "coordinating only", under which a coordinating-only node has no explicit roles at all.
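To make the mismatch concrete, here is a minimal, self-contained Java sketch. It does not use the real Elasticsearch classes: the Role enum, the Node record and both predicate methods are hypothetical stand-ins, assuming a node's roles can be modelled as a plain set and that "can contain data" just means having a DATA role. It compares the predicate quoted above with a check based on the documented definition (no roles at all).

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model; NOT the real Elasticsearch DiscoveryNode/DiscoveryNodeRole classes.
class CoordinatingOnlyCheck {
    // Only a handful of roles; the real role list is longer (e.g. the tiered data roles).
    enum Role { DATA, MASTER, INGEST, ML, TRANSFORM }

    // A node is just a name plus its set of roles.
    record Node(String name, Set<Role> roles) {
        boolean canContainData() { return roles.contains(Role.DATA); }
        boolean isMasterNode() { return roles.contains(Role.MASTER); }
        boolean isIngestNode() { return roles.contains(Role.INGEST); }
    }

    // Mirrors the predicate quoted in the issue: only data, master and ingest are checked.
    static boolean currentPredicate(Node n) {
        return n.canContainData() == false && n.isMasterNode() == false && n.isIngestNode() == false;
    }

    // Documented definition: a coordinating-only node has no roles at all (node.roles: [ ]).
    static boolean documentedDefinition(Node n) {
        return n.roles().isEmpty();
    }

    public static void main(String[] args) {
        Node coordinating = new Node("coord-1", EnumSet.noneOf(Role.class));
        Node mlOnly = new Node("ml-1", EnumSet.of(Role.ML));
        for (Node n : new Node[] {coordinating, mlOnly}) {
            System.out.println(n.name() + ": current=" + currentPredicate(n)
                + ", documented=" + documentedDefinition(n));
        }
        // prints:
        // coord-1: current=true, documented=true
        // ml-1: current=true, documented=false
    }
}
```

The ML-only node passes the quoted predicate even though it has an explicit role, which is exactly the discrepancy described above.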

As well as being used in tests, this method is called in DiscoveryNodes.resolveNodes() when one of the arguments is coordinating_only:true or coordinating_only:false.

There appear to be several ways in which this behaviour is exposed to the user. To pick just one example, both GET /_tasks?nodes=coordinating_only:true (get tasks from coordinating-only nodes) and GET /_tasks?nodes=_all,coordinating_only:false (get tasks from all nodes except coordinating-only ones) could be affected. (That is quite an obscure use case; I haven't found any more mainstream ones, but I haven't done an exhaustive search through the call tree.)
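The following Java sketch illustrates how the wrong predicate would change the result of those two node selectors. It is a simplified model, not DiscoveryNodes.resolveNodes() itself: the NodeSelectorSketch class, its resolve method and the sample cluster are hypothetical. It assumes, as the descriptions of the two requests above imply, that the comma-separated parts are applied left to right, with coordinating_only:true adding matching nodes and coordinating_only:false removing them. Real node specifications support more forms (names, attributes, wildcards) and stricter boolean parsing than Boolean.parseBoolean.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, simplified model of node-spec resolution; NOT the real DiscoveryNodes.resolveNodes().
class NodeSelectorSketch {
    enum Role { DATA, MASTER, INGEST, ML }

    // Predicate quoted in the issue: not data, not master, not ingest.
    static final Predicate<Set<Role>> CURRENT =
        roles -> !roles.contains(Role.DATA) && !roles.contains(Role.MASTER) && !roles.contains(Role.INGEST);

    // Documented definition: no roles at all.
    static final Predicate<Set<Role>> DOCUMENTED = Set::isEmpty;

    // Applies comma-separated parts left to right: "_all" adds every node,
    // "coordinating_only:true" adds matching nodes, "coordinating_only:false" removes them,
    // and anything else is treated as a literal node id.
    static Set<String> resolve(String spec, Map<String, Set<Role>> nodes, Predicate<Set<Role>> isCoordinatingOnly) {
        Set<String> resolved = new LinkedHashSet<>();
        for (String part : spec.split(",")) {
            if (part.equals("_all")) {
                resolved.addAll(nodes.keySet());
            } else if (part.startsWith("coordinating_only:")) {
                boolean include = Boolean.parseBoolean(part.substring("coordinating_only:".length()));
                for (Map.Entry<String, Set<Role>> entry : nodes.entrySet()) {
                    if (isCoordinatingOnly.test(entry.getValue())) {
                        if (include) {
                            resolved.add(entry.getKey());
                        } else {
                            resolved.remove(entry.getKey());
                        }
                    }
                }
            } else {
                resolved.add(part);
            }
        }
        return resolved;
    }

    // Example cluster: a data/master node, a coordinating-only node and an ML-only node.
    static Map<String, Set<Role>> sampleCluster() {
        Map<String, Set<Role>> nodes = new LinkedHashMap<>();
        nodes.put("data-1", EnumSet.of(Role.DATA, Role.MASTER));
        nodes.put("coord-1", EnumSet.noneOf(Role.class));
        nodes.put("ml-1", EnumSet.of(Role.ML));
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Set<Role>> nodes = sampleCluster();
        for (String spec : new String[] {"coordinating_only:true", "_all,coordinating_only:false"}) {
            System.out.println(spec + " -> current " + resolve(spec, nodes, CURRENT)
                + ", documented " + resolve(spec, nodes, DOCUMENTED));
        }
        // prints:
        // coordinating_only:true -> current [coord-1, ml-1], documented [coord-1]
        // _all,coordinating_only:false -> current [data-1], documented [data-1, ml-1]
    }
}
```

Under this model, the ML-only node is wrongly included by coordinating_only:true and wrongly excluded by _all,coordinating_only:false.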

Affected versions: The methods above were introduced in 7.0.0. It is possible that the logic was correct then (because there were no built-in roles other than data, master-eligible, and ingest) and that the problematic behaviour was only introduced in 8.0.0, when the ML and transform roles were introduced (although it is hard to be sure since, prior to 8.0.0, node roles could be defined in plugins).

Steps to Reproduce

I ran GET /_tasks?nodes=coordinating_only:true and GET /_tasks?nodes=_all,coordinating_only:false against a single-node dev cluster and stepped through the code in a debugger.

I am moderately confident that, if you created a cluster which included e.g. an ML-only node, GET /_tasks?nodes=coordinating_only:true would return tasks from that node (although I confess I haven't actually verified this).

Logs (if relevant)

No response

Metadata


Assignees

No one assigned

Labels

:Distributed/Distributed, >bug, Team:Distributed

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests