Skip to content

Handle master failure in NodeSeenService#77220

Merged
elasticsearchmachine merged 5 commits intoelastic:masterfrom
AthenaEryma:fix/shard-status-stays-complete
Sep 9, 2021
Merged

Handle master failure in NodeSeenService#77220
elasticsearchmachine merged 5 commits intoelastic:masterfrom
AthenaEryma:fix/shard-status-stays-complete

Conversation

@AthenaEryma
Copy link
Copy Markdown
Contributor

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

Follow-up to #75750
Fixes #76689

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.
@AthenaEryma AthenaEryma requested a review from dakrone September 2, 2021 22:42
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 2, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

Prior to the change to NodeSeenService, I was seeing a roughly 1%-2% failure rate running locally. After the change, I ran several hundred runs with zero failures.

Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

@dakrone I realized that I was missing an == false on the condition - as it was, the cluster state would get updated and the test would pass, but submit update tasks when not necessary to do so. I've updated it to be easier to read, and run the new version several hundred times as well.

@AthenaEryma AthenaEryma requested a review from dakrone September 8, 2021 22:11
Copy link
Copy Markdown
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying it by pulling it into a variable, after removing the logging it LGTM

final boolean thisNodeJustBecameMaster = event.previousState().nodes().isLocalNodeElectedMaster() == false
&& event.state().nodes().isLocalNodeElectedMaster();
if ((event.nodesAdded() || thisNodeJustBecameMaster) == false) {
logger.error("GWB> Bailing early");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably remove this line :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I had those in there to help verify my reading that the == false was missing 🤦

Removed.

.collect(Collectors.toUnmodifiableSet());

if (nodesNotPreviouslySeen.isEmpty() == false) {
logger.error("GWB> Submitting update task for nodes [{}]", nodesNotPreviouslySeen);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one :)


final NodesShutdownMetadata newNodesMetadata = new NodesShutdownMetadata(newShutdownMetadataMap);
if (newNodesMetadata.equals(currentShutdownMetadata)) {
logger.error("GWB> Bailing update task as it's a no-op");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one :)

@AthenaEryma AthenaEryma added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2021
@AthenaEryma
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 62fed2f into elastic:master Sep 9, 2021
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
7.x
7.15

elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@jakelandis jakelandis removed the v8.0.0 label Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown Team:Core/Infra Meta label for core/infra team v7.15.1 v7.16.0 v8.0.0-alpha2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] NodeShutdownShardsIT testShardStatusStaysCompleteAfterNodeLeavesIfRegisteredWhileNodeOffline failing

5 participants