Skip to content

Fixing the conditions for fetching remote master history#89472

Merged
masseyke merged 5 commits intoelastic:mainfrom
masseyke:fix/stable-master-check
Aug 22, 2022
Merged

Fixing the conditions for fetching remote master history#89472
masseyke merged 5 commits intoelastic:mainfrom
masseyke:fix/stable-master-check

Conversation

@masseyke
Copy link
Copy Markdown
Member

As part of the master stability health check, when a non-master node thinks that it has seen the master go null a number of times it checks with that master to see if it also thinks it has gone null. This check is triggered by a cluster change event telling us that the current master is null and the previous was not (and after checking that the master has gone null repeatedly). The problem is that it is possible that the master went null due to something like a long GC pause, and that same pause could mean that the check times out. Once the master comes back and we get a cluster change even that the master is not null, we're not reaching out to the master for its history. So when a user query comes in all we know is that there was a timeout exception. That means we would get a YELLOW response when we ought to get a GREEN. This commit changes the check in the cluster changed event so that it also queries the master if it has just become non-null (and still checks that the master has gone null repeatedly).
This commit also fixes a couple of minor bugs in the test code.
Closes #89431

@masseyke masseyke added >bug :Distributed/Health Issues for the health report API v8.5.0 labels Aug 18, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Aug 18, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested a review from andreidan August 22, 2022 15:11
Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for fixing this Keith

@masseyke
Copy link
Copy Markdown
Member Author

@elasticsearchmachine update branch

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine update branch

@masseyke masseyke merged commit 5b3d51d into elastic:main Aug 22, 2022
@masseyke masseyke deleted the fix/stable-master-check branch August 22, 2022 16:18
elasticsearchmachine pushed a commit that referenced this pull request Aug 22, 2022
Unmuting the test muted by #89501 because it has been fixed in #89472
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Sep 13, 2022
Fixing the conditions that the health API uses to determine when to check with a master node for its view
of master history if the master appears to have gone null repeatedly.
elasticsearchmachine pushed a commit that referenced this pull request Sep 13, 2022
…0040)

Fixing the conditions that the health API uses to determine when to check with a master node for its view
of master history if the master appears to have gone null repeatedly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Health Issues for the health report API Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] StableMasterDisruptionIT testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable failing

4 participants