Skip to content

Wait on readiness service on start#102325

Open
rjernst wants to merge 5 commits intoelastic:mainfrom
rjernst:startup/wait_for_ready
Open

Wait on readiness service on start#102325
rjernst wants to merge 5 commits intoelastic:mainfrom
rjernst:startup/wait_for_ready

Conversation

@rjernst
Copy link
Copy Markdown
Member

@rjernst rjernst commented Nov 16, 2023

Node.start() currently does not return until the node is serving http. However, other parts of the system may not be quite ready. For example, file settings may not have been applied, and we may not yet have a master node. The readiness service's purpose is to identify when the node is actually ready for serving requests.

This commit adjusts the end of Node start to wait on the readiness service being ready. The ramifications are that the main Elasticsearch thread will not exit until the node is actually ready to serve requests, and the cli will not exit (when in daemon mode) until the node is ready.

closes #136955

Node.start() currently does not return until the node is serving http.
However, other parts of the system may not be quite ready. For example,
file settings may not have been applied, and we may not yet have a
master node. The readiness service's purpose is to identify when the
node is actually ready for serving requests.

This commit adjusts the end of Node start to wait on the readiness
service being ready. The ramifications are that the main Elasticsearch
thread will not exit until the node is actually ready to serve requests,
and the cli will not exit (when in daemon mode) until the node is ready.
@rjernst rjernst added >refactoring :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown labels Nov 16, 2023
@rjernst rjernst requested a review from a team November 16, 2023 23:48
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added v8.12.0 Team:Core/Infra Meta label for core/infra team labels Nov 16, 2023
Copy link
Copy Markdown
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only a minor Q

CountDownLatch ready = new CountDownLatch(1);
readinessService.addBoundAddressListener(address -> ready.countDown());
try {
ready.await();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not introducing a timeout here - I suppose there are plenty of other places where we already timeout so there is no need for another one here, correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep exactly, there are other timeouts on startup eg systemd has a 75 second default timeout. Right now we will actually report to systemd that the node is ready before the readiness probe responds true, but this change will align them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown >refactoring Team:Core/Infra Meta label for core/infra team v8.19.15 v9.1.11 v9.2.9 v9.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] ReadinessClusterIT testReadinessDuringRestartsNormalOrder failing

5 participants