Skip to content

Sometimes relax lag detector in CoordinatorTests#140434

Merged
DaveCTurner merged 3 commits intoelastic:mainfrom
DaveCTurner:2026/01/09/CoordinatorTests-relaxed-lag-detector
Jan 12, 2026
Merged

Sometimes relax lag detector in CoordinatorTests#140434
DaveCTurner merged 3 commits intoelastic:mainfrom
DaveCTurner:2026/01/09/CoordinatorTests-relaxed-lag-detector

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

Related to ES-10778: since we're running with a relaxed lag detector in
some environments now we really should be covering this in the test
suite to verify that the lag detector isn't required for liveness.

Relates #108690

Related to ES-10778: since we're running with a relaxed lag detector in
some environments now we really should be covering this in the test
suite to verify that the lag detector isn't required for liveness.

Relates elastic#108690
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v9.4.0 labels Jan 9, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 9, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

assertThat("leader should be last to ack", ackCollector.getSuccessfulAckIndex(leader), equalTo(1));

follower0.setClusterStateApplyResponse(ClusterStateApplyResponse.SUCCEED);
leader.submitValue(randomLong()); // follower0 acks next value allowing cluster to stabilise
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the lag detector and absent any further cluster state updates the lagging node remains lagging, but that's what we want. The point here is that if another cluster state update happens then it stops lagging again.

Comment on lines +602 to +604
cluster.runFor(DEFAULT_STABILISATION_TIME, "allowing new leader election");
cluster.getAnyLeader().submitValue(randomLong()); // old leader acks this value allowing cluster to stabilise
cluster.stabilise(DEFAULT_CLUSTER_STATE_UPDATE_DELAY);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here: stabilising the cluster will not do anything to the lagging node, we need to wait to allow for a master election, then do another cluster state update, to bring the node back in sync.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if there is a relaxed lag detector, we need to wait for the cluster to stabilise again. If there is not a relaxed lag detector, would we be in this state already?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_CLUSTER_STATE_UPDATE_DELAY is short, just a few round-trips, so this doesn't add much more time.

In practice DEFAULT_STABILISATION_TIME is long enough for the lag detector to do its thing when it's not in relaxed mode.

Comment on lines +1815 to +1818
.put(
LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(),
TimeValue.timeValueMillis(defaultMillis(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING))
)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is the only one that actually needs the lag detector, because we are asserting that it logs what it logs.

defaultMillis(PUBLISH_TIMEOUT_SETTING) + 2 * DEFAULT_DELAY_VARIABILITY + defaultMillis(
LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING
) + DEFAULT_DELAY_VARIABILITY + 2 * DEFAULT_DELAY_VARIABILITY,
DEFAULT_CLUSTER_STATE_UPDATE_DELAY + defaultMillis(PUBLISH_TIMEOUT_SETTING) + 2 * DEFAULT_DELAY_VARIABILITY
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout here apparently didn't include enough time to do the publication.

Copy link
Copy Markdown
Member

@pxsalehi pxsalehi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner enabled auto-merge (squash) January 12, 2026 10:47
Copy link
Copy Markdown
Contributor

@joshua-adams-1 joshua-adams-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit d5b2a4e into elastic:main Jan 12, 2026
35 checks passed
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 12, 2026
This test is kinda bogus with an atomic register because it doesn't
actually time out as claimed. Nonetheless, we do want to know that the
nacks are genuinely delivered eventually under these conditions. This
commit adjusts the test to handle the relaxed a lag detector introduced
in elastic#140434.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 12, 2026
This test is kinda bogus with an atomic register because it doesn't
actually time out as claimed. Nonetheless, we do want to know that the
nacks are genuinely delivered eventually under these conditions. This
commit adjusts the test to handle the relaxed a lag detector introduced
in elastic#140434.

Closes elastic#140509
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
Related to ES-10778: since we're running with a relaxed lag detector in
some environments now we really should be covering this in the test
suite to verify that the lag detector isn't required for liveness.

Relates elastic#108690
DaveCTurner added a commit that referenced this pull request Jan 12, 2026
This test is kinda bogus with an atomic register because it doesn't
actually time out as claimed. Nonetheless, we do want to know that the
nacks are genuinely delivered eventually under these conditions. This
commit adjusts the test to handle the relaxed lag detector introduced
in #140434.

Closes #140509
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Jan 21, 2026
This test is kinda bogus with an atomic register because it doesn't
actually time out as claimed. Nonetheless, we do want to know that the
nacks are genuinely delivered eventually under these conditions. This
commit adjusts the test to handle the relaxed lag detector introduced
in elastic#140434.

Closes elastic#140509
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 26, 2026
Today you can set an arbitrarily long timeout on the `LagDetector` but
there's no facility to just completely turn it off. The usual values of
`0` and `-1` that one might expect to do so are forbidden.

There's no good reason for this any more, see e.g. elastic#140434, so with this
commit we adjust the setting to accept nonpositive timeouts and
interpret them to mean that no lag detection should take place.

Relates ES-10778
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jan 26, 2026
Today you can set an arbitrarily long timeout on the `LagDetector` but
there's no facility to just completely turn it off. The usual values of
`0` and `-1` that one might expect to do so are forbidden.

There's no good reason for this any more, see e.g. elastic#140434, so with this
commit we adjust the setting to accept nonpositive timeouts and
interpret them to mean that no lag detection should take place.

Relates ES-10778
DaveCTurner added a commit that referenced this pull request Jan 27, 2026
Today you can set an arbitrarily long timeout on the `LagDetector` but
there's no facility to just completely turn it off. The usual values of
`0` and `-1` that one might expect to do so are forbidden.

There's no good reason for this any more, see e.g. #140434, so with this
commit we adjust the setting to accept nonpositive timeouts and
interpret them to mean that no lag detection should take place.

Relates ES-10778
schase-es pushed a commit to schase-es/elasticsearch that referenced this pull request Jan 28, 2026
Today you can set an arbitrarily long timeout on the `LagDetector` but
there's no facility to just completely turn it off. The usual values of
`0` and `-1` that one might expect to do so are forbidden.

There's no good reason for this any more, see e.g. elastic#140434, so with this
commit we adjust the setting to accept nonpositive timeouts and
interpret them to mean that no lag detection should take place.

Relates ES-10778
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants