Fix up ClusterServiceIT #90397

Merged
DaveCTurner merged 1 commit into elastic:main from DaveCTurner:2022-09-27-fix-ClusterServiceIT
Sep 27, 2022
@DaveCTurner
Member

ClusterServiceIT#testPendingUpdateTask has some unbounded waits, relies on the clock advancing by at least 1ms (which might not happen), and leaves the cluster service thread blocked on failure, which causes knock-on effects. This commit addresses these problems.
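
To illustrate two of the problems described above (unbounded waits, and a blocked thread that is never released on failure), here is a minimal sketch in plain Java. It is not the code from this PR: `BoundedWaitSketch`, `runBlockedTask`, and the single-thread executor standing in for the cluster service thread are all hypothetical names chosen for illustration. The idea is that every wait carries a timeout, and the latch that blocks the worker is released in a `finally` block so a failing assertion cannot leave the worker stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class BoundedWaitSketch {

    /**
     * Blocks a worker thread, then releases it. Returns true if the worker
     * started and finished within the given timeout. Hypothetical helper,
     * not part of Elasticsearch.
     */
    static boolean runBlockedTask(long timeoutSeconds) throws InterruptedException {
        // Stand-in for the cluster service thread (an assumption for this sketch).
        final ExecutorService worker = Executors.newSingleThreadExecutor();
        final CountDownLatch started = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(1);
        try {
            worker.execute(() -> {
                started.countDown();
                try {
                    // Bounded wait: the worker gives up after the timeout.
                    release.await(timeoutSeconds, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });

            // Bounded wait for the worker to start.
            if (started.await(timeoutSeconds, TimeUnit.SECONDS) == false) {
                return false;
            }

            // Assertions that might throw would go here.

            release.countDown();
            return finished.await(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
            // Always unblock the worker, even if an assertion above failed.
            release.countDown();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed in time: " + runBlockedTask(10));
    }
}
```

Counting down an already-released latch is a no-op, so the `finally` block is safe on both the success and failure paths.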

@DaveCTurner added the >test, :Distributed/Cluster Coordination, and v8.6.0 labels on Sep 27, 2022
@elasticsearchmachine added the Team:Distributed label on Sep 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

}
Thread.sleep(100);
final var startNanoTime = System.nanoTime();
while (TimeUnit.MILLISECONDS.convert(System.nanoTime() - startNanoTime, TimeUnit.NANOSECONDS) <= 0) {
Contributor

This looks weird to me. Does this mean that you wait for at least 1ms to have passed? And if it hasn't, you then wait for 100ms (which defeats the purpose of the while loop -- except for rare cases where the sleep is interrupted within 1ms)?

Contributor

Ah, re-reading the PR description clarifies this. Indeed, you want to ensure that it waits at least 1ms. Wouldn't it be better, though, to wrap that Thread.sleep in a try-catch for InterruptedException?

Member Author

I encountered a test failure where even the existing 100ms sleep didn't result in System.nanoTime() returning a newer time.

I don't think we need to catch an InterruptedException here; we can just fail the test in that case and let the test runner work out what to do with the interrupt.
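
For readers following the thread, the approach discussed here can be sketched as a standalone loop: keep sleeping until `System.nanoTime()` shows that at least one whole millisecond has elapsed, and let `InterruptedException` propagate to the caller rather than catching it. This is a hedged sketch, not the PR's code: the class and method names are hypothetical, and the 1ms sleep interval is my own choice.

```java
import java.util.concurrent.TimeUnit;

public final class ClockAdvance {

    /**
     * Blocks until System.nanoTime() has advanced by at least one whole
     * millisecond and returns the elapsed time in nanoseconds.
     * An interrupt is not caught; it propagates to the caller.
     */
    static long awaitAtLeastOneMillisecond() throws InterruptedException {
        final long startNanoTime = System.nanoTime();
        long elapsedNanos = System.nanoTime() - startNanoTime;
        // toMillis truncates, so this stays 0 until a full millisecond has passed.
        while (TimeUnit.NANOSECONDS.toMillis(elapsedNanos) <= 0) {
            Thread.sleep(1); // sleep interval is an assumption of this sketch
            elapsedNanos = System.nanoTime() - startNanoTime;
        }
        return elapsedNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        final long elapsedNanos = awaitAtLeastOneMillisecond();
        System.out.println("elapsed ms: " + TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}
```

Because the loop re-checks the clock after every sleep, it does not depend on a single sleep being long enough for the clock to visibly advance.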

Contributor

Wow, that seems like a nasty bug. I see, thanks for explaining.

Contributor

@kingherc kingherc left a comment

LGTM

@DaveCTurner merged commit be149bd into elastic:main on Sep 27, 2022
@DaveCTurner deleted the 2022-09-27-fix-ClusterServiceIT branch on September 27, 2022 12:49
@DaveCTurner
Member Author

Ah, I just realised the failure I was chasing was on a branch that uses a much coarser clock for these stats than System::nanoTime. This is still a good fix, just much less important than I thought.
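
The comment above does not say which coarser clock that branch used, so the following is only a generic illustration of why a coarse clock can report no progress even after real time has passed. `CachedClock` is a hypothetical class, not an Elasticsearch API: it returns the value captured at its last `refresh()`, so two reads between refreshes are identical regardless of how far the underlying source has moved.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public final class CoarseClockSketch {

    /** Hypothetical cached clock: reads return the value seen at the last refresh(). */
    static final class CachedClock implements LongSupplier {
        private final LongSupplier source;
        private final AtomicLong cached;

        CachedClock(LongSupplier source) {
            this.source = source;
            this.cached = new AtomicLong(source.getAsLong());
        }

        /** Re-reads the underlying time source. */
        void refresh() {
            cached.set(source.getAsLong());
        }

        @Override
        public long getAsLong() {
            return cached.get();
        }
    }

    public static void main(String[] args) {
        // A fake time source, so the example is deterministic.
        final AtomicLong fakeNanos = new AtomicLong(0);
        final CachedClock clock = new CachedClock(fakeNanos::get);

        final long before = clock.getAsLong();
        fakeNanos.addAndGet(100_000_000L); // 100ms passes on the underlying source
        final long stale = clock.getAsLong(); // cached value has not moved yet

        clock.refresh();
        final long fresh = clock.getAsLong();

        System.out.println("stale read unchanged: " + (before == stale));
        System.out.println("advance after refresh (ns): " + (fresh - before));
    }
}
```

With a clock like this, a test that measures elapsed time needs to wait for the cached value itself to change, not just for a fixed sleep to finish.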

Labels

:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection)
Team:Distributed (Meta label for distributed team)
>test (Issues or PRs that are addressing/adding tests)
v8.6.0

3 participants