Skip to content

Wait for tasks to finish after the forcemerge yml test#85683

Merged
joegallo merged 8 commits intoelastic:masterfrom
joegallo:wait-for-tasks-to-finish
Apr 4, 2022
Merged

Wait for tasks to finish after the forcemerge yml test#85683
joegallo merged 8 commits intoelastic:masterfrom
joegallo:wait-for-tasks-to-finish

Conversation

@joegallo
Copy link
Copy Markdown
Contributor

@joegallo joegallo commented Apr 4, 2022

We've seen a number of tests that fail because they expect there to be zero indices or shards in the cluster, but instead there is actually a .tasks index. We were never able to reproduce these failures on demand, and our theory was that there's some random internal cluster task that happens from time to time and that you had to win the unlucky test lottery for that task to happen at the right moment for one of these tests to catch it in the act and fail.

That theory does not hold water.

We're not getting unlucky and having a random cluster task happen, rather we're getting unlucky in the order in which the tests are executed and following a specific test:

[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [indices:admin/auto_create, indices:admin/forcemerge, indices:data/write/index, internal:cluster/shard/started].
[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=indices.forcemerge/10_basic/Force merge with wait_for_completion parameter] after test
[2022-04-04T11:15:05,008][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=nodes.stats/11_indices_metrics/Metric - blank for indices shard_stats] before test
[2022-04-04T11:15:05,028][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] Stash dump on test failure [{
[...]

It is random, because the order of the tests is random and also the timing is (somewhat) random. In order to fail, these tests that care about the number of indices or shards need to follow a test that leaks a task and it turns out that that test is indices.forcemerge/10_basic/Force merge with wait_for_completion parameter. See the following snippets from the full raw log of the associated test failures: #83190 (comment), #83256 (comment), #84975 (comment), #85670 (comment).

My first commit puts one of these failing tests right next to the problematic task-leaking test so that we can reproduce the failure reliably -- I got about 10 failures in 100 runs. My second commit adds a new arity of waitForPendingTasks and calls it from cleanUpCluster, with this commit in place, I was not able to reproduce the expected failure in 100 attempts. My last commit reverts the first commit. edit: Slightly different direction now.

Closes #83190
Closes #83256
Closes #84975
Closes #85670

joegallo added 3 commits April 4, 2022 13:09
With this in place, you can reproduce this error something like 1 in 5 times via:

./gradlew ':rest-api-spec:yamlRestTest' --tests "org.elasticsearch.test.rest.ClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.forcemerge/*}" -Dtests.seed=20CDF00A66820798
@joegallo joegallo added >test Issues or PRs that are addressing/adding tests :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v8.2.0 v7.17.3 v8.3.0 labels Apr 4, 2022
@joegallo joegallo requested a review from nik9000 April 4, 2022 17:28
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Apr 4, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Apr 4, 2022

Closes #83190
Closes #83256
Closes #84975
Closes #85670

Heroic

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Apr 4, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/clients-team (Team:Clients)

@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Apr 4, 2022

** nik9000 ** approved these changes 11 minutes ago

There are a lot of failing tests though! Hmmmm.

@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Apr 4, 2022

There are a lot of failing tests though! Hmmmm.

Those are real failures caused by this change. I think in the short term we might should ignore those tasks. But in the long term maybe we should shut down the things running them properly.

Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revoking my approval pending figuring out the sneaky failures in xpack. I don't know how to do that to be honest.

@joegallo
Copy link
Copy Markdown
Contributor Author

joegallo commented Apr 4, 2022

+1, I have a different approach in mind that limits the scope of the change and should still solve the problem (hopefully).

@joegallo joegallo requested a review from nik9000 April 4, 2022 20:49
@joegallo
Copy link
Copy Markdown
Contributor Author

joegallo commented Apr 4, 2022

Rather than changing the cluster clean up logic for all tests, I’ve just made it so this one test doesn’t leak a task.

Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think it could be nice to assert that not tasks leak. But that's a bigger change. Are you willing to try and wrestle than one in later?

@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Apr 4, 2022

not tasks leak.

Or "only a specific list of tasks leak"

@joegallo
Copy link
Copy Markdown
Contributor Author

joegallo commented Apr 4, 2022

++, I see the value of that -- this fixes a subset of the problem, specifically the subset that was causing a problem for some of the data management tests, but the overall class of problems is still unsolved and it would be good to make sure that all tests in general are clean in that way. I'll a file a ticket and see if I can do some broad categorization.

@joegallo joegallo changed the title Wait for tasks to finish Wait for tasks to finish after the forcemerge yml test Apr 4, 2022
@joegallo joegallo merged commit 92852f3 into elastic:master Apr 4, 2022
@joegallo joegallo deleted the wait-for-tasks-to-finish branch April 4, 2022 21:28
@joegallo joegallo removed the v7.17.3 label Apr 4, 2022
@joegallo
Copy link
Copy Markdown
Contributor Author

joegallo commented Apr 5, 2022

Are you willing to try and wrestle than one in later?

I filed #85700 and I'll do the rounds to try to get some traction on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. Team:Clients Meta label for clients team Team:Data Management (obsolete) DO NOT USE. This team no longer exists. >test Issues or PRs that are addressing/adding tests v8.2.0 v8.3.0

Projects

None yet

4 participants