Wait for tasks to finish after the forcemerge yml test by joegallo · Pull Request #85683 · elastic/elasticsearch

joegallo · 2022-04-04T17:28:16Z

We've seen a number of tests that fail because they expect there to be zero indices or shards in the cluster, but instead there is actually a .tasks index. We were never able to reproduce these failures on demand, and our theory was that there's some random internal cluster task that happens from time to time and that you had to win the unlucky test lottery for that task to happen at the right moment for one of these tests to catch it in the act and fail.

That theory does not hold water.

We're not getting unlucky by having a random cluster task happen; rather, we're getting unlucky in the order in which the tests are executed, specifically when a test runs right after one particular test:

[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [indices:admin/auto_create, indices:admin/forcemerge, indices:data/write/index, internal:cluster/shard/started].
[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=indices.forcemerge/10_basic/Force merge with wait_for_completion parameter] after test
[2022-04-04T11:15:05,008][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=nodes.stats/11_indices_metrics/Metric - blank for indices shard_stats] before test
[2022-04-04T11:15:05,028][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] Stash dump on test failure [{
[...]

It is random, because the order of the tests is random and the timing is also (somewhat) random. To fail, a test that cares about the number of indices or shards needs to run right after a test that leaks a task, and it turns out that the leaking test is indices.forcemerge/10_basic/Force merge with wait_for_completion parameter. See the following snippets from the full raw log of the associated test failures: #83190 (comment), #83256 (comment), #84975 (comment), #85670 (comment).
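For anyone who wants to check a raw log for the same pattern, here is a rough sketch in Python. It assumes the ordering seen in the excerpt above (the "still tasks running" warning, then the leaking test's "after test" line, then the next test's "before test" line); `find_leaks` and the regexes are made up for this illustration and are not part of the test framework.

```python
import re

# Log lines copied from the excerpt above.
LOG = """\
[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [indices:admin/auto_create, indices:admin/forcemerge, indices:data/write/index, internal:cluster/shard/started].
[2022-04-04T11:15:04,954][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=indices.forcemerge/10_basic/Force merge with wait_for_completion parameter] after test
[2022-04-04T11:15:05,008][INFO ][o.e.t.r.ClientYamlTestSuiteIT] [test] [yaml=nodes.stats/11_indices_metrics/Metric - blank for indices shard_stats] before test
"""

LEAK = re.compile(
    r"There are still tasks running after this test "
    r"that might break subsequent tests \[(.*)\]\."
)
MARKER = re.compile(r"\[yaml=(.*)\] (before|after) test$")


def find_leaks(log_text):
    """Return (leaking_test, leaked_actions, next_test) triples found in the log."""
    results = []
    pending = None  # actions from a warning not yet attributed to a test
    current = None  # (leaking_test, actions) waiting for the next "before test"
    for line in log_text.splitlines():
        leak = LEAK.search(line)
        if leak:
            pending = [action.strip() for action in leak.group(1).split(",")]
            continue
        marker = MARKER.search(line)
        if not marker:
            continue
        name, phase = marker.groups()
        if phase == "after" and pending is not None:
            current = (name, pending)
            pending = None
        elif phase == "before" and current is not None:
            results.append((current[0], current[1], name))
            current = None
    return results


for leaking_test, actions, next_test in find_leaks(LOG):
    print(f"{leaking_test!r} leaked {actions} before {next_test!r}")
```

Run against a full raw log, each triple names the test that leaked, what it leaked, and which test ran immediately afterwards.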

My first commit puts one of these failing tests right next to the problematic task-leaking test so that we can reproduce the failure reliably -- I got about 10 failures in 100 runs. My second commit adds a new arity of waitForPendingTasks and calls it from cleanUpCluster. With this commit in place, I was not able to reproduce the expected failure in 100 attempts. My last commit reverts the first commit. edit: Slightly different direction now.
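The general idea of waiting for pending tasks during cleanup can be sketched as a polling loop. This is a minimal illustration in Python, not the project's actual Java implementation: `list_tasks`, `ignore`, and the timeout and interval values are hypothetical names and defaults chosen for the sketch.

```python
import time


def wait_for_pending_tasks(list_tasks, ignore=lambda action: False,
                           timeout=30.0, interval=0.1):
    """Poll until list_tasks() reports no actions other than ignored ones.

    list_tasks is a hypothetical callable that returns the action names of
    the tasks currently running in the cluster.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = [action for action in list_tasks() if not ignore(action)]
        if not remaining:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"tasks still running: {remaining}")
        time.sleep(interval)


# Simulated cluster: the forcemerge task is gone after three polls.
polls = iter([["indices:admin/forcemerge"]] * 3 + [[]])
wait_for_pending_tasks(lambda: next(polls), interval=0.01)
```

The `ignore` hook is there because some tasks are expected to be running at any given moment, so a cleanup step of this kind needs a way to skip them.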

Closes #83190
Closes #83256
Closes #84975
Closes #85670

With this in place, you can reproduce this error something like 1 in 5 times via: ./gradlew ':rest-api-spec:yamlRestTest' --tests "org.elasticsearch.test.rest.ClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.forcemerge/*}" -Dtests.seed=20CDF00A66820798

elasticmachine · 2022-04-04T17:28:20Z

Pinging @elastic/es-data-management (Team:Data Management)

nik9000 · 2022-04-04T17:30:48Z

Closes #83190
Closes #83256
Closes #84975
Closes #85670

Heroic

elasticmachine · 2022-04-04T17:56:22Z

Pinging @elastic/clients-team (Team:Clients)

nik9000 · 2022-04-04T18:29:49Z

nik9000 approved these changes 11 minutes ago

There are a lot of failing tests though! Hmmmm.

nik9000 · 2022-04-04T18:32:07Z

There are a lot of failing tests though! Hmmmm.

Those are real failures caused by this change. I think in the short term we should probably ignore those tasks. But in the long term maybe we should properly shut down the things running them.

nik9000

Revoking my approval pending figuring out the sneaky failures in xpack. I don't know how to do that to be honest.

joegallo · 2022-04-04T19:14:23Z

+1, I have a different approach in mind that limits the scope of the change and should still solve the problem (hopefully).

joegallo · 2022-04-04T20:53:25Z

Rather than changing the cluster clean up logic for all tests, I’ve just made it so this one test doesn’t leak a task.

nik9000

LGTM

I think it could be nice to assert that no tasks leak. But that's a bigger change. Are you willing to try and wrestle that one in later?

nik9000 · 2022-04-04T21:03:20Z

no tasks leak.

Or "only a specific list of tasks leak"
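As a rough sketch of what an "only a specific list of tasks may leak" assertion could look like (Python, for illustration only; `unexpected_tasks`, `assert_no_leaked_tasks`, and the `internal:` allowlist entry are hypothetical and not taken from the test framework):

```python
def unexpected_tasks(running_actions, allowed_prefixes=()):
    """Return the running actions that match none of the allowed prefixes."""
    allowed = tuple(allowed_prefixes)
    return sorted(a for a in set(running_actions) if not a.startswith(allowed))


def assert_no_leaked_tasks(running_actions, allowed_prefixes=()):
    """Fail if any running action falls outside the allowlist."""
    leaked = unexpected_tasks(running_actions, allowed_prefixes)
    assert not leaked, f"test leaked tasks: {leaked}"


# Actions taken from the log excerpt in the PR description.
running = [
    "indices:admin/auto_create",
    "indices:admin/forcemerge",
    "indices:data/write/index",
    "internal:cluster/shard/started",
]

# Hypothetical allowlist: tolerate internal cluster actions only.
print(unexpected_tasks(running, allowed_prefixes=("internal:",)))
```

With an empty allowlist this degenerates into the stricter "no tasks leak" check.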

joegallo · 2022-04-04T21:25:50Z

++, I see the value of that -- this fixes a subset of the problem, specifically the subset that was causing a problem for some of the data management tests, but the overall class of problems is still unsolved and it would be good to make sure that all tests in general are clean in that way. I'll file a ticket and see if I can do some broad categorization.

joegallo · 2022-04-05T14:29:37Z

Are you willing to try and wrestle that one in later?

I filed #85700 and I'll do the rounds to try to get some traction on it.

joegallo added 3 commits April 4, 2022 13:09

waitForPendingTasks in cleanUpCluster

50e2779

Revert "Copy this test yml for error reproduction"

9fdf2d5

This reverts commit 3714a2e.

joegallo added the >test, :Data Management/ILM+SLM, v8.2.0, v7.17.3, and v8.3.0 labels Apr 4, 2022

joegallo requested a review from nik9000 April 4, 2022 17:28

elasticmachine added the Team:Data Management (obsolete) label Apr 4, 2022

sethmlarson added the Team:Clients label Apr 4, 2022

nik9000 approved these changes Apr 4, 2022

nik9000 requested changes Apr 4, 2022

joegallo added 3 commits April 4, 2022 14:42

Revert "waitForPendingTasks in cleanUpCluster"

cee3f5c

This reverts commit 50e2779.

Revert "Revert "Copy this test yml for error reproduction""

85a04a7

This reverts commit 9fdf2d5.

Explicitly wait for all tasks to complete

1af71ca

joegallo added 2 commits April 4, 2022 15:48

Revert "Revert "Revert "Copy this test yml for error reproduction"""

c9df153

This reverts commit 85a04a7.

Merge branch 'master' into wait-for-tasks-to-finish

d3d13bd

joegallo requested a review from nik9000 April 4, 2022 20:49

nik9000 approved these changes Apr 4, 2022

joegallo changed the title ~~Wait for tasks to finish~~ Wait for tasks to finish after the forcemerge yml test Apr 4, 2022

joegallo merged commit 92852f3 into elastic:master Apr 4, 2022

joegallo deleted the wait-for-tasks-to-finish branch April 4, 2022 21:28

joegallo removed the v7.17.3 label Apr 4, 2022

joegallo added a commit that referenced this pull request Apr 4, 2022

Wait for tasks to finish after the forcemerge yml test (#85683)

f2ce727

This was referenced Apr 5, 2022

Revert "Mute ClientYamlTestSuiteIT "Get all aliases via /_alias" and … #85694

Merged

Tests that leak tasks #85700

Open

mark-vieira mentioned this pull request Dec 21, 2022

Slow execution time for indices.forcemerge/10_basic/Force merge with wait_for_completion parameter REST test #92483

Open

joegallo merged 8 commits into elastic:master from joegallo:wait-for-tasks-to-finish

4 participants