Wait for tasks to finish after the forcemerge yml test #85683
joegallo merged 8 commits into elastic:master
Conversation
With this in place, you can reproduce this error something like 1 in 5 times via:
```
./gradlew ':rest-api-spec:yamlRestTest' --tests "org.elasticsearch.test.rest.ClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.forcemerge/*}" -Dtests.seed=20CDF00A66820798
```
This reverts commit 3714a2e.
Pinging @elastic/es-data-management (Team:Data Management)
Pinging @elastic/clients-team (Team:Clients)
There are a lot of failing tests though! Hmmmm.
Those are real failures caused by this change. I think in the short term we should probably ignore those tasks, but in the long term maybe we should properly shut down the things running them.
nik9000 left a comment
Revoking my approval pending figuring out the sneaky failures in xpack. I don't know how to do that to be honest.
+1, I have a different approach in mind that limits the scope of the change and should still solve the problem (hopefully).
Rather than changing the cluster clean up logic for all tests, I’ve just made it so this one test doesn’t leak a task. |
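The actual change lives in the YAML test itself, but the idea can be sketched in plain Java: if a test kicks off an asynchronous force-merge (the `wait_for_completion=false` case), it should block on that task before the test ends, so nothing is left running for the next test to trip over. This is a hypothetical simulation using an executor, not the Elasticsearch test framework's API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a test starts an async "force merge" task and,
// instead of returning immediately (and leaking the task), waits for it
// to complete before the test finishes.
public class WaitForTaskSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService cluster = Executors.newSingleThreadExecutor();

        // Stand-in for the task started by ?wait_for_completion=false
        Future<?> forceMergeTask = cluster.submit(() -> {
            try {
                Thread.sleep(200); // pretend the merge takes a moment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The fix, in miniature: block until the task is done so that
        // no task is still running when the test ends.
        forceMergeTask.get(10, TimeUnit.SECONDS);
        System.out.println("task leaked: " + !forceMergeTask.isDone());

        cluster.shutdown();
    }
}
```

In the real test, the equivalent step is waiting on the returned task ID before the test completes, rather than touching the shared cluster-cleanup logic.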
nik9000 left a comment
LGTM
I think it could be nice to assert that no tasks leak. But that's a bigger change. Are you willing to try and wrestle that one in later?
Or "only a specific list of tasks leak"
++, I see the value of that -- this fixes a subset of the problem, specifically the subset that was causing problems for some of the data management tests, but the overall class of problems is still unsolved, and it would be good to make sure that all tests in general are clean in that way. I'll file a ticket and see if I can do some broad categorization.
I filed #85700 and I'll do the rounds to try to get some traction on it. |
We've seen a number of tests that fail because they expect there to be zero indices or shards in the cluster, but instead there is actually a `.tasks` index. We were never able to reproduce these failures on demand, and our theory was that there's some random internal cluster task that happens from time to time, and that you had to win the unlucky test lottery for that task to happen at the right moment for one of these tests to catch it in the act and fail.

That theory does not hold water. We're not getting unlucky and having a random cluster task happen; rather, we're getting unlucky in the order in which the tests are executed. It is random because the order of the tests is random and the timing is (somewhat) random. In order to fail, these tests that care about the number of indices or shards need to follow a test that leaks a task, and it turns out that that test is `indices.forcemerge/10_basic/Force merge with wait_for_completion parameter`. See the following snippets from the full raw logs of the associated test failures: #83190 (comment), #83256 (comment), #84975 (comment), #85670 (comment).

My first commit puts one of these failing tests right next to the problematic task-leaking test so that we can reproduce the failure reliably -- I got about 10 failures in 100 runs. ~~My second commit adds a new arity of `waitForPendingTasks` and calls it from `cleanUpCluster`~~ (edit: slightly different direction now). With this commit in place, I was not able to reproduce the expected failure in 100 attempts. My last commit reverts the first commit.

Closes #83190
Closes #83256
Closes #84975
Closes #85670
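The abandoned `waitForPendingTasks`-in-`cleanUpCluster` direction amounts to polling the task list after each test until nothing matching a filter remains. A rough, self-contained Java sketch of that shape (the method name and signature are illustrative, not the actual `ESRestTestCase` API):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative polling loop: repeatedly list running task actions and
// return once no task matching the filter remains, or give up after a
// timeout. The real cluster-cleanup code talks to the _tasks API instead.
public class WaitForPendingTasksSketch {
    static boolean waitForPendingTasks(Supplier<List<String>> listTasks,
                                       Predicate<String> filter,
                                       long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (listTasks.get().stream().noneMatch(filter)) {
                return true; // nothing matching the filter is still running
            }
            Thread.sleep(50); // back off briefly before polling again
        }
        return false; // timed out with matching tasks still pending
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated cluster whose leaked force-merge task finishes after ~150ms.
        long start = System.currentTimeMillis();
        Supplier<List<String>> tasks = () ->
            System.currentTimeMillis() - start < 150
                ? List.of("indices:admin/forcemerge")
                : List.of();

        boolean drained = waitForPendingTasks(
            tasks,
            action -> action.startsWith("indices:admin/forcemerge"),
            5_000
        );
        System.out.println("pending tasks drained: " + drained);
    }
}
```

The filter is what made this approach risky in practice: it has to tolerate the cluster's own housekeeping tasks, which is why the merged change instead fixes the one leaking test.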