Followup to #85683
My original approach on that PR was this change:
diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index e9ee022f79c..d5bde29e8d1 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -401,6 +401,7 @@ public abstract class ESRestTestCase extends ESTestCase {
public final void cleanUpCluster() throws Exception {
if (preserveClusterUponCompletion() == false) {
ensureNoInitializingShards();
+ waitForPendingTasks();
wipeCluster();
waitForClusterStateUpdatesToFinish();
checkForUnexpectedlyRecreatedObjects();
@@ -439,6 +440,16 @@ public abstract class ESRestTestCase extends ESTestCase {
return adminClient;
}
+ /**
+ * Wait for outstanding tasks to complete. The default admin client is used to check the outstanding tasks and this is done using
+ * {@link ESTestCase#assertBusy(CheckedRunnable)} to give a chance to any outstanding tasks to complete.
+ *
+ * @throws Exception if an exception is thrown while checking the outstanding tasks
+ */
+ public static void waitForPendingTasks() throws Exception {
+ waitForPendingTasks(adminClient());
+ }
+
/**
* Wait for outstanding tasks to complete. The specified admin client is used to check the outstanding tasks and this is done using
* {@link ESTestCase#assertBusy(CheckedRunnable)} to give a chance to any outstanding tasks to complete.
However, that resulted in tons of test failures, which means that there are quite a few tests out there that are leaking tasks (or something).
Here's a sample of the worst offenders:
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep rollup_sensor | wc -l
142
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep -a3 -m1 rollup_sensor
org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT > test {yaml=reference/rollup/apis/start-job/line_50} FAILED
java.lang.AssertionError: 1 active tasks found:
xpack/rollup/job[c] jXhRlwO8SmSoysY8MJjyPw:1991 cluster:1 persistent 1649093716219 17:35:16 32.8s 127.0.0.1 node-0 rollup_sensor
expected:<0> but was:<1>
at __randomizedtesting.SeedInfo.seed([ED1333AE1986C19E:65470C74B77AAC66]:0)
--
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep geoip-downloader | wc -l
18
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep -a3 -m1 geoip-downloader
org.elasticsearch.ingest.geoip.IngestGeoIpClientYamlTestSuiteIT > test {yaml=ingest_geoip/20_geoip_processor/Test geoip processor with different database file - GeoLite2-Country} FAILED
java.lang.AssertionError: 1 active tasks found:
geoip-downloader[c] GRiCtkwGShORy7Z0-RsGRQ:16 cluster:1 persistent 1649093712826 17:35:12 42.5s 127.0.0.1 yamlRestTest-0 id=geoip-downloader
expected:<0> but was:<1>
at __randomizedtesting.SeedInfo.seed([ED1333AE1986C19E:65470C74B77AAC66]:0)
--
While we could ignore these two tasks in the waitForPendingTasks code, I think it would be more interesting to look at preventing these kinds of leaks from happening -- it seems to me that tests are written to expect exclusive access to the cluster while they're running, and having tasks running in the background violates that. For the most part it seems this doesn't cause a problem, in that we run many thousands of tests a day and they mostly seem fine, but this does indeed cause problems sometimes (e.g. all the test failures that were closed by #85683) and the kinds of problems it causes are hard to replicate and solve, so they're a pain.
/cc @nik9000
Followup to #85683
My original approach on that PR was this change:
However, that resulted in tons of test failures, which means that there are quite a few tests out there that are leaking tasks (or something).
Here's a sample of the worst offenders:
While we could ignore these two tasks in the
waitForPendingTaskscode, I think it would be more interesting to look at preventing these kinds of leaks from happening -- it seems to me that tests are written to expect exclusive access to the cluster while they're running, and having tasks running in the background violates that. For the most part it seems this doesn't cause a problem, in that we run many thousands of tests a day and they mostly seem fine, but this does indeed cause problems sometimes (e.g. all the test failures that were closed by #85683) and the kinds of problems it causes are hard to replicate and solve, so they're a pain./cc @nik9000