Skip to content

Tests that leak tasks #85700

@joegallo

Description

@joegallo

Followup to #85683

My original approach on that PR was this change:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index e9ee022f79c..d5bde29e8d1 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -401,6 +401,7 @@ public abstract class ESRestTestCase extends ESTestCase {
     public final void cleanUpCluster() throws Exception {
         if (preserveClusterUponCompletion() == false) {
             ensureNoInitializingShards();
+            waitForPendingTasks();
             wipeCluster();
             waitForClusterStateUpdatesToFinish();
             checkForUnexpectedlyRecreatedObjects();
@@ -439,6 +440,16 @@ public abstract class ESRestTestCase extends ESTestCase {
         return adminClient;
     }

+    /**
+     * Wait for outstanding tasks to complete. The default admin client is used to check the outstanding tasks and this is done using
+     * {@link ESTestCase#assertBusy(CheckedRunnable)} to give a chance to any outstanding tasks to complete.
+     *
+     * @throws Exception if an exception is thrown while checking the outstanding tasks
+     */
+    public static void waitForPendingTasks() throws Exception {
+        waitForPendingTasks(adminClient());
+    }
+
     /**
      * Wait for outstanding tasks to complete. The specified admin client is used to check the outstanding tasks and this is done using
      * {@link ESTestCase#assertBusy(CheckedRunnable)} to give a chance to any outstanding tasks to complete.

However, that resulted in tons of test failures, which means that there are quite a few tests out there that are leaking tasks (or something).

Here's a sample of the worst offenders:

joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep rollup_sensor | wc -l
     142
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep -a3 -m1 rollup_sensor
org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT > test {yaml=reference/rollup/apis/start-job/line_50} FAILED
    java.lang.AssertionError: 1 active tasks found:
    xpack/rollup/job[c]            jXhRlwO8SmSoysY8MJjyPw:1991 cluster:1                   persistent 1649093716219 17:35:16 32.8s       127.0.0.1 node-0 rollup_sensor
     expected:<0> but was:<1>
        at __randomizedtesting.SeedInfo.seed([ED1333AE1986C19E:65470C74B77AAC66]:0)
--
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep geoip-downloader | wc -l
      18
joegallo@galactic:~/Desktop $ cat failures\ from\ when\ task\ waiting\ was\ enabled.txt | grep -a3 "active tasks" | grep -a3 -m1 geoip-downloader

org.elasticsearch.ingest.geoip.IngestGeoIpClientYamlTestSuiteIT > test {yaml=ingest_geoip/20_geoip_processor/Test geoip processor with different database file - GeoLite2-Country} FAILED
    java.lang.AssertionError: 1 active tasks found:
    geoip-downloader[c]            GRiCtkwGShORy7Z0-RsGRQ:16  cluster:1                  persistent 1649093712826 17:35:12 42.5s       127.0.0.1 yamlRestTest-0 id=geoip-downloader
     expected:<0> but was:<1>
        at __randomizedtesting.SeedInfo.seed([ED1333AE1986C19E:65470C74B77AAC66]:0)
--

While we could ignore these two tasks in the waitForPendingTasks code, I think it would be more interesting to look at preventing these kinds of leaks from happening -- it seems to me that tests are written to expect exclusive access to the cluster while they're running, and having tasks running in the background violates that. For the most part it seems this doesn't cause a problem, in that we run many thousands of tests a day and they mostly seem fine, but this does indeed cause problems sometimes (e.g. all the test failures that were closed by #85683) and the kinds of problems it causes are hard to replicate and solve, so they're a pain.

/cc @nik9000

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions