Skip to content

[BUG] Getting DeadlineTimeoutException error during security tests after upgrading to 3.0 #580

@qreshi

Description

@qreshi

What is the bug?
With the upgrade to 3.0 in #556, the dependency on org.apache.httpcomponents:httpclient and org.apache.httpcomponents:httpcore changed to org.apache.httpcomponents.client5:httpclient5 and org.apache.httpcomponents.core5:httpcore5 respectively (these are coming from OpenSearch core).

After this upgrade we're starting to see some of the tests fail during the security test CI workflow with the following error:

REPRODUCE WITH: ./gradlew ':notifications:integTest' --tests "org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission" -Dtests.seed=C977BA49C8625070 -Dtests.security.manager=false -Dtests.locale=be -Dtests.timezone=Asia/Yakutsk -Druntime.java=17

org.opensearch.integtest.SecurityNotificationIT > test get slack notification config without get Notification permission FAILED
    java.lang.RuntimeException: error while performing request
        at __randomizedtesting.SeedInfo.seed([C977BA49C8625070:44430BFE7A23F02]:0)
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:1045)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:341)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:329)
        at org.opensearch.integtest.PluginRestTestCase.createUserRolesMapping(PluginRestTestCase.kt:185)
        at org.opensearch.integtest.PluginRestTestCase.createUserWithCustomRole(PluginRestTestCase.kt:256)
        at org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission(SecurityNotificationIT.kt:304)

        Caused by:
        org.apache.hc.core5.util.DeadlineTimeoutException: Deadline: 292278994-08-17T16:12:55.807+0900, 9223370365877384600 MILLISECONDS overdue
            at app//org.apache.hc.core5.util.DeadlineTimeoutException.from(DeadlineTimeoutException.java:49)
            at app//org.apache.hc.core5.pool.StrictConnPool.lease(StrictConnPool.java:217)
            at app//org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$1.<init>(PoolingAsyncClientConnectionManager.java:231)
            at app//org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager.lease(PoolingAsyncClientConnectionManager.java:227)
            at app//org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.acquireEndpoint(InternalHttpAsyncExecRuntime.java:100)
            at app//org.apache.hc.client5.http.impl.async.AsyncConnectExec.execute(AsyncConnectExec.java:135)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement$1.proceed(AsyncExecChainElement.java:62)
            at app//org.apache.hc.client5.http.impl.async.AsyncProtocolExec.internalExecute(AsyncProtocolExec.java:198)
            at app//org.apache.hc.client5.http.impl.async.AsyncProtocolExec.execute(AsyncProtocolExec.java:158)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement$1.proceed(AsyncExecChainElement.java:62)
            at app//org.apache.hc.client5.http.impl.async.AsyncRedirectExec.internalExecute(AsyncRedirectExec.java:112)
            at app//org.apache.hc.client5.http.impl.async.AsyncRedirectExec.execute(AsyncRedirectExec.java:278)
            at app//org.apache.hc.client5.http.impl.async.AsyncExecChainElement.execute(AsyncExecChainElement.java:54)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient.executeImmediate(InternalAbstractHttpAsyncClient.java:368)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$1.sendRequest(InternalAbstractHttpAsyncClient.java:224)
            at app//org.apache.hc.core5.http.nio.support.BasicRequestProducer.sendRequest(BasicRequestProducer.java:93)
            at app//org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient.doExecute(InternalAbstractHttpAsyncClient.java:180)
            at app//org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:97)
            at app//org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient.execute(CloseableHttpAsyncClient.java:107)
            at app//org.opensearch.client.RestClient.performRequest(RestClient.java:337)
            ... 4 more

Not all of the tests failed in this example workflow that is being referenced:

75 tests completed, 6 failed
Tests with failures:
 - org.opensearch.integtest.SecurityNotificationIT.test delete slack notification config with user that has get Notification permission
 - org.opensearch.integtest.SecurityNotificationIT.test getChannelList should return only channels with get channel permission
 - org.opensearch.integtest.SecurityNotificationIT.test get slack notification config without get Notification permission
 - org.opensearch.integtest.SecurityNotificationIT.test get slack notification config with user that has get Notification permission
 - org.opensearch.integtest.config.DeleteNotificationConfigIT.test Delete single absent notification config should fail
 - org.opensearch.integtest.config.EmailNotificationConfigCrudIT.test Create email notification config without email_group IDs

Given the stacktrace, it seems that the client used to make the calls to the cluster during the integration tests needed to obtain a lease from a connection pool and that lease either took too long or could not be obtained. We should check with the OpenSearch core team if the changes to these httpcomponents related dependencies has changed anything with the connection pool resourcing/timeouts leading to these occasional errors. Strangely enough, this error was only seen during the security test runs in the CI. The regular integration tests did not run into this problem. Either it's flaky or the conditions of the Docker security tests make it more likely to happen.

How can one reproduce the bug?
Steps to reproduce the behavior:
Run the security integration test CI in a PR or locally run the security tests against a Docker image running OpenSearch 3.0 with security

What is the expected behavior?
The client is able to get the lease in time and perform the operation during the tests.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions