Skip to content

Not propagating TimeoutException from Retry2::awaitClose#92773

Merged
masseyke merged 2 commits intoelastic:mainfrom
masseyke:fix/Retry2-handle-TimeoutException
Jan 9, 2023
Merged

Not propagating TimeoutException from Retry2::awaitClose#92773
masseyke merged 2 commits intoelastic:mainfrom
masseyke:fix/Retry2-handle-TimeoutException

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Jan 9, 2023

The Retry2::awaitClose waits for the given amount of time for all requests to complete. If they don't, it currently throws a TimeoutException (through the call to inFlightRequestsPhaser.awaitAdvanceInterruptibly()). This is probably excessive, and could cause problems upstream since callers aren't expecting it and might not finish closing other resources. This PR changes Retry2::awaitClose so that it catches that TimeoutException and logs a message instead.

@masseyke masseyke added >non-issue :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v8.7.0 labels Jan 9, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Jan 9, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Copy Markdown
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@masseyke masseyke merged commit 5cfe4f4 into elastic:main Jan 9, 2023
@masseyke masseyke deleted the fix/Retry2-handle-TimeoutException branch January 9, 2023 23:27
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Mar 2, 2023
Logging a message rather than propagating a TimeoutException from Retry2::awaitClose
elasticsearchmachine pushed a commit that referenced this pull request Mar 2, 2023
* backporting 91238 and 86184

* increasing test timeouts (#92771)

BulkProcessor2IT can occasionally fail with timeouts like this:

```
java.util.concurrent.TimeoutException: (No message provided)

  at __randomizedtesting.SeedInfo.seed([164F04355E8E8724:9D44A00946BBB3F3]:0)
  at java.util.concurrent.Phaser.awaitAdvanceInterruptibly(Phaser.java:795)
  at org.elasticsearch.action.bulk.Retry2.awaitClose(Retry2.java:129)
  at org.elasticsearch.action.bulk.BulkProcessor2.awaitClose(BulkProcessor2.java:254)
  at org.elasticsearch.action.bulk.BulkProcessor2IT.testBulkProcessor2ConcurrentRequestsReadOnlyIndex(BulkProcessor2IT.java:197)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)
```

It looks like we're just cutting it a little too closely using a
1-second timeout to wait for all requests to complete. This PR bumps
that timeout to 5 seconds. In the previous version of this test
(BulkProcessorIT) the code did not actually wait for all requests to
complete, which explains why this behavior is new. Closes #92770

* fixing build problems

* reverting accidental change

* fixing build problems

* fixing a unit test

* fixing tests

* fixing tests

* Not propagating TimeoutException from Retry2::awaitClose (#92773)

Logging a message rather than propagating a TimeoutException from Retry2::awaitClose

---------

Co-authored-by: Joe Gallo <joe.gallo@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. >non-issue Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.7.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants