Skip to content

Avoiding BulkProcessor deadlock in ILMHistoryStore (#91238)#93424

Merged
elasticsearchmachine merged 10 commits intoelastic:7.17from
masseyke:backport/7.17/pr-91238
Mar 2, 2023
Merged

Avoiding BulkProcessor deadlock in ILMHistoryStore (#91238)#93424
elasticsearchmachine merged 10 commits intoelastic:7.17from
masseyke:backport/7.17/pr-91238

Conversation

@masseyke
Copy link
Copy Markdown
Member

@masseyke masseyke commented Feb 1, 2023

This backports #91238, #86184, #92771, and #92773 to 7.17

BulkProcessor2IT can occasionally fail with timeouts like this:

```
java.util.concurrent.TimeoutException: (No message provided)

  at __randomizedtesting.SeedInfo.seed([164F04355E8E8724:9D44A00946BBB3F3]:0)
  at java.util.concurrent.Phaser.awaitAdvanceInterruptibly(Phaser.java:795)
  at org.elasticsearch.action.bulk.Retry2.awaitClose(Retry2.java:129)
  at org.elasticsearch.action.bulk.BulkProcessor2.awaitClose(BulkProcessor2.java:254)
  at org.elasticsearch.action.bulk.BulkProcessor2IT.testBulkProcessor2ConcurrentRequestsReadOnlyIndex(BulkProcessor2IT.java:197)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)
```

It looks like we're just cutting it a little too closely using a
1-second timeout to wait for all requests to complete. This PR bumps
that timeout to 5 seconds. In the previous version of this test
(BulkProcessorIT) the code did not actually wait for all requests to
complete, which explains why this behavior is new. Closes elastic#92770
@masseyke masseyke added :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. backport labels Feb 1, 2023
@masseyke masseyke changed the title Backporting 91238, 86184, and 92771 to 7.17 Avoiding BulkProcessor deadlock in ILMHistoryStore (#91238) Feb 1, 2023
@masseyke masseyke marked this pull request as ready for review February 7, 2023 19:25
@masseyke masseyke requested review from dakrone and joegallo February 10, 2023 14:47
joegallo and others added 2 commits March 2, 2023 12:47
Logging a message rather than propagating a TimeoutException from Retry2::awaitClose
Copy link
Copy Markdown
Contributor

@joegallo joegallo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masseyke your patience matches your attention to detail. Great work!

For the record, we did a comparison locally of the files touched by this PR as compared to the same files on master and reviewed the diff between the two a hunk at a time.

@masseyke masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 2, 2023
@elasticsearchmachine elasticsearchmachine merged commit bc25aeb into elastic:7.17 Mar 2, 2023
@masseyke masseyke deleted the backport/7.17/pr-91238 branch March 2, 2023 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport :Data Management/ILM+SLM DO NOT USE. Use ":StorageEngine/ILM" or ":Distributed Coordination/SLM" instead. v7.17.10

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants