TEST: Retry synced-flush if ongoing ops on primary#30978
TEST: Retry synced-flush if ongoing ops on primary#30978dnhatn merged 6 commits intoelastic:masterfrom
Conversation
When the last indexing operation is completed, we will fire a global checkpoint sync. Since a global checkpoint sync request is a replication request, it will acquire an index shard permit on the primary when executing. If this happens at the same time while we are issuing the synced-flush, the synced-flush request will fail as it thinks there are in-flight operations. We can avoid such situation by not issue the synced-flush until the global checkpoint on the primary is propagated to replicas.
|
Pinging @elastic/es-distributed |
|
I traced the list of index shard permit acquirers when issuing the synced-flush. |
|
run sample packaging tests |
|
I wonder if we still have a race condition - we periodically acquire a permit to do a background global checkpoint sync . Since synced flush is a best effort anyway (other stuff may fail in the future), maybe a better direction is to repeat it until successful? |
Unfortunately, this is not the case because we are testing a different kind of results: unchanged, partially successful, and entirely successful in these tests.
This is fired for every 30 seconds which I think is long enough for the synced-flush tests. How about disable the background global checkpoint sync in the synced-flush tests? |
|
Talked to Boaz, we decided to take a slightly different approach for this. We will retry another synced-flush if the current request fails due to ongoing operations on the primary. @bleskes and @jasontedor Can you please take another look? |
|
I should have provided the reasons for this change. The current approach might not cover these cases:
|
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); | ||
| if (listener.error != null) { | ||
| return; // stop here so that we can preserve the error |
There was a problem hiding this comment.
nit: just fold this into an if else?
| } | ||
| if (listener.result.failed()) { | ||
| // only retry if request failed due to ongoing operations on primary | ||
| assertThat(listener.result.failureReason(), not(containsString("ongoing operations on primary"))); |
There was a problem hiding this comment.
I personally favor something like:
if (listener.result != null && listener. result.failureReason != null &&
listener. result.failureReason.contains("ongoing operations on primary") {
throw new AsserionError(listener.reasult.failureReason); // cause the assert busy to retry
}
instead of these two ifs.
|
Thanks @jasontedor and @bleskes for reviewing! |
* elastic/master: [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient TEST: Retry synced-flush if ongoing ops on primary (elastic#30978) Fix docs build. Only auto-update license signature if all nodes ready (elastic#30859) Add BlobContainer.writeBlobAtomic() (elastic#30902) Add a doc value format to binary fields. (elastic#30860) Take into account the return value of TcpTransport.readMessageLength(...) in Netty4SizeHeaderFrameDecoder Move caching of the size of a directory to `StoreDirectory`. (elastic#30581) Clarify docs about boolean operator precedence. (elastic#30808) Docs: remove notes on sparsity. (elastic#30905) Fix MatchPhrasePrefixQueryBuilderTests#testPhraseOnFieldWithNoTerms run overflow forecast a 2nd time as regression test for elastic/ml-cpp#110 (elastic#30969) Improve documentation of dynamic mappings. (elastic#30952) Decouple MultiValueMode. (elastic#31075) Docs: Clarify constraints on scripted similarities. (elastic#31076)
* master: Removing erroneous repeat Adapt bwc versions after backporting #30983 to 6.4 [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient TEST: Retry synced-flush if ongoing ops on primary (#30978) Fix docs build. Only auto-update license signature if all nodes ready (#30859) Add BlobContainer.writeBlobAtomic() (#30902) Add a doc value format to binary fields. (#30860)
* ccr: [DOCS] Creates rest-api folder in docs [Rollup] Disallow index patterns that match the rollup index (elastic#30491) Replace exact numDocs by soft-del count in SegmentInfo (elastic#31086) Upgrade to Lucene-7.4.0-snapshot-0a7c3f462f (elastic#31073) Add cors support to NioHttpServerTransport (elastic#30827) [DOCS] Fixes security example (elastic#31082) Allow terms query in _rollup_search (elastic#30973) Removing erroneous repeat Adapt bwc versions after backporting elastic#30983 to 6.4 [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient TEST: Retry synced-flush if ongoing ops on primary (elastic#30978) Fix docs build. Only auto-update license signature if all nodes ready (elastic#30859) Add BlobContainer.writeBlobAtomic() (elastic#30902) Add a doc value format to binary fields. (elastic#30860)
* ccr: [DOCS] Creates rest-api folder in docs [Rollup] Disallow index patterns that match the rollup index (elastic#30491) Replace exact numDocs by soft-del count in SegmentInfo (elastic#31086) Upgrade to Lucene-7.4.0-snapshot-0a7c3f462f (elastic#31073) Add cors support to NioHttpServerTransport (elastic#30827) [DOCS] Fixes security example (elastic#31082) Allow terms query in _rollup_search (elastic#30973) Removing erroneous repeat Adapt bwc versions after backporting elastic#30983 to 6.4 [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient TEST: Retry synced-flush if ongoing ops on primary (elastic#30978) Fix docs build. Only auto-update license signature if all nodes ready (elastic#30859) Add BlobContainer.writeBlobAtomic() (elastic#30902) Add a doc value format to binary fields. (elastic#30860)
When the last indexing operation is completed, we will fire a global checkpoint sync. Since a global checkpoint sync request is a replication request, it will acquire an index shard permit on the primary when executing. If this happens at the same time while we are issuing the synced-flush, the synced-flush request will fail as it thinks there are in-flight operations. We can avoid such situation by retrying another synced-flush if the current request fails due to ongoing operations on the primary. Closes #29392
When the last indexing operation is completed, we will fire a global checkpoint sync. Since a global checkpoint sync request is a replication request, it will acquire an index shard permit on the primary when executing. If this happens at the same time while we are issuing the synced-flush, the synced-flush request will fail as it thinks there are in-flight operations. We can avoid such situation by retrying another synced-flush if the current request fails due to ongoing operations on the primary. Closes #29392
* 6.x: SQL: Fix build on Java 10 [Tests] Mutualize fixtures code in BaseHttpFixture (#31210) [TEST] Fix RemoteClusterClientTests#testEnsureWeReconnect [ML] Update test thresholds to account for changes to memory control (#31289) Reenable Checkstyle's unused import rule (#31270) [ML] Check licence when datafeeds use cross cluster search (#31247) Fix non-REST doc snippet [DOC] Extend SQL docs [DOCS] Shortens ML API intros Use quotes in the call invocation (#31249) move security ingest processors to a sub ingest directory (#31306) SQL: Whitelist SQL utility class for better scripting (#30681) Add 5.6.11 version constant. Fix version detection. [Docs] All Rollup docs experimental, agg limitations, clarify DeleteJob (#31299) Add missing release notes. Security: fix token bwc with pre 6.0.0-beta2 (#31254) Fix compilation error in UpdateSettingsIT (#31304) Test: Remove broken yml test feature (#31255) Add unreleased version 6.3.1 [Rollup] Metric config parser must use builder so validation runs (#31159) Removes experimental tag from scripted_metric aggregation (#31298) [DOCS] Removes coming tag from 6.3.0 release notes 6.3 release notes. Add notion of internal index settings (#31286) REST high-level client: add Cluster Health API (#29331) Remove leftover usage of deprecated client API SyncedFlushResponse to implement ToXContentObject (#31155) Add Get Aliases API to the high-level REST client (#28799) HLRest: Add get index templates API (#31161) Log warnings when cluster state publication failed to some nodes (#31233) Fix AntFixture waiting condition (#31272) [TEST] Mute RecoveryIT.testHistoryUUIDIsGenerated Ignore numeric shard count if waiting for ALL (#31265) Update checkstyle to 8.10.1 (#31269) Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202) Revert upgrade to Netty 4.1.25.Final (#31282) Use armored input stream for reading public key (#31229) [DOCS] Added 'fail_on_unsupported_field' param to MLT. Closes #28008 (#31160) Fix Netty 4 Server Transport tests. Again. [DOCS] Fixed typo. [DOCS] Added release highlights for 6.3 (#31256) [DOCS] Mark SQL feature as experimental [DOCS] Updates machine learning custom URL screenshots (#31222) Fix naming conventions check for XPackTestCase Fix security Netty 4 transport tests Fix race in clear scroll (#31259) [DOCS] Clarify audit index settings when remote indexing (#30923) [ML][TEST] Mute tests using rules (#31204) Support RequestedAuthnContext (#31238) Validate xContentType in PutWatchRequest. (#31088) [INGEST] Interrupt the current thread if evaluation grok expressions take too long (#31024) Upgrade to Netty 4.1.25.Final (#31232) Suppress extras FS on caching directory tests Revert "[DOCS] Added 6.3 info & updated the upgrade table. (#30940)" Revert "Fix snippets in upgrade docs" Fix snippets in upgrade docs [DOCS] Added 6.3 info & updated the upgrade table. (#30940) Enable custom credentials for core REST tests (#31235) Move ESIndexLevelReplicationTestCase to test framework (#31243) Encapsulate Translog in Engine (#31220) [DOCS] Adds machine learning 6.3.0 release notes (#31217) Remove all unused imports and fix CRLF (#31207) [TEST] Fix testRecoveryAfterPrimaryPromotion [Docs] Remove mention pattern files in Grok processor (#31170) Use stronger write-once semantics for Azure repository (#30437) Don't swallow exceptions on replication (#31179) Compliant SAML Response destination check (#31175) Move java version checker back to its own jar (#30708) TEST: Retry synced-flush if ongoing ops on primary (#30978) [test] add fix for rare virtualbox error (#31212)
When the last indexing operation is completed, we will fire a global
checkpoint sync. Since a global checkpoint sync request is a
replication request, it will acquire an index shard permit on the
primary when executing. If this happens at the same time while we are
issuing the synced-flush, the synced-flush request will fail as it
thinks there are in-flight operations. We can avoid such situation by
retrying another synced-flush if the current request fails due to
ongoing operations on the primary.
Closes #29392