Add integ test for simulating node join left event when data node clu…#19907
Conversation
|
❌ Gradle check result for 9d64a60: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
9d64a60 to
8994161
Compare
|
❌ Gradle check result for 8994161: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Failure due to org.opensearch.arrow.flight.stats.FlightMetricsTests.testComprehensiveMetrics Existing issue: #18947 |
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
8994161 to
49169d2
Compare
📝 WalkthroughWalkthroughAdds a test-only index-store listener plugin, a gating annotation, two integration tests that simulate cluster-state publication lag (including long-running applier-thread delays), small test-framework log-capture APIs, and an Unreleased changelog entry. No production public API changes. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CM as ClusterManager
participant Applier as CM_ApplierThread
participant Store as TestIndexStoreListener
participant Data as DataNode
CM->>Data: Publish cluster state (node join/leave)
CM->>Applier: Schedule apply of cluster state
Applier->>Store: Invoke IndexStoreListener callback (e.g., shard deletion)
Store-->>Applier: Introduce delay/block (simulate lag)
Applier-->>CM: Applier completion delayed
CM->>Data: Publication may timeout / join may fail (logged)
Note over CM,Data: After delay removed, cluster recovers and shards migrate
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java (1)
39-44: Consider clarifying the deduplication logic.The reuse of
shouldCaptureMessagefor pattern deduplication works but is confusing, as that method was designed to check if logs match patterns (used inappend), not to deduplicate patterns. This leads to asymmetric behavior: ifmessagesToCapturecontains "foo", adding "foobar" is blocked, but if it contains "foobar", "foo" can still be added.Consider either documenting this behavior explicitly or extracting a separate
isDuplicatePatternmethod for clarity.server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
128-133: Consider using List.of() for conciseness.The instance initializer block works but is more verbose than necessary.
Apply this diff to simplify:
- List<String> messagesToCapture = new ArrayList<String>() { - { - add("failed to join"); - add("IllegalStateException"); - } - }; + List<String> messagesToCapture = new ArrayList<>(List.of( + "failed to join", + "IllegalStateException" + ));
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(8 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: Analyze (java)
🔇 Additional comments (3)
test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java (1)
35-37: LGTM!The batch addition method appropriately delegates to the single-message method.
CHANGELOG.md (1)
36-37: LGTM!The changelog entry accurately describes the new integration test and is properly formatted.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
472-477: LGTM!The plugin implementation correctly provides the test listener.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
|
❌ Gradle check result for 49169d2: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
49169d2 to
3297c7c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
161-168: Duplicate setting forCLUSTER_NODE_RECONNECT_INTERVAL_SETTING
nodeSettingscurrently setsNodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTINGtwice with different values:.put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "10s") ... .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "100ms")The second call overrides the first, which is misleading for readers.
Consider removing the unused
"10s"entry or consolidating to a single put with the intended value.
🧹 Nitpick comments (5)
CHANGELOG.md (1)
39-39: Clarify changelog wording for readabilityThe sentence is a bit awkward; consider rephrasing to something like:
- Add integ test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy + Add integ test to simulate node join/left events when data node cluster state publication lags because the cluster applier thread is busyserver/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (4)
353-406: Static delay flag handling is correct but contains a redundant resetThe pattern of enabling
TestIndexStoreListener.DELAY_SHARD_ASSIGNMENTin atryblock and clearing it infinallyis good and protects against leakage across tests. The explicit assignment back tofalseat Line 390 is redundant because thefinallyblock already handles cleanup.You can safely drop the in-body reset and rely on
finallyto reduce duplication.
472-495: Minor logging clarity in error handlerIn the error callback you currently log:
(source, e) -> logger.error(() -> new ParameterizedMessage("{} unexpected error in listener wait", e), e)Using
eas the placeholder argument makes the formatted message slightly odd and also ignores thesourceparameter. A more typical pattern would incorporatesource:- }, (source, e) -> logger.error(() -> new ParameterizedMessage("{} unexpected error in listener wait", e), e)); + }, (source, e) -> logger.error( + () -> new ParameterizedMessage("{} unexpected error in listener wait", source), + e + ));This way the message clearly indicates which operation/listener failed while still logging the exception.
374-375: Keep delay-related constants and log expectations in syncYou introduced
SHARD_DELETE_DELAY_SECONDS = 30inTestIndexStoreListener, and both tests wait for a"Sleeping for 30 seconds"log with a 30-second timeout. The pieces line up now, but they are partially duplicated:
- Constant:
SHARD_DELETE_DELAY_SECONDS- Log content and
waitForLogtimeout values:30To avoid future drift if the delay is tuned, consider:
- Using
SHARD_DELETE_DELAY_SECONDS(or a derived millis value) to drive both the log text and thewaitForLogtimeout; and/or- Centralizing the
"Sleeping for {} seconds"message format so tests don't need to hard-code"30".This keeps the delay behavior and assertions consistent if you later adjust the sleep duration.
Also applies to: 433-434, 608-611, 624-635
503-544: Cluster settings scope is acceptable withScope.TEST, but worth documenting
setupClusterForPublicationLagTestsets transient cluster settings (follower lag timeout, publish timeout, concurrent recoveries) without explicitly restoring them. Given the class is annotated with@ClusterScope(scope = Scope.TEST, numDataNodes = 0), each test gets its own cluster, so these settings won't leak to other tests.It may still be helpful to add a short comment noting that the
Scope.TESTcluster lifecycle isolates these settings, so future refactors (e.g., changing scope) don't accidentally introduce cross-test coupling.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(8 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: Mend Security Check
🔇 Additional comments (3)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java (1)
16-21: Annotation definition looks appropriateMarker annotation scoped to methods with RUNTIME retention is a good fit for driving conditional plugin loading in tests. No changes needed.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (2)
109-123: Annotation-based plugin gating is a solid improvementUsing
@RequiresTestIndexStoreListenerandrequiresTestIndexStoreListener()to decide whether to includeTestIndexStoreListenerPluginmakes the test setup explicit and less fragile than string-based test-name checks. The reflective lookup with a safe fallback is reasonable for this usage.Also applies to: 125-132
599-641: IndexStoreListener plugin and delay behavior look correctThe test-only
TestIndexStoreListenerPluginandTestIndexStoreListenerimplementation correctly:
- Registers via the
IndexStorePluginSPI.- Uses a
volatilestatic flag to toggle delay.- Applies a bounded, logged sleep around shard-path deletion.
- Restores interrupt status when interrupted.
This is a clean way to simulate slow shard cleanup on the applier thread; no changes needed.
|
❌ Gradle check result for 3297c7c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
db09f4f to
8270f3a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
375-378: Variable shadowing: localnodeSettingsshadows instance field.The local variable
nodeSettingsat line 376 shadows the instance field declared at line 104. This could lead to maintenance confusion.for (String nodeName : internalCluster().getNodeNames()) { - Settings nodeSettings = internalCluster().getInstance(Settings.class, nodeName); - if ("red".equals(nodeSettings.get("node.attr.color")) && nodeSettings.getAsBoolean("node.data", true)) { + Settings instanceSettings = internalCluster().getInstance(Settings.class, nodeName); + if ("red".equals(instanceSettings.get("node.attr.color")) && instanceSettings.getAsBoolean("node.data", true)) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(9 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java
- CHANGELOG.md
- test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: gradle-check
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, macos-15)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: Mend Security Check
🔇 Additional comments (5)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (5)
107-130: LGTM on annotation-based plugin loading.The reflection-based approach for detecting
@RequiresTestIndexStoreListeneris more robust than the previous string-based method name matching. The fallback tofalseonNoSuchMethodExceptionhandles edge cases gracefully.
351-368: LGTM on static flag cleanup.The try-finally pattern ensures
DELAY_SHARD_ASSIGNMENTis reset even if the test fails mid-execution, preventing test pollution.
415-422: Transient settings are isolated by test scope.The previous review flagged potential setting leakage, but since
@ClusterScope(scope = Scope.TEST)creates a fresh cluster per test method, transient settings won't persist across tests. The current approach is acceptable, though explicit reset in a finally block would be more defensive for future refactoring.
564-601: LGTM on TestIndexStoreListener implementation.The implementation correctly addresses previous review feedback:
- Uses a named constant
SHARD_DELETE_DELAY_SECONDSinstead of magic number- Properly restores interrupt status with
Thread.currentThread().interrupt()(line 593)volatilemodifier onDELAY_SHARD_ASSIGNMENTensures visibility across threads
440-490: Validation logic is comprehensive.The validation methods properly verify:
- Node drop occurrence via cluster health and specific log patterns
- Cluster recovery with correct node count and shard activation
- Shard migration to red nodes using routing table inspection
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
|
❌ Gradle check result for 8270f3a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
8270f3a to
1ec8f19
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
158-167: DuplicateCLUSTER_NODE_RECONNECT_INTERVAL_SETTINGkey innodeSettings.The builder sets
NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTINGtwice:.put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "10s") ... .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "100ms")The second call silently overwrites the first, so the effective interval is
"100ms"and the"10s"value is dead configuration.Please remove one of them (and keep only the intended value) to avoid confusion:
- .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "10s") .put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "200ms") .put(FollowersChecker.FOLLOWER_CHECK_INTERVAL_SETTING.getKey(), "100ms") .put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "100ms")(or vice versa, if
"10s"was intended).
🧹 Nitpick comments (4)
test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java (1)
18-44: Capture API semantics are correct but a bit subtle.
addMessageToCaptureonly appends whenshouldCaptureMessage(message)is false, effectively treatingmessagesToCaptureas a set of substring patterns and avoiding duplicates. That works for the current usage (adding distinct patterns like"Sleeping for 30 seconds"and"reason: lagging"), but it’s not immediately obvious from the code.If you want to make this clearer and a bit more robust:
- Consider explicitly checking for duplicates via
messagesToCapture.contains(message)instead of reusingshouldCaptureMessage, which is conceptually “does this log contain any capture pattern?” rather than “is this pattern already present?”.- Alternatively, change
messagesToCaptureto aSet<String>(if ordering doesn’t matter) to encode uniqueness directly.Not a blocker, but this would make the behavior easier to reason about for future readers.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (3)
140-151: SimplifymessagesToCaptureinitialization to avoid double-brace idiom.The anonymous
ArrayList<String>() { { ... } }pattern works but creates an extra anonymous class and is harder to read.You can express the same thing more simply and keep the list mutable:
- List<String> messagesToCapture = new ArrayList<String>() { - { - add("failed to join"); - add("IllegalStateException"); - } - }; + List<String> messagesToCapture = new ArrayList<>(); + messagesToCapture.add("failed to join"); + messagesToCapture.add("IllegalStateException");This keeps compatibility with the later
addMessagesToCapturecalls while making the setup more straightforward.
440-467: Log assertions are strong; consider test runtime if logs don’t appear.
validateNodeDropDueToPublicationLag()thoroughly checks for:
- The cluster dropping below 9 nodes,
- The 30s delay log,
- The node-removal task log with
NodeRemovalClusterStateTaskExecutorandreason: lagging, and- The join failures.
Each
waitForLoguses up to 30 seconds, and combined with the health wait of up to 120 seconds, the test could run for a while if something goes wrong. That’s acceptable for a heavy integ test, but if you ever see CI timeouts here, reducing some of these per-log timeouts (or sharing a common deadline) could help.No change required now; just something to keep in mind if you tune test runtimes later.
492-502: Reuse shared delay constant for the 30s applier-thread sleep.
createDelayListenerhardcodes30 * 1000whileTestIndexStoreListeneralready definesSHARD_DELETE_DELAY_SECONDS = 30. For consistency and easier future tuning, consider reusing a shared constant (or at least a named constant in this class) instead of the literal:- private ClusterStateListener createDelayListener(ClusterApplierService applierService) { - return event -> applierService.runOnApplierThread("NodeJoinLeftIT", clusterState -> { - try { - logger.info("Sleeping for 30 seconds"); - Thread.sleep(30 * 1000); + private static final int APPLIER_DELAY_SECONDS = 30; + + private ClusterStateListener createDelayListener(ClusterApplierService applierService) { + return event -> applierService.runOnApplierThread("NodeJoinLeftIT", clusterState -> { + try { + logger.info("Sleeping for {} seconds", APPLIER_DELAY_SECONDS); + Thread.sleep(APPLIER_DELAY_SECONDS * 1000L); } catch (InterruptedException e) { logger.info("Interrupted while waiting for cluster state applier"); Thread.currentThread().interrupt(); } }, (source, e) -> logger.error(() -> new ParameterizedMessage("{} unexpected error in listener wait", source), e)); }Not functionally required, but it avoids magic numbers and keeps the delay configuration in one place.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(9 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGELOG.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: gradle-check
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, macos-15)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: detect-breaking-change
- GitHub Check: Mend Security Check
🔇 Additional comments (6)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java (1)
16-21: Annotation design is appropriate and minimal.Runtime retention and METHOD target are a good fit for gating the test listener; no issues here.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (5)
101-121: Annotation-based plugin gating looks good; reflection approach is reasonable.Using
RequiresTestIndexStoreListenerplusrequiresTestIndexStoreListener()to decide whether to addTestIndexStoreListenerPluginis a clear improvement over string-based test-name matching and keeps the plugin test-scoped.One small consideration:
getClass().getMethod(getTestName())assumes the test name corresponds exactly to a no-arg method on this class. That’s true for the current JUnit-style tests; if this class ever starts using parameterized or dynamically named tests, this logic may need revisiting. For now, it’s acceptable.
351-368: Static delay flag management is correct and safely cleaned up.The combination of:
@RequiresTestIndexStoreListenerto ensure the plugin is present for this test, andTestIndexStoreListener.DELAY_SHARD_ASSIGNMENTguarded by atry/finallyensures the static flag is always reset even if assertions fail. This addresses the earlier concern about static mutable state leaking between tests.
No changes needed here.
396-438: Lag-setup helper and log-capture wiring look coherent.
additionalSetupForLagDuringDataMigration()cleanly centralizes:
- Cluster topology expansion (extra cluster-manager-only and data-only nodes),
- Index creation, and
- Transient settings for follower lag, publish timeout, and concurrent recoveries.
Coupling that with:
testLogsAppender.addMessagesToCapture(Set.of( "Sleeping for 30 seconds", "NodeRemovalClusterStateTaskExecutor", "reason: lagging" )); testLogsAppender.clearCapturedLogs();sets up exactly the messages asserted in
validateNodeDropDueToPublicationLag()while starting from an empty captured-log buffer. Overall this helper is well-structured.Given
@ClusterScope(scope = Scope.TEST), the transient settings are naturally scoped to each test’s cluster instance, so explicit reset isn’t strictly necessary here.
469-490: Cluster recovery validation logic is clear and appropriate.
validateClusterRecovery():
- Waits for GREEN with all 9 nodes present, with a 60s timeout,
- Asserts no timeout,
- Confirms the expected shard count (3), and
- Verifies all shards are allocated to nodes with
color == "red".This neatly checks both cluster health and data placement after lag removal. Looks good.
558-602: Test-only IndexStorePlugin and listener are well-scoped and aligned with tests.The nested
TestIndexStoreListenerPlugin/TestIndexStoreListenersetup:
- Properly implements
IndexStorePlugin#getIndexStoreListenerusingOptional,- Uses a static, volatile flag
DELAY_SHARD_ASSIGNMENTgated from the test to control behavior,- Introduces a named
SHARD_DELETE_DELAY_SECONDSconstant, and- Logs both before and after the artificial delay, with correct interrupt handling.
Combined with the annotation-based plugin gating and the try/finally around
DELAY_SHARD_ASSIGNMENT, this is a solid, test-scoped mechanism for simulating slow shard cleanup on the applier thread.No changes needed here.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
|
❌ Gradle check result for 1ec8f19: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
1ec8f19 to
e7d4f6f
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
417-424: Ensure transient cluster settings are cleaned up.The transient settings (follower lag timeout, publish timeout, concurrent recoveries) are set here but not explicitly cleaned up. While the test framework may reset these between test classes, explicit cleanup is safer to prevent test pollution.
Consider adding a cleanup method or capturing and restoring the original values:
+ private Settings originalTransientSettings; + private void additionalSetupForLagDuringDataMigration() { + // Capture current settings + originalTransientSettings = internalCluster().client().admin().cluster().prepareState().get() + .getState().metadata().transientSettings(); + // ... existing setup code ...Then in
tearDown():@After public void tearDown() throws Exception { + if (originalTransientSettings != null) { + ClusterUpdateSettingsRequest resetRequest = new ClusterUpdateSettingsRequest(); + resetRequest.transientSettings(Settings.builder().put(originalTransientSettings).build()); + internalCluster().client().admin().cluster().updateSettings(resetRequest).actionGet(); + } testLogsAppender.clearCapturedLogs(); // ... existing teardown code ...
🧹 Nitpick comments (2)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (2)
145-150: Consider using Arrays.asList or List.of for initialization.The double-brace initialization creates an anonymous inner class, which adds slight overhead. For cleaner code, consider:
- List<String> messagesToCapture = new ArrayList<String>() { - { - add("failed to join"); - add("IllegalStateException"); - } - }; + List<String> messagesToCapture = new ArrayList<>(Arrays.asList( + "failed to join", + "IllegalStateException" + ));
494-504: Consider extracting the delay value to a constant.The hardcoded
30 * 1000on line 498 should be extracted to a named constant for consistency withTestIndexStoreListener.SHARD_DELETE_DELAY_SECONDSand better maintainability.+ private static final int APPLIER_DELAY_SECONDS = 30; + private ClusterStateListener createDelayListener(ClusterApplierService applierService) { return event -> applierService.runOnApplierThread("NodeJoinLeftIT", clusterState -> { try { - logger.info("Sleeping for 30 seconds"); - Thread.sleep(30 * 1000); + logger.info("Sleeping for {} seconds", APPLIER_DELAY_SECONDS); + Thread.sleep(APPLIER_DELAY_SECONDS * 1000); } catch (InterruptedException e) { logger.info("Interrupted while waiting for cluster state applier"); Thread.currentThread().interrupt(); } }, (source, e) -> logger.error(() -> new ParameterizedMessage("{} unexpected error in listener wait", source), e)); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(9 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- CHANGELOG.md
- test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java (1)
FollowersChecker(84-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: gradle-check
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: Analyze (java)
- GitHub Check: Mend Security Check
🔇 Additional comments (7)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java (1)
1-22: LGTM!The annotation definition is clean and follows standard patterns for runtime marker annotations. The javadoc clearly explains its purpose.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (6)
108-130: LGTM!The annotation-based plugin loading mechanism is well-implemented. Using reflection to detect the
@RequiresTestIndexStoreListenerannotation at runtime is appropriate for test infrastructure and addresses the previous concern about fragile string-based matching.
351-368: LGTM!The try-finally pattern ensures
DELAY_SHARD_ASSIGNMENTis always reset, preventing test pollution. The explicit reset on line 363 before the finally block appears redundant but is acceptable for clarity.
370-396: LGTM!The try-finally pattern ensures cluster state listeners are always removed, even if validation fails. This prevents the 30-second delays from leaking into subsequent tests and addresses the previous concern about resource cleanup.
442-469: LGTM!The validation logic comprehensively checks for expected cluster behavior: node drops, lag-induced delays, and join failures. The timeout values are reasonable for integration tests.
471-492: LGTM!The recovery validation ensures the cluster returns to a healthy state and all shards are migrated to red nodes as expected. The assertions provide good coverage of the recovery scenario.
560-604: LGTM!The test plugin and listener implementations are well-structured:
- Correct use of
volatilefor thread-safe flag access- Named constant for delay duration improves maintainability
- Proper interrupt status restoration in the exception handler
- Clean integration with OpenSearch's
IndexStorePlugininterface
|
❌ Gradle check result for e7d4f6f: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
e7d4f6f to
fa2a531
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
417-424: Transient cluster settings cleanup.While
@ClusterScope(scope = Scope.TEST)should recreate the cluster per test method (mitigating leakage risk), consider explicitly resetting these transient settings in thefinallyblock of the calling test methods or intearDown()for defensive coding.The cluster scope is
Scope.TEST, which recreates the cluster per test. Please confirm this guarantees settings isolation, or add explicit cleanup as a safeguard.
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
145-150: Prefernew ArrayList<>(List.of(...))over double-brace initialization.Double-brace initialization creates an anonymous inner class that holds a reference to the enclosing instance, which can cause memory leaks and increases bytecode size.
🔎 Suggested fix:
- List<String> messagesToCapture = new ArrayList<String>() { - { - add("failed to join"); - add("IllegalStateException"); - } - }; + List<String> messagesToCapture = new ArrayList<>(List.of( + "failed to join", + "IllegalStateException" + ));
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(9 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- CHANGELOG.md
- test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java (1)
FollowersChecker(84-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: Mend Security Check
🔇 Additional comments (8)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java (1)
1-22: LGTM!The annotation is properly defined with
RUNTIMEretention (required for reflection-based detection inrequiresTestIndexStoreListener()) andMETHODtarget. Clean and minimal implementation.server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (7)
107-130: LGTM!The annotation-based approach for conditional plugin registration is well-implemented. The reflection logic correctly retrieves the test method and checks for
@RequiresTestIndexStoreListener. ReturningfalseonNoSuchMethodExceptionis a safe default.
351-368: LGTM!The
try-finallyblock ensuresDELAY_SHARD_ASSIGNMENTis reset even if the test fails. The annotation-based gating with@RequiresTestIndexStoreListeneris properly applied.
370-396: LGTM!The
try-finallyblock ensures listeners are always removed, preventing pollution of subsequent tests. ThevalidateClusterRecovery()call after cleanup is appropriately placed.
442-469: LGTM!The validation method comprehensively verifies:
- Node drop occurred (cluster size < 9)
- Delay in shard cleanup was triggered (log verification)
- Node removal due to publication lag (log verification)
- Join request failures with expected exception
471-492: LGTM!The recovery validation properly verifies:
- Cluster returns to green status with all 9 nodes
- All 3 shards are active (index-1: 1P+1R, test: 1P)
- All shards migrated to red nodes
494-504: LGTM!The delay listener correctly:
- Restores interrupt status with
Thread.currentThread().interrupt()- Uses
source(note) in theParameterizedMessage- Schedules work on the applier thread to simulate blocking
560-604: LGTM!The plugin and listener implementation:
- Uses a named constant
SHARD_DELETE_DELAY_SECONDSinstead of magic number- Properly restores interrupt status in the catch block
- Uses
volatilefor thread-safe visibility ofDELAY_SHARD_ASSIGNMENT- Includes informative logging before and after the delay
|
❌ Gradle check result for fa2a531: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…de cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
fa2a531 to
59ddad9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java (1)
35-37: Variable shadowing: rename parameter to avoid confusion.The parameter
messagesToCaptureshadows the instance field of the same name, which reduces code clarity.🔎 Apply this diff to rename the parameter:
-public void addMessagesToCapture(Set<String> messagesToCapture) { - messagesToCapture.forEach(this::addMessageToCapture); +public void addMessagesToCapture(Set<String> messages) { + messages.forEach(this::addMessageToCapture); }server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (2)
418-425: Consider explicit cleanup of transient cluster settings.The transient cluster settings (
follower_lag.timeout,publish.timeout,cluster_concurrent_recoveries) are modified here but not explicitly reset. While the test framework likely handles cleanup between tests, explicitly resetting these settings in a finally block ortearDown()method would be more robust and prevent potential test pollution.🔎 Consider adding explicit cleanup:
You could capture the original values before modification and restore them in a finally block within each test method that calls this helper, or add a corresponding cleanup helper method that resets these to defaults.
Example approach:
private void cleanupClusterSettings() { ClusterUpdateSettingsRequest resetRequest = new ClusterUpdateSettingsRequest(); Settings resetSettings = Settings.builder() .putNull("cluster.follower_lag.timeout") .putNull("cluster.publish.timeout") .putNull("cluster.routing.allocation.cluster_concurrent_recoveries") .build(); resetRequest.transientSettings(resetSettings); internalCluster().client().admin().cluster().updateSettings(resetRequest).actionGet(); }Then call this in the finally blocks of the test methods after removing listeners.
497-507: Consider extracting the 30-second delay to a constant.The delay value is hardcoded here as
30 * 1000(line 501), whileTestIndexStoreListeneruses a named constantSHARD_DELETE_DELAY_SECONDS = 30(line 573). For consistency and maintainability, consider extracting this to a constant or reusing the existing one.🔎 Apply this diff to improve consistency:
At class level, add:
+ private static final int APPLIER_DELAY_SECONDS = 30;Then update the sleep:
- logger.info("Sleeping for 30 seconds"); - Thread.sleep(30 * 1000); + logger.info("Sleeping for {} seconds", APPLIER_DELAY_SECONDS); + Thread.sleep(APPLIER_DELAY_SECONDS * 1000);Note: The interrupt handling and error callback are correctly implemented per previous review feedback.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java(8 hunks)server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java(1 hunks)test/framework/src/main/java/org/opensearch/test/TestLogsAppender.java(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- server/src/internalClusterTest/java/org/opensearch/cluster/coordination/RequiresTestIndexStoreListener.java
- CHANGELOG.md
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java (1)
FollowersChecker(84-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: gradle-check
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, macos-15)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: Analyze (java)
- GitHub Check: detect-breaking-change
- GitHub Check: Mend Security Check
🔇 Additional comments (9)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (9)
36-86: LGTM! Import additions support the new test infrastructure.The added imports are necessary for the cluster state listener functionality, IndexStorePlugin integration, and improved logging with ParameterizedMessage.
108-130: LGTM! Annotation-based approach improves maintainability.The reflection-based mechanism properly addresses the previous concern about fragile string matching. This approach is more robust and survives test method renames.
145-150: LGTM! Modifiable collection enables dynamic log message capture.The switch to ArrayList allows the test framework to add additional messages to capture at line 439, which is necessary for the lag simulation tests.
159-169: LGTM! Field extraction enables settings reuse.Extracting nodeSettings to a field allows consistent configuration across multiple node startups in the test methods.
351-368: LGTM! Try-finally ensures proper cleanup of static test state.The test properly uses try-finally to guarantee that
DELAY_SHARD_ASSIGNMENTis reset even if the test fails, preventing pollution of subsequent tests. This addresses the critical concern from previous reviews.
370-397: LGTM! Listener lifecycle properly managed with try-finally.The test ensures that long-running cluster state listeners are always removed in the finally block, even if validation fails. This prevents the 30-second delays from affecting subsequent tests. Well done addressing the critical concern from previous reviews.
443-470: LGTM! Validation logic properly verifies publication lag scenario.The method correctly waits for node drops and validates the expected log sequence: delay introduction, node removal due to lag, and join request failures. The 120-second timeout provides adequate buffer for the configured lag timeouts (5s + 15s = 20s) and the 30-second delay.
472-495: LGTM! Recovery validation ensures cluster stabilizes and shards migrate correctly.The method properly validates that:
- The cluster recovers to green status with all 9 nodes
- All 3 shards are active (1 shard from "test" index + 2 shards from "index-1")
- All shards successfully migrated to red nodes
562-604: LGTM! Test infrastructure properly implements IndexStoreListener with correct interrupt handling.The implementation correctly:
- Uses a named constant for the delay duration
- Restores thread interrupt status in the catch block (line 597)
- Logs comprehensive diagnostic information including thread name and timing
- Uses volatile boolean for cross-thread visibility of the DELAY_SHARD_ASSIGNMENT flag
All critical concerns from previous reviews have been properly addressed.
|
❌ Gradle check result for 59ddad9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: shwetathareja <shwetathareja@live.com>
|
❌ Gradle check result for 85da046: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
rajiv-kv
left a comment
There was a problem hiding this comment.
Thanks for making the changes.
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Show resolved
Hide resolved
…ster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
162-169: Duplicate setting detected.
NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTINGis configured twice with different values:"10s"at line 164 and"100ms"at line 168. The second value overwrites the first, making line 164 ineffective dead code.🔎 Proposed fix
this.nodeSettings = Settings.builder() .put(RecoverySettings.INDICES_RECOVERY_RETRY_DELAY_NETWORK_SETTING.getKey(), "100ms") - .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "10s") .put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "200ms") .put(FollowersChecker.FOLLOWER_CHECK_INTERVAL_SETTING.getKey(), "100ms") .put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) .put(NodeConnectionsService.CLUSTER_NODE_RECONNECT_INTERVAL_SETTING.getKey(), "100ms") .build();
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
432-446: Consider extracting magic numbers to named constants.The delay values (3 seconds at line 437, 30 seconds at line 510) are hardcoded. While
SHARD_DELETE_DELAY_SECONDSis properly defined as a constant inTestIndexStoreListener, similar constants could improve readability for these delays.🔎 Optional improvement
private static final int CM_APPLIER_DELAY_SECONDS = 3; private static final int LISTENER_DELAY_SECONDS = 30;
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (1)
server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java (1)
FollowersChecker(84-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: Analyze (java)
🔇 Additional comments (4)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (4)
111-133: Annotation-based plugin loading is a good improvement.The use of
@RequiresTestIndexStoreListenerannotation with reflection-based detection is a clean solution that addresses the previous concern about string-based test name matching. This approach is more maintainable and won't break if tests are renamed.
354-400: LGTM! Proper resource cleanup with try-finally.Both new test methods correctly use try-finally blocks to ensure cleanup:
testClusterStabilityWhenClusterStatePublicationLagsOnShardCleanupresetsDELAY_SHARD_ASSIGNMENTin finallytestClusterStabilityWhenClusterStatePublicationLagsWithLongRunningListenerOnApplierThreadremoves listeners in finallyThis addresses the previous concerns about static state pollution and listener leaks.
480-503: Validation logic looks correct.The
validateClusterRecoverymethod properly:
- Waits for green status with all 9 nodes
- Verifies exactly 3 active shards (1 from "test" index + 2 from "index-1")
- Confirms all shards migrated to red nodes via cluster state inspection
This addresses the previous review request to assert shard migration and data integrity.
574-622: LGTM! Plugin and listener implementation addresses all previous concerns.The implementation correctly:
- Uses a named constant
SHARD_DELETE_DELAY_SECONDSinstead of magic number- Restores interrupt status with
Thread.currentThread().interrupt()in catch blocks- Properly shuts down the
ScheduledExecutorServiceafter use- Uses
volatilefor thread-safe access toDELAY_SHARD_ASSIGNMENT
|
❌ Gradle check result for 131aa8d: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19907 +/- ##
============================================
+ Coverage 73.21% 73.23% +0.01%
+ Complexity 71776 71750 -26
============================================
Files 5795 5795
Lines 328304 328304
Branches 47281 47281
============================================
+ Hits 240374 240432 +58
+ Misses 68684 68540 -144
- Partials 19246 19332 +86 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5e80506
into
opensearch-project:main
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
* Add preserve_dots paramater to ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Extend RootObjectMapper to support preserve_dots Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement flat field parsing in DocumentParser Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement validation in MapperService Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add index template support for preserve_dots tests Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * error handling with descriptive messages Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing redundancy Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding autoflattening at nested level as well Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing gradle spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Stab at optimizing the code via removing redundant code and fixing multiple functions Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing splitPathRespectingDisableObjects and trying to achieve it through existing getMapper function Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Revert "Reduction of if else statements by extracting conditional logic into focused methods" This reverts commit 9ca7e6d. Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * [Plugins] Relax jar hell check when extended plugins share transitive dependencies (#20103) * plugin service allow multiple extended plugins both extend the same module/plugin Signed-off-by: Karen X <karenxyr@gmail.com> * fix tests Signed-off-by: Karen X <karenxyr@gmail.com> --------- Signed-off-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> * Reverting change in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add integ test for simulating node join left event when data node clu… (#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding few more UTs in DocumentParserTests and ObjectMapperTests for code patch coverage Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding changes in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow1 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow2 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow3 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper second round and refactoring of code especially dynamic update portion Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing dead end code from Document Parser related to parseArray and parseObject Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow5 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow7 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow8 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing breaking changes in ObjectMapper builder method Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing org.opensearch.indices.template.SimpleIndexTemplateIT IT Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing array mapping with nested disable_objects as true Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Addressed review comments and removed redundant code Signed-off-by: Mohit Kumar <mohitamg@amazon.com> --------- Signed-off-by: Mohit Kumar <mohitamg@amazon.com> Signed-off-by: mohit10011999 <pinkijyoti1995@gmail.com> Signed-off-by: Karen X <karenxyr@gmail.com> Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Co-authored-by: Mohit Kumar <mohitamg@amazon.com> Co-authored-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com>
…rch-project#19958) * Add preserve_dots paramater to ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Extend RootObjectMapper to support preserve_dots Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement flat field parsing in DocumentParser Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement validation in MapperService Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add index template support for preserve_dots tests Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * error handling with descriptive messages Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing redundancy Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding autoflattening at nested level as well Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing gradle spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Stab at optimizing the code via removing redundant code and fixing multiple functions Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing splitPathRespectingDisableObjects and trying to achieve it through existing getMapper function Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Revert "Reduction of if else statements by extracting conditional logic into focused methods" This reverts commit 9ca7e6d. Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * [Plugins] Relax jar hell check when extended plugins share transitive dependencies (opensearch-project#20103) * plugin service allow multiple extended plugins both extend the same module/plugin Signed-off-by: Karen X <karenxyr@gmail.com> * fix tests Signed-off-by: Karen X <karenxyr@gmail.com> --------- Signed-off-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> * Reverting change in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add integ test for simulating node join left event when data node clu… (opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding few more UTs in DocumentParserTests and ObjectMapperTests for code patch coverage Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding changes in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow1 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow2 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow3 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper second round and refactoring of code especially dynamic update portion Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing dead end code from Document Parser related to parseArray and parseObject Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow5 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow7 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow8 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing breaking changes in ObjectMapper builder method Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing org.opensearch.indices.template.SimpleIndexTemplateIT IT Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing array mapping with nested disable_objects as true Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Addressed review comments and removed redundant code Signed-off-by: Mohit Kumar <mohitamg@amazon.com> --------- Signed-off-by: Mohit Kumar <mohitamg@amazon.com> Signed-off-by: mohit10011999 <pinkijyoti1995@gmail.com> Signed-off-by: Karen X <karenxyr@gmail.com> Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Co-authored-by: Mohit Kumar <mohitamg@amazon.com> Co-authored-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
…rch-project#19958) * Add preserve_dots paramater to ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Extend RootObjectMapper to support preserve_dots Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement flat field parsing in DocumentParser Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Implement validation in MapperService Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add index template support for preserve_dots tests Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * error handling with descriptive messages Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing redundancy Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more edge cases, autoflatten and test cases Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding autoflattening at nested level as well Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing gradle spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Stab at optimizing the code via removing redundant code and fixing multiple functions Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing splitPathRespectingDisableObjects and trying to achieve it through existing getMapper function Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Reduction of if else statements by extracting conditional logic into focused methods Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Revert "Reduction of if else statements by extracting conditional logic into focused methods" This reverts commit 9ca7e6d. Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * [Plugins] Relax jar hell check when extended plugins share transitive dependencies (opensearch-project#20103) * plugin service allow multiple extended plugins both extend the same module/plugin Signed-off-by: Karen X <karenxyr@gmail.com> * fix tests Signed-off-by: Karen X <karenxyr@gmail.com> --------- Signed-off-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> * Reverting change in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Add integ test for simulating node join left event when data node clu… (opensearch-project#19907) * Add integration test for simulating node join left event when data node cluster state publication lag because the cluster applier thread being busy. Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding few more UTs in DocumentParserTests and ObjectMapperTests for code patch coverage Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding changes in CHANGELOG.md Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow1 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow2 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow3 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Adding more tests for code patch coverage for DocumentParser and ObjectMapper second round and refactoring of code especially dynamic update portion Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Removing dead end code from Document Parser related to parseArray and parseObject Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow4 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow5 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow6 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow7 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow8 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing breaking changes in ObjectMapper builder method Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing spotless checks Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Trigger workflow9 Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Handling v2 templates for dotted field feature and stab at using existing methods in switch case Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing org.opensearch.indices.template.SimpleIndexTemplateIT IT Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Fixing array mapping with nested disable_objects as true Signed-off-by: Mohit Kumar <mohitamg@amazon.com> * Addressed review comments and removed redundant code Signed-off-by: Mohit Kumar <mohitamg@amazon.com> --------- Signed-off-by: Mohit Kumar <mohitamg@amazon.com> Signed-off-by: mohit10011999 <pinkijyoti1995@gmail.com> Signed-off-by: Karen X <karenxyr@gmail.com> Signed-off-by: Swetha Guptha <gupthasg@amazon.com> Co-authored-by: Mohit Kumar <mohitamg@amazon.com> Co-authored-by: Karen X <karenxyr@gmail.com> Co-authored-by: karenx <karenx@uber.com> Co-authored-by: Swetha Guptha <gupthasg@amazon.com>
…ster state publication lag because the cluster applier thread being busy.
Description
Add test for cluster stability to verify cluster becomes stable after node join-left loop in cluster due to cluster publication lag because of cluster state applier thread occupied by a cluster state listener for a long running task. This simulates the scenario where the cluster applier thread is busy with shard clean up activity leading to node drops because of publication lag.
Setup:
Creates 7-node cluster (1 cluster manager + 6 data nodes)
Adds slow cluster state listener to subset of data nodes (30s sleep)
Continuously moves shards between nodes to trigger cluster state changes
Verifies cluster remains stabilizes after the cluster state listener is removed.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.