Skip to content

Fix closed-index reroute handling#20648

Open
srikanthpadakanti wants to merge 9 commits intoopensearch-project:mainfrom
srikanthpadakanti:20200-closed-index-reroute-fix
Open

Fix closed-index reroute handling#20648
srikanthpadakanti wants to merge 9 commits intoopensearch-project:mainfrom
srikanthpadakanti:20200-closed-index-reroute-fix

Conversation

@srikanthpadakanti
Copy link
Copy Markdown
Contributor

Description

Prevents allocation of shards from closed indices during cluster reroute operations. Added a check in AllocationService.allocateExistingUnassignedShards() to skip shards belonging to closed indices, ensuring they remain unassigned until the index is explicitly reopened.

Related Issues

Resolves #20200

Check List

  • [ X ] Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Feb 17, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR fixes a bug where shard routings for closed indexes were being allocated without the index being reopened. The fix adds guards to skip allocation and batching when shards have an INDEX_CLOSED unassigned reason, along with comprehensive test coverage.

Changes

Cohort / File(s) Summary
Documentation
CHANGELOG.md
Added entry documenting the fix for shard routings of closed indexes being allocated without opening the index.
Core Allocation Logic
server/src/main/java/org/opensearch/gateway/GatewayAllocator.java, server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
Added guards to skip allocation and batching for unassigned shards with INDEX_CLOSED reason. GatewayAllocator returns early before primary/replica allocation logic; ShardsBatchGatewayAllocator prevents adding closed-index shards to batches.
Test Coverage
server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java
Added three new test methods: testIndexClosedShardsNotBatchedInBatchMode (verifies closed shards not batched), testIndexClosedShardsSkippedInNonBatchMode (verifies no allocation attempts for closed shards), and testNonClosedShardsStillBatchedNormally (verifies normal batching still works for open indexes).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: fixing how closed-index shard routings are handled during reroute operations to prevent unwanted allocation.
Description check ✅ Passed The PR description includes all required sections from the template: a clear description of changes, a linked issue reference (#20200), and a completed checklist confirming testing was included.
Linked Issues check ✅ Passed The code changes fully address the linked issue #20200 by adding guards in GatewayAllocator and ShardsBatchGatewayAllocator to skip allocation of shards with INDEX_CLOSED reason, preventing unwanted allocation of closed-index shards during reroute.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing closed-index reroute handling: two guard implementations to skip INDEX_CLOSED shards, CHANGELOG entry, and comprehensive test coverage with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Srikanth Padakanti added 2 commits February 17, 2026 12:47
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0bbea60 and 9f4ae54.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • server/src/main/java/org/opensearch/gateway/GatewayAllocator.java
  • server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
  • server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/gateway/GatewayAllocator.java
  • server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
🧬 Code graph analysis (1)
server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java (1)
server/src/main/java/org/opensearch/gateway/GatewayAllocator.java (1)
  • GatewayAllocator (74-369)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, macos-15)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, macos-15)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (8)
CHANGELOG.md (1)

32-32: Changelog entry matches the fix scope.

server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java (1)

23-23: Import aligns with INDEX_CLOSED checks.

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java (2)

46-46: Import supports the new INDEX_CLOSED guard.


183-187: Correctly skips allocation for closed indices.

server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java (4)

26-33: Imports support the new test scenarios.


671-745: Good coverage for batch mode with INDEX_CLOSED shards.


747-881: Non-batch mode guard is well validated.


883-946: Control test for OPEN index batching is solid.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java`:
- Around line 404-412: Move the INDEX_CLOSED guard so closed-index shards are
handled before checking currentBatchedShards: in the unassigned.forEach(...)
lambda (the shardRouting loop), first check shardRouting.unassignedInfo() !=
null && shardRouting.unassignedInfo().getReason() ==
UnassignedInfo.Reason.INDEX_CLOSED and if true remove/evict that shardId from
currentBatchedShards (so stale batches are cleared, e.g., by calling the same
logic used by refreshShardBatches) and return; only after that test perform the
existing containsKey(shardRouting.shardId()) / primary checks and add to
newShardsToBatch as before. This ensures previously-batched INDEX_CLOSED shards
are not retained.

…hards

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 24281a0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 5fac3a4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines +406 to +408
if (shardRouting.unassignedInfo() != null && shardRouting.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CLOSED) {
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this logic in ShardRouting itself so that it can get reused?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bukhtawar Yes. Moved and pushed the change.

…reroute-fix

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 26, 2026

PR Reviewer Guide 🔍

(Review updated until commit 62c7b43)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add isClosedIndexShard helper method to ShardRouting

Relevant files:

  • server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java
  • server/src/test/java/org/opensearch/cluster/routing/ShardRoutingTests.java

Sub-PR theme: Skip closed index shard allocation in non-batch mode

Relevant files:

  • server/src/main/java/org/opensearch/gateway/GatewayAllocator.java
  • server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java

Sub-PR theme: Skip closed index shard batching in batch mode

Relevant files:

  • server/src/main/java/org/opensearch/gateway/ShardsBatchGatewayAllocator.java
  • server/src/test/java/org/opensearch/gateway/GatewayAllocatorTests.java

⚡ Recommended focus areas for review

Early Return Logic

The early return at line 406 uses a bare return statement within a forEach lambda, which only exits the current lambda iteration, not the entire method. This is correct behavior for skipping closed index shards, but should be verified that this doesn't inadvertently skip processing of subsequent shards in the batch.

// Skip allocation for closed index shards - for new shards and already-batched shards whose index subsequently closed
if (shardRouting.isClosedIndexShard()) {
    return;
}
Null Check Redundancy

The method checks unassigned() first, then checks unassignedInfo != null. Since unassigned() should guarantee that unassignedInfo exists, the null check may be redundant. Verify if there are edge cases where an unassigned shard could have null unassignedInfo.

public boolean isClosedIndexShard() {
    if (unassigned() && unassignedInfo != null) {
        return unassignedInfo.getReason() == UnassignedInfo.Reason.INDEX_CLOSED;
    }
    return false;
}
Test Coverage Gap

The test testAlreadyBatchedShardIndexClosedLater doesn't verify what happens to the primary shard when the index is closed. Only the replica shard behavior is tested after closure. Consider adding assertions for primary shard handling.

public void testAlreadyBatchedShardIndexClosedLater() {
    final ShardId shardId = new ShardId("test-index", "_na_", 0);
    final DiscoveryNode node = newNode("node1");

    ShardRouting primaryShard = TestShardRouting.newShardRouting(shardId, node.getId(), true, ShardRoutingState.STARTED);

    ShardRouting replicaShard = ShardRouting.newUnassigned(
        shardId,
        false,
        RecoverySource.PeerRecoverySource.INSTANCE,
        new UnassignedInfo(
            UnassignedInfo.Reason.REPLICA_ADDED,
            "replica added",
            null,
            0,
            System.nanoTime(),
            System.currentTimeMillis(),
            false,
            UnassignedInfo.AllocationStatus.NO_ATTEMPT,
            Collections.emptySet()
        )
    );

    Metadata metadata = Metadata.builder()
        .put(
            IndexMetadata.builder(shardId.getIndexName())
                .settings(settings(Version.CURRENT))
                .numberOfShards(1)
                .numberOfReplicas(1)
                .state(IndexMetadata.State.OPEN)
                .putInSyncAllocationIds(0, Sets.newHashSet(primaryShard.allocationId().getId()))
        )
        .build();

    IndexRoutingTable.Builder indexRoutingTable = IndexRoutingTable.builder(shardId.getIndex())
        .addIndexShard(new IndexShardRoutingTable.Builder(shardId).addShard(primaryShard).addShard(replicaShard).build());

    RoutingTable routingTable = RoutingTable.builder().add(indexRoutingTable).build();

    clusterState = ClusterState.builder(ClusterName.CLUSTER_NAME_SETTING.getDefault(Settings.EMPTY))
        .metadata(metadata)
        .routingTable(routingTable)
        .build();

    testAllocation = new RoutingAllocation(
        new AllocationDeciders(Collections.emptyList()),
        new RoutingNodes(clusterState, false),
        clusterState,
        ClusterInfo.EMPTY,
        SnapshotShardSizeInfo.EMPTY,
        System.nanoTime()
    );

    Set<String> replicaBatches = testShardsBatchGatewayAllocator.createAndUpdateBatches(testAllocation, false);
    assertEquals("Replica shard should be batched", 1, replicaBatches.size());
    assertEquals(1, testShardsBatchGatewayAllocator.getBatchIdToStoreShardBatch().size());

    ShardsBatchGatewayAllocator.ShardsBatch batch = testShardsBatchGatewayAllocator.getBatchIdToStoreShardBatch()
        .values()
        .iterator()
        .next();
    assertTrue("Batch should contain the replica shard", batch.getBatchedShards().contains(shardId));

    ShardRouting closedReplicaShard = ShardRouting.newUnassigned(
        shardId,
        false,
        RecoverySource.PeerRecoverySource.INSTANCE,
        new UnassignedInfo(
            UnassignedInfo.Reason.INDEX_CLOSED,
            "index closed",
            null,
            0,
            System.nanoTime(),
            System.currentTimeMillis(),
            false,
            UnassignedInfo.AllocationStatus.NO_ATTEMPT,
            Collections.emptySet()
        )
    );

    Metadata closedMetadata = Metadata.builder()
        .put(
            IndexMetadata.builder(shardId.getIndexName())
                .settings(settings(Version.CURRENT))
                .numberOfShards(1)
                .numberOfReplicas(1)
                .state(IndexMetadata.State.CLOSE)
                .putInSyncAllocationIds(0, Sets.newHashSet(primaryShard.allocationId().getId()))
        )
        .build();

    IndexRoutingTable.Builder closedIndexRoutingTable = IndexRoutingTable.builder(shardId.getIndex())
        .addIndexShard(new IndexShardRoutingTable.Builder(shardId).addShard(primaryShard).addShard(closedReplicaShard).build());

    RoutingTable closedRoutingTable = RoutingTable.builder().add(closedIndexRoutingTable).build();

    ClusterState closedClusterState = ClusterState.builder(ClusterName.CLUSTER_NAME_SETTING.getDefault(Settings.EMPTY))
        .metadata(closedMetadata)
        .routingTable(closedRoutingTable)
        .build();

    RoutingAllocation closedAllocation = new RoutingAllocation(
        new AllocationDeciders(Collections.emptyList()),
        new RoutingNodes(closedClusterState, false),
        closedClusterState,
        ClusterInfo.EMPTY,
        SnapshotShardSizeInfo.EMPTY,
        System.nanoTime()
    );

    Set<String> batchesAfterClose = testShardsBatchGatewayAllocator.createAndUpdateBatches(closedAllocation, false);

    if (batchesAfterClose.isEmpty()) {
        assertEquals("Batch should be cleaned up when empty", 0, testShardsBatchGatewayAllocator.getBatchIdToStoreShardBatch().size());
    } else {
        ShardsBatchGatewayAllocator.ShardsBatch batchAfterClose = testShardsBatchGatewayAllocator.getBatchIdToStoreShardBatch()
            .values()
            .iterator()
            .next();
        assertFalse(
            "Batch should not contain the closed index shard after reroute",
            batchAfterClose.getBatchedShards().contains(shardId)
        );
    }
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 26, 2026

PR Code Suggestions ✨

Latest suggestions up to 62c7b43

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Add assertion for unassignedInfo consistency

The method checks unassignedInfo != null after already calling unassigned(), but
unassigned() already verifies that state == ShardRoutingState.UNASSIGNED. Since
unassignedInfo should always be non-null for unassigned shards, the null check is
redundant. Consider removing it or adding a defensive assertion to catch unexpected
states.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [960-965]

 public boolean isClosedIndexShard() {
-    if (unassigned() && unassignedInfo != null) {
+    if (unassigned()) {
+        assert unassignedInfo != null : "unassignedInfo should not be null for unassigned shards";
         return unassignedInfo.getReason() == UnassignedInfo.Reason.INDEX_CLOSED;
     }
     return false;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies that unassignedInfo should not be null for unassigned shards, making the null check potentially redundant. However, the existing null check provides defensive programming that prevents NPE in edge cases. The suggested assertion would help catch unexpected states during development, but this is a minor improvement rather than fixing a critical issue.

Low

Previous suggestions

Suggestions up to commit 16f8050
CategorySuggestion                                                                                                                                    Impact
General
Remove redundant null check

The null check for unassignedInfo is redundant since unassigned() already verifies
that unassignedInfo is not null. When a shard is unassigned, unassignedInfo is
guaranteed to be non-null by the class invariants. Remove the redundant null check
to simplify the logic.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [960-965]

 public boolean isClosedIndexShard() {
-    if (unassigned() && unassignedInfo != null) {
+    if (unassigned()) {
         return unassignedInfo.getReason() == UnassignedInfo.Reason.INDEX_CLOSED;
     }
     return false;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the unassignedInfo != null check is redundant if unassigned() guarantees unassignedInfo is non-null. However, this is a defensive programming pattern that may be intentional for safety. The improvement is valid but represents a minor code simplification rather than fixing a critical issue.

Medium
Suggestions up to commit 53e0f18
CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify shard is unassigned first

The method should verify that the shard is actually unassigned before checking the
unassigned reason. A shard that is assigned (STARTED, INITIALIZING, or RELOCATING)
should never be considered a closed index shard, even if it previously had
INDEX_CLOSED as an unassigned reason.

server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java [960-965]

 public boolean isClosedIndexShard() {
-    if (unassignedInfo != null) {
+    if (unassigned() && unassignedInfo != null) {
         return unassignedInfo.getReason() == UnassignedInfo.Reason.INDEX_CLOSED;
     }
     return false;
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential logical issue. The method isClosedIndexShard() should verify the shard is actually unassigned before checking the unassigned reason, as assigned shards (STARTED, INITIALIZING, RELOCATING) should not be considered closed index shards. This prevents incorrect behavior if the method is called on assigned shards that previously had INDEX_CLOSED as an unassigned reason.

Medium
Suggestions up to commit a4edf7c
CategorySuggestion                                                                                                                                    Impact
General
Remove redundant null check

The null check for unassignedInfo() is redundant since the method is called within a
context where shardRouting.unassigned() is already asserted to be true. Unassigned
shards always have non-null UnassignedInfo. Remove the null check to simplify the
condition.

server/src/main/java/org/opensearch/gateway/GatewayAllocator.java [185-187]

-if (shardRouting.unassignedInfo() != null && shardRouting.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CLOSED) {
+if (shardRouting.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CLOSED) {
     return;
 }
Suggestion importance[1-10]: 4

__

Why: While the suggestion is technically correct that unassigned shards should have non-null UnassignedInfo, removing the null check could introduce risk if the assumption doesn't hold in all edge cases. The null check provides defensive programming and has minimal performance impact, making this a minor code style improvement rather than a critical change.

Low

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 53e0f18

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 16f8050

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 16f8050: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 62c7b43

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 62c7b43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@srikanthpadakanti
Copy link
Copy Markdown
Contributor Author

@andrross @Bukhtawar I see build failing with few test failures and these tests are expecting cluster to be in green state. and my fix making the cluster state to yellow.

Can you please let me know if my fix make sense and changing cluster state is okay>? if so, I can make changes to those tests and trigger a build again.

@andrross
Copy link
Copy Markdown
Member

tests are expecting cluster to be in green state. and my fix making the cluster state to yellow

@srikanthpadakanti This is a very significant behavior change and does not sound to me like something we'd want to do. Are you saying having closed indices will result in a yellow cluster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Cluster Manager

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] Shard routings for closed index are allocated again without opening the index

3 participants