
Implementing batched deletions of stale ClusterMetadataManifests in RemoteClusterStateCleanupManager and adding configurable deletion timeouts #20515

Closed
Pranshu-S wants to merge 17 commits into opensearch-project:main from Pranshu-S:remote-state-cleanup-manager-batched-deletions-and-timeouts

Conversation

Contributor

@Pranshu-S Pranshu-S commented Jan 31, 2026

Description

Fixes remote cluster state cleanup failures that were causing stale metadata to pile up in remote storage when deletions time out, along with the fixes mentioned in the tagged GitHub issue.

Issue context (#20564): RemoteClusterStateCleanupManager runs every ~5 minutes (configurable) and performs sequential deletions (global metadata → index metadata → ephemeral attrs → manifests). A recent change added a 30s timeout to the S3 “sync” delete path; with large delete sets this can throw IOException, abort the whole cleanup run (single try/catch), and leave later phases undeleted—making the next run even larger and more likely to fail.
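The failure mode can be sketched in plain Java (phase names follow the order above; the method names are illustrative, not the actual OpenSearch code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the cascade failure: one try/catch around the whole
// sequential run means a timeout in an early phase leaves every
// later phase undeleted. Phase names are illustrative.
public class SequentialCleanupFailureSketch {

    static void deletePhase(String phase, boolean timesOut, List<String> completed) throws IOException {
        if (timesOut) throw new IOException("Delete operation timed out after 30 seconds");
        completed.add(phase);
    }

    static List<String> runCleanup(boolean globalMetadataTimesOut) {
        List<String> completed = new ArrayList<>();
        try {
            deletePhase("global-metadata", globalMetadataTimesOut, completed);
            deletePhase("index-metadata", false, completed);
            deletePhase("ephemeral-attrs", false, completed);
            deletePhase("manifests", false, completed);
        } catch (IOException e) {
            // the single catch aborts the run; the remaining phases never
            // execute, so the next run's delete set is strictly larger
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(runCleanup(false)); // all four phases complete
        System.out.println(runCleanup(true));  // prints [] - everything left stale
    }
}
```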

What this PR does

  • Batch deletes stale manifest* blobs to reduce per-call delete size and avoid timeout/payload issues.
    • Controlled by:
      • cluster.remote_store.state.cleanup.batch_size
      • cluster.remote_store.state.cleanup.max_batches
  • Makes the delete timeout configurable for remote-state cleanup
  • Fixes an issue where the update-interval call was not honoured when deletion was disabled (set to -1)
  • Fixes an ordering mismatch in the deletion workflow: manifests were deleted before the IndexRouting paths, which could leave IndexRouting paths dangling
  • Fixes an issue where lastCleanupAttemptStateVersion was advanced even when cleanup failed, making subsequent deletes a no-op when a previous deletion had failed and there were no (or fewer than 10) cluster-state changes since
  • Reduces logging overhead by moving the per-file deletion log line from DEBUG to TRACE
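As a rough illustration of the batching behaviour (hypothetical helper names; the real implementation lives in RemoteClusterStateCleanupManager and goes through deleteBlobsIgnoringIfNotExists with a timeout):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the batched manifest deletion. partition/deleteInBatches,
// batchSize and maxBatches are illustrative stand-ins for the settings
// cluster.remote_store.state.cleanup.batch_size / .max_batches.
public class BatchedCleanupSketch {

    // Split the stale-manifest list into chunks of at most batchSize.
    static List<List<String>> partition(List<String> stale, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < stale.size(); i += batchSize) {
            batches.add(stale.subList(i, Math.min(i + batchSize, stale.size())));
        }
        return batches;
    }

    // Delete at most maxBatches chunks per run; a failing chunk is logged
    // and skipped, so one timeout no longer aborts the whole cleanup.
    static int deleteInBatches(List<String> stale, int batchSize, int maxBatches) {
        int deleted = 0;
        List<List<String>> batches = partition(stale, batchSize);
        for (int b = 0; b < Math.min(batches.size(), maxBatches); b++) {
            try {
                // stand-in for deleteBlobsIgnoringIfNotExists(batch, timeout)
                deleted += batches.get(b).size();
            } catch (RuntimeException e) {
                // log and continue with the next batch
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> stale = new ArrayList<>();
        for (int i = 0; i < 2500; i++) stale.add("manifest-" + i + ".dat");
        // 2500 stale manifests, batchSize 1000, maxBatches 2 -> 2000 deleted;
        // the remaining 500 are picked up by the next scheduled run
        System.out.println(deleteInBatches(stale, 1000, 2)); // prints 2000
    }
}
```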

Related Issues

Resolves #20564

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

coderabbitai bot commented Jan 31, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This pull request introduces timeout-aware, batched cleanup for remote cluster state files with configurable parameters. It adds TimeValue-based timeout handling across blob container deletion operations, implements batched manifest deletion in RemoteClusterStateCleanupManager, and introduces three new cluster settings for batch size, maximum batches, and cleanup timeout to prevent deletion operation timeouts from cascading stale-file buildup.

Changes

Cohort / File(s) Summary
Blob Container Timeout Support
plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java, server/src/main/java/org/opensearch/common/blobstore/BlobContainer.java, server/src/main/java/org/opensearch/common/blobstore/EncryptedBlobContainer.java
Added TimeValue-based timeout overloads for deleteBlobsIgnoringIfNotExists. S3BlobContainer replaces hard-coded 30s timeout with configurable TimeValue, updates getFutureValue to handle infinite (MINUS_ONE) vs finite timeouts, and enhanced error messaging. BlobContainer and EncryptedBlobContainer delegate to timeout-aware variants.
Transfer Service Timeout Integration
server/src/main/java/org/opensearch/index/translog/transfer/TransferService.java, server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java
Added timeout-aware overloads for deleteBlobs. TransferService provides default method throwing UnsupportedOperationException; BlobStoreTransferService implements timeout delegation to blob container.
Routing Service Timeout Overloads
server/src/main/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableService.java, server/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java, server/src/main/java/org/opensearch/cluster/routing/remote/NoopRemoteRoutingTableService.java
Added timeout-aware overloads for deleteStaleIndexRoutingPaths and deleteStaleIndexRoutingDiffPaths across the interface hierarchy. InternalRemoteRoutingTableService implements timeout propagation; others provide default/no-op variants.
Batched Cleanup Implementation
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java, server/src/main/java/org/opensearch/gateway/remote/RemoteStateCleanupFailedException.java, server/src/main/java/org/opensearch/common/settings/ClusterSettings.java
Introduced batched manifest cleanup with configurable batch size, max batches, and timeout. Added three new cluster settings (BATCH_SIZE, MAX_BATCHES, TIMEOUT) with defaults. Implemented exception wrapping via RemoteStateCleanupFailedException for better error handling in batch loops.
Unit Tests: Timeout Behavior
plugins/repository-s3/src/test/java/org/opensearch/repositories/s3/S3BlobContainerTimeoutTests.java, server/src/test/java/org/opensearch/common/blobstore/BlobContainerTests.java, server/src/test/java/org/opensearch/common/blobstore/EncryptedBlobContainerTests.java, server/src/test/java/org/opensearch/index/translog/transfer/BlobStoreTransferServiceTests.java
Added tests verifying timeout parameter handling, delegation, and default method behavior across blob container and transfer service layers.
Integration & Manager Tests
server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerIT.java, server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java
Added integration test for multi-batch deletion flow and comprehensive unit tests for batched cleanup behavior, settings updates, timeout propagation, and edge cases (exhausted max batches, single/multiple batches, one-new-manifest scenarios).
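The infinite-vs-finite timeout branching described for S3BlobContainer's getFutureValue can be sketched with a plain CompletableFuture; here a negative value plays the role of TimeValue.MINUS_ONE, and a millisecond-resolution wait is assumed (this is a sketch, not the actual OpenSearch code):

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Plain-Java sketch of the infinite (MINUS_ONE) vs finite timeout branching.
public class FutureTimeoutSketch {

    static <T> T getFutureValue(CompletableFuture<T> future, long timeoutMillis) throws IOException {
        try {
            if (timeoutMillis < 0) {
                return future.get();                      // wait indefinitely
            }
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Future got interrupted", e);
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        } catch (TimeoutException e) {
            future.cancel(true);                          // don't leak the pending task
            throw new IOException("Delete operation timed out after [" + timeoutMillis + "ms]", e);
        }
    }

    // Convenience used below: does a never-completing future hit the timeout?
    static boolean timesOut(long timeoutMillis) {
        try {
            getFutureValue(new CompletableFuture<String>(), timeoutMillis);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getFutureValue(CompletableFuture.completedFuture("ok"), -1)); // prints ok
        System.out.println(timesOut(50)); // prints true
    }
}
```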

Sequence Diagram(s)

sequenceDiagram
    participant CM as ClusterManager
    participant RCC as RemoteClusterStateCleanupManager
    participant BSTS as BlobStoreTransferService
    participant BC as BlobContainer
    participant RRTS as RemoteRoutingTableService
    
    CM->>RCC: scheduleCleanup()
    activate RCC
    RCC->>RCC: listAllManifests()
    RCC->>RCC: identifyStaleFiles()
    RCC->>RCC: initBatchLoop(batchSize=1000, maxBatches=100)
    
    loop for each batch until stale list exhausted or maxBatches reached
        RCC->>RRTS: deleteStaleIndexRoutingPaths(stalePaths, timeout)
        activate RRTS
        RRTS->>BC: deleteBlobsIgnoringIfNotExists(files, timeout)
        activate BC
        BC-->>RRTS: success
        deactivate BC
        RRTS-->>RCC: complete
        deactivate RRTS
        
        RCC->>RRTS: deleteStaleIndexRoutingDiffPaths(diffPaths, timeout)
        activate RRTS
        RRTS->>BC: deleteBlobsIgnoringIfNotExists(files, timeout)
        activate BC
        BC-->>RRTS: success
        deactivate BC
        RRTS-->>RCC: complete
        deactivate RRTS
        
        RCC->>BSTS: deleteBlobs(manifestPaths, timeout)
        activate BSTS
        BSTS->>BC: deleteBlobsIgnoringIfNotExists(manifests, timeout)
        activate BC
        BC-->>BSTS: success
        deactivate BC
        BSTS-->>RCC: complete
        deactivate BSTS
        
        alt Exception in batch
            RCC->>RCC: wrapInRemoteStateCleanupFailedException()
            RCC->>RCC: log error & continue next batch
        end
    end
    deactivate RCC

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • opensearch-project/OpenSearch#20146: Touches the deleteBlobsIgnoringIfNotExists API and adds callers that invoke timeout-aware deletion methods, complementing the timeout parameter additions in this PR.

Suggested labels

>bug, >enhancement, storage, remote-store

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 10.81%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the title clearly summarizes the main changes: implementing batched deletions of ClusterMetadataManifests and adding configurable deletion timeouts in RemoteClusterStateCleanupManager.
  • Description check — ✅ Passed: the description addresses the problem context, explains the root cause, details the solutions implemented, and lists related issues and checklist items following the template structure.
  • Linked Issues check — ✅ Passed: the PR addresses all coding requirements from issue #20564: batching manifest deletions, making timeouts configurable, reordering deletions to avoid dangling routing paths, preventing state-version advancement on failures, and reducing logging overhead.
  • Out of Scope Changes check — ✅ Passed: all code changes are directly aligned with the stated objectives: timeout handling additions, batched deletion implementation, configuration settings, test coverage, and exception handling to prevent cascade failures.



@github-actions
Contributor

❌ Gradle check result for 0077d20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 0077d20 to 1624616 Compare February 1, 2026 04:54
Contributor

github-actions bot commented Feb 1, 2026

❌ Gradle check result for 1624616: FAILURE


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 1624616 to 66ef9d4 Compare February 1, 2026 08:57
Contributor

github-actions bot commented Feb 1, 2026

❌ Gradle check result for 66ef9d4: FAILURE


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 66ef9d4 to 0556b54 Compare February 2, 2026 04:39
Contributor

github-actions bot commented Feb 2, 2026

❌ Gradle check result for 0556b54: FAILURE


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 0556b54 to 50de664 Compare February 2, 2026 05:02
@Pranshu-S
Contributor Author

Appears to be stuck -

Use queue information to find build number in Jenkins if available
jq: parse error: Invalid numeric literal at line 1, column 7
WORKFLOW_URL
Job not started yet. Waiting for 60 seconds before next attempt.
time passed: 3960

Contributor

github-actions bot commented Feb 2, 2026

❌ Gradle check result for a8cb47a: FAILURE


Contributor

github-actions bot commented Feb 2, 2026

❌ Gradle check result for 549e3e0: FAILURE


@Pranshu-S
Contributor Author

Looks like the execution has not started.

Job not started yet. Waiting for 60 seconds before next attempt.
time passed: 5340
Use queue information to find build number in Jenkins if available
jq: parse error: Invalid numeric literal at line 1, column 7
WORKFLOW_URL

Contributor

github-actions bot commented Feb 2, 2026

❌ Gradle check result for 638eb6f: FAILURE


Contributor

github-actions bot commented Feb 3, 2026

❌ Gradle check result for a00b057: FAILURE


Contributor

github-actions bot commented Feb 3, 2026

❌ Gradle check result for 300eff0: FAILURE


Contributor

github-actions bot commented Feb 3, 2026

❌ Gradle check result for 79dc692: FAILURE


Contributor

github-actions bot commented Feb 4, 2026

❌ Gradle check result for 193258b: FAILURE


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 193258b to 5a55b86 Compare February 4, 2026 11:08
Contributor

github-actions bot commented Feb 4, 2026

❌ Gradle check result for 5a55b86: null


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch 2 times, most recently from 70519d4 to e60a6bf Compare February 4, 2026 15:48
Contributor

github-actions bot commented Feb 4, 2026

❌ Gradle check result for e60a6bf: FAILURE


@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from e60a6bf to 21d337e Compare February 5, 2026 02:22
Contributor

github-actions bot commented Feb 5, 2026

❌ Gradle check result for 21d337e: FAILURE


…emoteClusterStateCleanupManager and adding configurable deletions timeouts

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
…manager-batched-deletions-and-timeouts

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@Pranshu-S Pranshu-S force-pushed the remote-state-cleanup-manager-batched-deletions-and-timeouts branch from 21d337e to d371ec4 Compare February 5, 2026 05:21
Contributor

github-actions bot commented Feb 5, 2026

❌ Gradle check result for d371ec4: FAILURE


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java (1)

449-468: ⚠️ Potential issue | 🟠 Major

Precision loss: timeout.seconds() truncates sub-second TimeValue to 0, causing immediate TimeoutException.

TimeValue.seconds() performs integer division, so a value like TimeValue.timeValueMillis(500) yields 0. Calling future.get(0, SECONDS) will time out instantly. Use millis() with TimeUnit.MILLISECONDS to preserve the full resolution of the provided TimeValue.

The error message on line 467 has the same issue — it would report "0 seconds" for sub-second timeouts.

🐛 Proposed fix
         try {
             if (timeout.equals(TimeValue.MINUS_ONE)) {
                 return future.get();
             } else {
-                return future.get(timeout.seconds(), TimeUnit.SECONDS);
+                return future.get(timeout.millis(), TimeUnit.MILLISECONDS);
             }
         } catch (InterruptedException e) {
             Thread.currentThread().interrupt();
             throw new IOException("Future got interrupted", e);
         } catch (ExecutionException e) {
             if (e.getCause() instanceof IOException) {
                 throw (IOException) e.getCause();
             }
             throw new RuntimeException(e.getCause());
         } catch (TimeoutException e) {
             FutureUtils.cancel(future);
-            throw new IOException(String.format(Locale.ROOT, "Delete operation timed out after %s seconds", timeout.seconds()), e);
+            throw new IOException(String.format(Locale.ROOT, "Delete operation timed out after [%s]", timeout), e);
         }
server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java (1)

773-787: ⚠️ Potential issue | 🟡 Minor

Spy cleanUpManager is created but never used.

Line 785 calls remoteClusterStateCleanupManager.cleanUpStaleFiles() (the original) instead of cleanUpManager.cleanUpStaleFiles() (the spy). The test passes because the skip-path doesn't reach deleteStaleClusterMetadata regardless, but this is inconsistent with testRemoteCleanupCallsDeleteIfVersionIncrementGreaterThanThreshold (line 802) which correctly calls cleanUpManager.cleanUpStaleFiles().

Suggested fix
-        remoteClusterStateCleanupManager.cleanUpStaleFiles();
+        cleanUpManager.cleanUpStaleFiles();
🤖 Fix all issues with AI agents
In `server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerIT.java`:
- Around line 340-362: This test omits disabling the periodic cleanup before
teardown which can race with shutdown; after the final assertions (after
computing last10Final) update the cleanup interval to -1 the same way other
tests do so the background task stops — e.g., call the
RemoteClusterStateCleanupManager cleanup interval setting setter (set the
RemoteClusterStateCleanupManager cleanup interval setting to -1) or invoke the
same helper used in
testRemoteCleanupDeleteStale/testRemoteCleanupDeleteStaleIndexRoutingFiles to
disable the cleanup task before the test exits.

In `server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java`:
- Around line 525-538: The Runnable submitted via threadpool.executor in
RemoteClusterStateCleanupManager currently throws
RemoteStateCleanupFailedException which becomes an uncaught exception on the
worker thread; instead remove the throw from inside the executor and replace it
with logging plus setting a failure indicator so callers can observe the failure
without crashing the thread. Specifically, in the Runnable (the block that logs
"Exception occurred while deleting Remote Cluster Metadata...") stop rethrowing
RemoteStateCleanupFailedException, keep/emphasize logger.error with the
exception, and set a class-level failure flag or status field (e.g., a boolean
like deleteStaleMetadataFailed or a status enum) so the outer manager can react;
update any code that previously relied on the exception to check that flag (and
re-evaluate that behavior once you make version advancement conditional on
success).
- Around line 217-230: The try/catch around deleteStaleClusterMetadata in
cleanUpStaleFiles is ineffective because deleteStaleClusterMetadata submits work
asynchronously via threadpool.executor(REMOTE_PURGE).execute(...) and returns
immediately, so RemoteStateCleanupFailedException thrown in the Runnable never
reaches this catch and lastCleanupAttemptStateVersion is always advanced; fix by
moving the update of lastCleanupAttemptStateVersion into the asynchronous
success path (i.e., set lastCleanupAttemptStateVersion =
cleanUpAttemptStateVersion only after the cleanup loop inside
deleteStaleClusterMetadata completes without error), and ensure when catching
RemoteStateCleanupFailedException inside the Runnable you log the exception
(include the Throwable) and do NOT advance lastCleanupAttemptStateVersion there;
alternatively make deleteStaleClusterMetadata synchronous if you prefer to keep
the update in cleanUpStaleFiles.

In `server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java`:
- Around line 1172-1177: Update the assertion message in
RemoteClusterStateCleanupManagerTests to accurately reflect the checked value:
the test currently asserts that the second invocation contains "manifest-91.dat"
via staleInvocations/staleCaptor but the message says "manifest-90"; change the
message to refer to "manifest-91.dat" (or alternately adjust the predicate to
check for "manifest-90" if that was intended) so the assertion message and the
anyMatch check on staleInvocations.get(0).stream() are consistent.
🧹 Nitpick comments (9)
server/src/main/java/org/opensearch/gateway/remote/RemoteStateCleanupFailedException.java (1)

22-25: Redundant toString() override.

This override simply delegates to super.toString(), which is the default behavior. It can be removed.

♻️ Suggested removal
 public class RemoteStateCleanupFailedException extends RuntimeException {

     public RemoteStateCleanupFailedException(ParameterizedMessage parameterizedMessage, Throwable cause) {
         super(parameterizedMessage.toString(), cause);
     }
-
-    @Override
-    public String toString() {
-        return super.toString();
-    }
 }
plugins/repository-s3/src/test/java/org/opensearch/repositories/s3/S3BlobContainerTimeoutTests.java (1)

39-95: Consider adding edge-case tests for timeout behavior.

The current tests cover the happy path (successful completion) and verify call delegation, which is good. Consider adding tests for:

  • TimeValue.MINUS_ONE → verifies future.get() (infinite wait) path
  • A future that doesn't complete within the timeout → verifies TimeoutException handling and future cancellation

These would exercise the branching logic in getFutureValue more thoroughly.

server/src/main/java/org/opensearch/common/blobstore/BlobContainer.java (1)

285-295: Javadoc is missing @param timeout documentation.

The new timeout parameter is not documented in the Javadoc. Also, consider whether warn level is appropriate here — if a concrete implementation never overrides this method, every cleanup call will emit a warning. debug or info might be less noisy for a fallback path that's expected for most BlobContainer implementations.

Proposed Javadoc and log-level fix
     /**
      * Deletes the blobs with given names. This method will not throw an exception
      * when one or multiple of the given blobs don't exist and simply ignore this case.
      *
     * @param   blobNames  The names of the blob to delete.
+     * @param   timeout    The timeout for the delete operation. Implementations that do not
+     *                     support timeouts will ignore this parameter and delegate to
+     *                     {@link #deleteBlobsIgnoringIfNotExists(List)}.
     * @throws  IOException if a subset of blob exists but could not be deleted.
      */
     default void deleteBlobsIgnoringIfNotExists(List<String> blobNames, TimeValue timeout) throws IOException {
-        LogManager.getLogger(this.getClass()).warn("No implementation exists for using timeout, skipping timeout input");
+        LogManager.getLogger(this.getClass()).debug("No implementation exists for using timeout, skipping timeout input");
         deleteBlobsIgnoringIfNotExists(blobNames);
     }
server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerIT.java (1)

324-331: Prefer static import over inline fully-qualified class reference.

java.util.stream.Collectors is used inline on lines 330 and 347 as java.util.stream.Collectors.toList(). Consider adding a static import for consistency with the rest of the file's import style.

Add import at top of file
+import java.util.stream.Collectors;

Then replace java.util.stream.Collectors.toList() with Collectors.toList() on lines 330 and 347.

server/src/main/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableService.java (1)

80-86: Consider delegating to the non-timeout variant instead of throwing.

The BlobContainer interface handles the same pattern by falling back to the non-timeout overload in its default method. Here, the default throws UnsupportedOperationException, which means any RemoteRoutingTableService implementation that doesn't override these will fail at runtime rather than degrade gracefully.

Since InternalRemoteRoutingTableService does override these, this isn't an immediate issue, but a consistent fallback pattern would be more defensive.

Proposed fallback to non-timeout variant
     default void deleteStaleIndexRoutingPaths(List<String> stalePaths, TimeValue timeout) throws IOException {
-        throw new UnsupportedOperationException("deleteStaleIndexRoutingPaths with timeouts is not supported");
+        deleteStaleIndexRoutingPaths(stalePaths);
     }

     default void deleteStaleIndexRoutingDiffPaths(List<String> stalePaths, TimeValue timeout) throws IOException {
-        throw new UnsupportedOperationException("deleteStaleIndexRoutingDiffPaths with timeouts is not supported");
+        deleteStaleIndexRoutingDiffPaths(stalePaths);
     }
server/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java (1)

271-311: Consider reducing duplication between timeout and non-timeout variants.

The four methods (lines 271–311) share near-identical structure — the only difference is the extra TimeValue parameter. You could have the non-timeout methods delegate to the timeout variants with a null timeout, reducing the duplicated try/catch/log/rethrow blocks.

Example: delegate non-timeout to timeout variant
     @Override
     public void deleteStaleIndexRoutingPaths(List<String> stalePaths) throws IOException {
-        try {
-            logger.debug(() -> "Deleting stale index routing files from remote - " + stalePaths);
-            blobStoreRepository.blobStore().blobContainer(BlobPath.cleanPath()).deleteBlobsIgnoringIfNotExists(stalePaths);
-        } catch (IOException e) {
-            logger.error(() -> new ParameterizedMessage("Failed to delete some stale index routing paths from {}", stalePaths), e);
-            throw e;
-        }
+        deleteStaleIndexRoutingPaths(stalePaths, null);
     }

     @Override
     public void deleteStaleIndexRoutingPaths(List<String> stalePaths, TimeValue timeValue) throws IOException {
         try {
             logger.debug(() -> "Deleting stale index routing files from remote - " + stalePaths);
-            blobStoreRepository.blobStore().blobContainer(BlobPath.cleanPath()).deleteBlobsIgnoringIfNotExists(stalePaths, timeValue);
+            if (timeValue != null) {
+                blobStoreRepository.blobStore().blobContainer(BlobPath.cleanPath()).deleteBlobsIgnoringIfNotExists(stalePaths, timeValue);
+            } else {
+                blobStoreRepository.blobStore().blobContainer(BlobPath.cleanPath()).deleteBlobsIgnoringIfNotExists(stalePaths);
+            }
         } catch (IOException e) {
             logger.error(() -> new ParameterizedMessage("Failed to delete some stale index routing paths from {}", stalePaths), e);
             throw e;
         }
     }

Same pattern for the diff-path methods.

server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java (1)

908-1013: Good batch-processing test coverage.

The four batched-deletion tests (testBatchedDeletionWithSingleBatch, testBatchedDeletionWithMultipleBatches, testBatchedDeletionStopsWhenManifestsLessThanRetained, testBatchedDeletionExhaustsMaxBatches) comprehensively cover the key scenarios. However, there is no test for the failure path where deleteClusterMetadata throws mid-batch and RemoteStateCleanupFailedException is raised. Adding such a test would help verify batch-level error handling and, importantly, would surface the lastCleanupAttemptStateVersion-advancement issue described in the production code review.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java (2)

179-201: Rescheduling the async deletion task on non-interval setting changes is unnecessary.

rescheduleAsyncDeletionTask() is called when cleanupBatchSize, cleanupMaxBatches, or cleanupTimeout are updated. These don't affect the task's scheduling interval — the next run will simply pick up the new field values. Only updateCleanupInterval genuinely needs to reschedule. The rescheduleIfNecessary() call is harmless but adds noise.


111-113: Consider volatile for fields written on the cluster-applier thread and read on REMOTE_PURGE.

cleanupBatchSize, cleanupMaxBatches, and cleanupTimeout are written by settings-update consumers (cluster-applier thread) and read by the batch-processing loop (REMOTE_PURGE thread). Without volatile, updated values may not be visible to the reader thread due to the Java Memory Model. This is consistent with the existing non-volatile staleFileCleanupInterval, but the new fields are read in a tight loop during batch processing, increasing the window for stale reads.
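A minimal, self-contained demonstration of the visibility guarantee volatile buys here (the field name mirrors the one discussed; the two threads stand in for the cluster-applier and REMOTE_PURGE threads):

```java
import java.util.concurrent.TimeUnit;

// Demonstrates why the dynamically-updated fields benefit from volatile:
// a value written on one thread (the "applier") must become visible to
// the spinning "purge" thread.
public class VolatileSettingsSketch {

    static volatile int cleanupBatchSize = 1000;

    static int observeUpdate() throws InterruptedException {
        Thread purgeLoop = new Thread(() -> {
            // with volatile, this loop is guaranteed to observe the write
            // below and terminate; without it, the Java Memory Model
            // permits the thread to spin on a stale value indefinitely
            while (cleanupBatchSize != 500) {
                Thread.onSpinWait();
            }
        });
        purgeLoop.start();
        cleanupBatchSize = 500; // settings-update (applier) thread applies the new value
        purgeLoop.join(TimeUnit.SECONDS.toMillis(5));
        if (purgeLoop.isAlive()) throw new AssertionError("update never observed");
        return cleanupBatchSize;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observeUpdate()); // prints 500
    }
}
```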

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6ab06b and 181b0fb.

📒 Files selected for processing (17)
  • plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java
  • plugins/repository-s3/src/test/java/org/opensearch/repositories/s3/S3BlobContainerTimeoutTests.java
  • server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerIT.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/NoopRemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/common/blobstore/BlobContainer.java
  • server/src/main/java/org/opensearch/common/blobstore/EncryptedBlobContainer.java
  • server/src/main/java/org/opensearch/common/settings/ClusterSettings.java
  • server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java
  • server/src/main/java/org/opensearch/gateway/remote/RemoteStateCleanupFailedException.java
  • server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java
  • server/src/main/java/org/opensearch/index/translog/transfer/TransferService.java
  • server/src/test/java/org/opensearch/common/blobstore/BlobContainerTests.java
  • server/src/test/java/org/opensearch/common/blobstore/EncryptedBlobContainerTests.java
  • server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java
  • server/src/test/java/org/opensearch/index/translog/transfer/BlobStoreTransferServiceTests.java
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java
  • server/src/main/java/org/opensearch/gateway/remote/RemoteStateCleanupFailedException.java
  • server/src/main/java/org/opensearch/common/blobstore/BlobContainer.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/NoopRemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/common/blobstore/EncryptedBlobContainer.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/RemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/index/translog/transfer/TransferService.java
  • server/src/main/java/org/opensearch/common/settings/ClusterSettings.java
  • server/src/main/java/org/opensearch/cluster/routing/remote/InternalRemoteRoutingTableService.java
  • server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java
📚 Learning: 2026-01-14T09:31:12.773Z
Learnt from: udabhas
Repo: opensearch-project/OpenSearch PR: 20095
File: server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java:4146-4165
Timestamp: 2026-01-14T09:31:12.773Z
Learning: For OpenSearch’s S3 path, index-level CryptoMetadata is passed from BlobStoreRepository to S3BlobContainer, and the final KMS key/context merge with repository settings is performed in SseKmsUtil.configureEncryptionSettings. Do not merge in BlobStoreRepository to avoid double-merging.

Applied to files:

  • server/src/test/java/org/opensearch/common/blobstore/EncryptedBlobContainerTests.java
  • server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java
  • plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java
🧬 Code graph analysis (3)
server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java (3)
server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java (1)
  • BlobStoreTransferService (52-376)
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java (2)
  • RemoteClusterStateCleanupManager (48-655)
  • AsyncStaleFileDeletion (626-654)
server/src/main/java/org/opensearch/threadpool/ThreadPool.java (1)
  • Names (99-130)
server/src/main/java/org/opensearch/common/settings/ClusterSettings.java (1)
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java (1)
  • RemoteClusterStateCleanupManager (48-655)
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java (1)
server/src/main/java/org/opensearch/threadpool/ThreadPool.java (1)
  • Names (99-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (17)
server/src/main/java/org/opensearch/common/blobstore/EncryptedBlobContainer.java (1)

183-186: LGTM!

Consistent delegation pattern matching the existing non-timeout overload.

server/src/main/java/org/opensearch/index/translog/transfer/TransferService.java (1)

105-107: LGTM!

Backward-compatible interface extension with a sensible UnsupportedOperationException default, overridden by BlobStoreTransferService.

plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3BlobContainer.java (2)

125-125: LGTM!

Good change from a raw long to a typed TimeValue constant, improving readability and consistency with the rest of the timeout-aware API surface.


441-446: LGTM!

Clean timeout-forwarding overload, consistent with the non-timeout variant above.

server/src/main/java/org/opensearch/index/translog/transfer/BlobStoreTransferService.java (1)

288-291: LGTM!

Clean override of the TransferService default, consistently delegating to the timeout-aware blob container method.

server/src/main/java/org/opensearch/cluster/routing/remote/NoopRemoteRoutingTableService.java (1)

125-133: LGTM!

Noop overloads consistent with the existing patterns in this class.

server/src/test/java/org/opensearch/common/blobstore/EncryptedBlobContainerTests.java (1)

317-328: LGTM!

Clean test verifying timeout-aware delegation from EncryptedBlobContainer to the underlying BlobContainer.

server/src/main/java/org/opensearch/common/settings/ClusterSettings.java (1)

784-786: LGTM!

New cleanup settings are properly registered alongside existing remote cluster state settings. The settings definitions (from RemoteClusterStateCleanupManager) include appropriate validation bounds and are correctly scoped as NodeScope + Dynamic.

server/src/test/java/org/opensearch/common/blobstore/BlobContainerTests.java (1)

25-33: LGTM!

Good use of doCallRealMethod to exercise the default interface method and verify it delegates to the non-timeout overload.

server/src/test/java/org/opensearch/index/translog/transfer/BlobStoreTransferServiceTests.java (2)

78-96: LGTM on new mock scaffolding.

The mock fields and wiring are cleanly added without affecting existing tests (which construct their own TransferService instances).


319-327: LGTM on timeout delegation test.

Cleanly verifies that deleteBlobs(path, fileNames, timeout) delegates to deleteBlobsIgnoringIfNotExists(fileNames, timeout) on the underlying container.

server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java (2)

297-325: LGTM on timeout-aware verifications in testDeleteClusterMetadata.

All deletion operations correctly verified with CLUSTER_STATE_CLEANUP_TIMEOUT_DEFAULT, and the never() + verifyNoMoreInteractions combination ensures completeness.


827-906: Good coverage of new settings defaults and dynamic updates.

These tests thoroughly validate default values, dynamic updates via clusterSettings.applySettings, and correct propagation.

server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java (4)

52-56: LGTM on new default constants.

Clear naming and appropriate default values. CLUSTER_STATE_CLEANUP_TIMEOUT_DEFAULT = TimeValue.MINUS_ONE as "no timeout" is consistent with OpenSearch conventions.


73-105: LGTM on new settings.

Batch size minimum of RETAINED_MANIFESTS + SKIP_CLEANUP_STATE_CHANGES (= 20) ensures there's always room for retention. Max batches minimum of 1 and timeout minimum of MINUS_ONE are appropriate.
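A standalone sketch of the bound being described (the 10/10 split of the 20-manifest floor is an assumption for illustration; only the sum of 20 is stated above):

```java
public class BatchSizeBoundsSketch {
    // Assumed split of the 20-manifest minimum mentioned in the review
    static final int RETAINED_MANIFESTS = 10;
    static final int SKIP_CLEANUP_STATE_CHANGES = 10;

    // Reject batch sizes that leave no room for the retained set
    static int validateBatchSize(int value) {
        int min = RETAINED_MANIFESTS + SKIP_CLEANUP_STATE_CHANGES;
        if (value < min) {
            throw new IllegalArgumentException("batch size must be >= " + min + ", got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(validateBatchSize(1000));
        try {
            validateBatchSize(5);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```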


426-453: Good ordering: manifests deleted last.

Deleting stale routing paths and routing diff paths before their manifests avoids leaving dangling routing files if a subsequent batch or run fails. This aligns with the PR objective.


486-543: LGTM on the batch-processing loop structure.

The batch loop with configurable cleanupBatchSize and cleanupMaxBatches, the guard via deleteStaleMetadataRunning, and the finally block releasing the lock are well-structured. The early exit when batchManifests.size() <= manifestsToRetain correctly terminates processing.
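The loop's shape can be modeled with a small self-contained sketch (hypothetical names; the real code operates on blob listings, not an in-memory list). Manifests are assumed sorted oldest-first; each batch deletes up to `batchSize` of the oldest entries while always keeping the newest `retain`, and processing stops when either only the retained set remains or `maxBatches` is exhausted:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedCleanupSketch {
    static int cleanupBatches(List<String> manifests, int batchSize, int maxBatches, int retain) {
        int batchesRun = 0;
        while (batchesRun < maxBatches && manifests.size() > retain) {
            // never delete into the newest `retain` manifests
            int staleCount = Math.min(batchSize, manifests.size() - retain);
            manifests.subList(0, staleCount).clear(); // drop the oldest staleCount entries
            batchesRun++;
        }
        return batchesRun;
    }

    public static void main(String[] args) {
        List<String> manifests = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            manifests.add("manifest-" + i);
        }
        int runs = cleanupBatches(manifests, 10, 5, 2);
        System.out.println(runs + " batches, " + manifests.size() + " left");
    }
}
```

With 25 manifests, a batch size of 10, and 2 retained, the loop runs three batches (10, 10, then 3 deletions) and stops with the retained pair intact.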



Comment on lines +340 to +362
assertBusy(() -> {
List<String> currentManifests = repository.blobStore()
.blobContainer(manifestContainerPath)
.listBlobsByPrefix("manifest")
.keySet()
.stream()
.sorted()
.collect(java.util.stream.Collectors.toList());
assertTrue("Manifests should be cleaned up in batches", currentManifests.size() < initialManifests.size());
});

List<String> finalManifests = repository.blobStore()
.blobContainer(manifestContainerPath)
.listBlobsByPrefix("manifest")
.keySet()
.stream()
.sorted()
.collect(java.util.stream.Collectors.toList());
List<String> last10Final = finalManifests.subList(Math.max(0, finalManifests.size() - 10), finalManifests.size());

assertEquals("Last 10 manifest files should remain the same after cleanup", last10Initial, last10Final);

}

⚠️ Potential issue | 🟡 Minor

Missing cleanup disable before test teardown — may cause flaky shutdown.

Other tests in this class (e.g., testRemoteCleanupDeleteStale at lines 161-168, testRemoteCleanupDeleteStaleIndexRoutingFiles at lines 239-244) disable the cleanup interval (-1) after assertions to avoid race conditions during shutdown. This test omits that step, which could lead to intermittent failures if a cleanup task fires during cluster teardown.

Proposed fix: disable cleanup before test ends
         assertEquals("Last 10 manifest files should remain the same after cleanup", last10Initial, last10Final);

+        // disable the clean up to avoid race condition during shutdown
+        response = client().admin()
+            .cluster()
+            .prepareUpdateSettings()
+            .setPersistentSettings(Settings.builder().put(REMOTE_CLUSTER_STATE_CLEANUP_INTERVAL_SETTING.getKey(), "-1"))
+            .get();
+        assertTrue(response.isAcknowledged());
     }
🤖 Prompt for AI Agents
In
`@server/src/internalClusterTest/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerIT.java`
around lines 340 - 362, This test omits disabling the periodic cleanup before
teardown which can race with shutdown; after the final assertions (after
computing last10Final) update the cleanup interval to -1 the same way other
tests do so the background task stops — e.g., call the
RemoteClusterStateCleanupManager cleanup interval setting setter (set the
RemoteClusterStateCleanupManager cleanup interval setting to -1) or invoke the
same helper used in
testRemoteCleanupDeleteStale/testRemoteCleanupDeleteStaleIndexRoutingFiles to
disable the cleanup task before the test exits.

Comment on lines +217 to +230
try {
this.deleteStaleClusterMetadata(
currentAppliedState.getClusterName().value(),
currentAppliedState.metadata().clusterUUID(),
RETAINED_MANIFESTS
);
lastCleanupAttemptStateVersion = cleanUpAttemptStateVersion;
} catch (RemoteStateCleanupFailedException e) {
logger.error(
"Failed to clean up stale remote cluster state files for cluster [{}] with uuid [{}]",
currentAppliedState.getClusterName().value(),
currentAppliedState.metadata().clusterUUID()
);
}

⚠️ Potential issue | 🔴 Critical

catch (RemoteStateCleanupFailedException) is dead code — lastCleanupAttemptStateVersion is always advanced.

deleteStaleClusterMetadata (line 480) submits the batched cleanup to threadpool.executor(REMOTE_PURGE).execute(…) and returns immediately without throwing. The RemoteStateCleanupFailedException is thrown inside the executor's Runnable (line 531), which is on a different call stack. It never propagates back to this try-catch.

Consequence: lastCleanupAttemptStateVersion = cleanUpAttemptStateVersion at line 223 always executes, regardless of cleanup success or failure. This defeats the stated PR objective of preventing version advancement on failure, and means a failed cleanup still moves the version forward — causing the next cleanUpStaleFiles invocation to skip cleanup if the state-version delta is below the threshold. Stale files will pile up exactly as described in issue #20564.

Additionally, the caught exception e is not included in the logger.error call (line 225), so even if this path were reachable, the stack trace would be lost.

A possible fix is to move the lastCleanupAttemptStateVersion update into the async success path, or to make deleteStaleClusterMetadata synchronous since cleanUpStaleFiles already runs on REMOTE_PURGE. For example:

Sketch: move version update into the async block
-            try {
-                this.deleteStaleClusterMetadata(
-                    currentAppliedState.getClusterName().value(),
-                    currentAppliedState.metadata().clusterUUID(),
-                    RETAINED_MANIFESTS
-                );
-                lastCleanupAttemptStateVersion = cleanUpAttemptStateVersion;
-            } catch (RemoteStateCleanupFailedException e) {
-                logger.error(
-                    "Failed to clean up stale remote cluster state files for cluster [{}] with uuid [{}]",
-                    currentAppliedState.getClusterName().value(),
-                    currentAppliedState.metadata().clusterUUID()
-                );
-            }
+            this.deleteStaleClusterMetadata(
+                currentAppliedState.getClusterName().value(),
+                currentAppliedState.metadata().clusterUUID(),
+                RETAINED_MANIFESTS,
+                cleanUpAttemptStateVersion
+            );

Then inside deleteStaleClusterMetadata, after the batch loop completes successfully:

// inside the try block, after the while loop
lastCleanupAttemptStateVersion = cleanUpAttemptStateVersion;
🤖 Prompt for AI Agents
In
`@server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java`
around lines 217 - 230, The try/catch around deleteStaleClusterMetadata in
cleanUpStaleFiles is ineffective because deleteStaleClusterMetadata submits work
asynchronously via threadpool.executor(REMOTE_PURGE).execute(...) and returns
immediately, so RemoteStateCleanupFailedException thrown in the Runnable never
reaches this catch and lastCleanupAttemptStateVersion is always advanced; fix by
moving the update of lastCleanupAttemptStateVersion into the asynchronous
success path (i.e., set lastCleanupAttemptStateVersion =
cleanUpAttemptStateVersion only after the cleanup loop inside
deleteStaleClusterMetadata completes without error), and ensure when catching
RemoteStateCleanupFailedException inside the Runnable you log the exception
(include the Throwable) and do NOT advance lastCleanupAttemptStateVersion there;
alternatively make deleteStaleClusterMetadata synchronous if you prefer to keep
the update in cleanUpStaleFiles.

Comment on lines +525 to +538
} catch (Exception e) {
logger.error(
"Exception occurred while deleting Remote Cluster Metadata for clusterUUIDs [{}]. Exception: {}",
clusterUUID,
e
);
throw new RemoteStateCleanupFailedException(
new ParameterizedMessage(
"Exception occurred while deleting Remote Cluster Metadata for clusterUUIDs [{}]. Exception: {}",
clusterUUID,
e
),
e
);

⚠️ Potential issue | 🟠 Major

RemoteStateCleanupFailedException thrown inside executor's Runnable becomes an uncaught exception.

Since this exception is thrown from within threadpool.executor(…).execute(Runnable), it is not caught by any application-level handler. It will propagate to the thread's UncaughtExceptionHandler (which in OpenSearch logs and replaces the thread). The finally block at line 539 correctly resets deleteStaleMetadataRunning, so the lock is released.

However, after the critical bug above is fixed (making the version advancement conditional on success), re-evaluate whether this throw is still needed or whether logging + setting a failure flag is more appropriate.
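The flag-plus-conditional-advancement pattern suggested here can be sketched in isolation (all names hypothetical; this models the control flow, not the manager's API). The version advances only when the async body succeeds, and a failure is recorded in a flag rather than escaping the worker thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncCleanupSketch {
    private final AtomicLong lastCleanupAttemptStateVersion = new AtomicLong(-1);
    private final AtomicBoolean lastCleanupFailed = new AtomicBoolean(false);

    void cleanUpStaleFiles(ExecutorService remotePurge, long attemptVersion, Runnable deleteBody) {
        remotePurge.execute(() -> {
            try {
                deleteBody.run(); // the batched deletions
                lastCleanupAttemptStateVersion.set(attemptVersion); // advance only on success
                lastCleanupFailed.set(false);
            } catch (Exception e) {
                lastCleanupFailed.set(true); // observable by the next scheduled run
                System.err.println("cleanup failed: " + e.getMessage()); // log, don't rethrow
            }
        });
    }

    public static void main(String[] args) throws Exception {
        AsyncCleanupSketch m = new AsyncCleanupSketch();
        ExecutorService exec = Executors.newSingleThreadExecutor();
        // failed attempt: version must not advance
        m.cleanUpStaleFiles(exec, 42L, () -> { throw new RuntimeException("timeout"); });
        // successful attempt: version advances
        m.cleanUpStaleFiles(exec, 43L, () -> {});
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(m.lastCleanupAttemptStateVersion.get() + " " + m.lastCleanupFailed.get());
    }
}
```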

🤖 Prompt for AI Agents
In
`@server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManager.java`
around lines 525 - 538, The Runnable submitted via threadpool.executor in
RemoteClusterStateCleanupManager currently throws
RemoteStateCleanupFailedException which becomes an uncaught exception on the
worker thread; instead remove the throw from inside the executor and replace it
with logging plus setting a failure indicator so callers can observe the failure
without crashing the thread. Specifically, in the Runnable (the block that logs
"Exception occurred while deleting Remote Cluster Metadata...") stop rethrowing
RemoteStateCleanupFailedException, keep/emphasize logger.error with the
exception, and set a class-level failure flag or status field (e.g., a boolean
like deleteStaleMetadataFailed or a status enum) so the outer manager can react;
update any code that previously relied on the exception to check that flag (and
re-evaluate that behavior once you make version advancement conditional on
success).

Comment on lines +1172 to +1177
List<List<BlobMetadata>> staleInvocations = staleCaptor.getAllValues();
assertTrue(
"Second invocation should have manifest-90 in stale list",
staleInvocations.get(0).stream().anyMatch(blob -> "manifest-91.dat".equals(blob.name()))
);
}

⚠️ Potential issue | 🟡 Minor

Assertion message doesn't match the actual check.

The message says "manifest-90" but the predicate checks for "manifest-91.dat". Based on the test setup, the stale manifest in the second batch is indeed manifest-91.dat.

Suggested fix
         assertTrue(
-            "Second invocation should have manifest-90 in stale list",
+            "Second invocation should have manifest-91 in stale list",
             staleInvocations.get(0).stream().anyMatch(blob -> "manifest-91.dat".equals(blob.name()))
         );
🤖 Prompt for AI Agents
In
`@server/src/test/java/org/opensearch/gateway/remote/RemoteClusterStateCleanupManagerTests.java`
around lines 1172 - 1177, Update the assertion message in
RemoteClusterStateCleanupManagerTests to accurately reflect the checked value:
the test currently asserts that the second invocation contains "manifest-91.dat"
via staleInvocations/staleCaptor but the message says "manifest-90"; change the
message to refer to "manifest-91.dat" (or alternately adjust the predicate to
check for "manifest-90" if that was intended) so the assertion message and the
anyMatch check on staleInvocations.get(0).stream() are consistent.

@github-actions

github-actions bot commented Feb 6, 2026

✅ Gradle check result for 181b0fb: SUCCESS

@codecov

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 88.99083% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.22%. Comparing base (93ae8db) to head (4e8d43a).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...teway/remote/RemoteClusterStateCleanupManager.java 94.44% 3 Missing and 1 partial ⚠️
...rg/opensearch/repositories/s3/S3BlobContainer.java 72.72% 2 Missing and 1 partial ⚠️
.../routing/remote/NoopRemoteRoutingTableService.java 0.00% 2 Missing ⚠️
...ster/routing/remote/RemoteRoutingTableService.java 0.00% 2 Missing ⚠️
...earch/index/translog/transfer/TransferService.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20515      +/-   ##
============================================
- Coverage     73.33%   73.22%   -0.11%     
+ Complexity    72125    72110      -15     
============================================
  Files          5798     5800       +2     
  Lines        329654   329814     +160     
  Branches      47491    47522      +31     
============================================
- Hits         241741   241521     -220     
- Misses        68504    68962     +458     
+ Partials      19409    19331      -78     


Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@github-actions

github-actions bot commented Feb 6, 2026

✅ Gradle check result for 4e8d43a: SUCCESS

@Pranshu-S
Contributor Author

Closing this PR in favor of #20566


Labels

bug (Something isn't working), Cluster Manager

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[BUG] Remote Cluster State cleanup failures due to deletions timeouts resulting in stale-metadata pile-ups

2 participants