Implementing batched deletions of stale ClusterMetadataManifests in R… by Pranshu-S · Pull Request #20566 · opensearch-project/OpenSearch

Pranshu-S · 2026-02-06T16:10:15Z

Description

Fixes remote cluster state cleanup failures that caused stale metadata to pile up in remote storage when deletions timed out, along with the other fixes mentioned in the linked GitHub issue.

Issue context (#20564): RemoteClusterStateCleanupManager runs every ~5 minutes (configurable) and performs sequential deletions (global metadata → index metadata → ephemeral attrs → manifests). A recent change added a 30s timeout to the S3 “sync” delete path; with large delete sets this can throw IOException, abort the whole cleanup run (single try/catch), and leave later phases undeleted—making the next run even larger and more likely to fail.

What this PR does

- Batch deletes stale manifest* blobs to reduce per-call delete size and avoid timeout/payload issues. Controlled by:
  - cluster.remote_store.state.cleanup.batch_size
  - cluster.remote_store.state.cleanup.max_batches
- Makes the delete timeout configurable for remote-state cleanup.
- Fixes an issue where the update-interval call was not honoured when deletion was disabled (set to -1).
- Fixes a mismatch in the deletion workflow where manifests were deleted before the IndexRouting paths, which could leave IndexRouting paths dangling.
- Fixes an issue where lastCleanupAttemptStateVersion was advanced even when the cleanup failed, which made the next deletes a no-op if there were no cluster-state changes (or fewer than 10) after the failed deletion.
- Reduces DEBUG logging overhead: the log line that printed every single file to delete is moved to TRACE.
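The batching behaviour described in the list above can be pictured with a minimal sketch. It is not the PR's code: the class name `BatchedCleanupSketch`, the method `deleteInBatches`, and the `Consumer`-based deleter are hypothetical, and the two integer parameters only stand in for the `batch_size` and `max_batches` settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of bounded, batched deletion; not the PR's implementation. */
class BatchedCleanupSketch {

    /**
     * Hands stale names to the deleter in batches of at most batchSize,
     * stopping after maxBatches batches. Returns how many names were handed over.
     */
    static int deleteInBatches(List<String> stale, int batchSize, int maxBatches,
                               Consumer<List<String>> deleter) {
        if (batchSize <= 0 || maxBatches <= 0) {
            throw new IllegalArgumentException("batchSize and maxBatches must be positive");
        }
        int deleted = 0;
        int batches = 0;
        for (int from = 0; from < stale.size() && batches < maxBatches; from += batchSize, batches++) {
            int to = Math.min(from + batchSize, stale.size());
            List<String> batch = new ArrayList<>(stale.subList(from, to));
            deleter.accept(batch); // one bounded delete call per batch
            deleted += batch.size();
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> stale = List.of("manifest-1", "manifest-2", "manifest-3", "manifest-4", "manifest-5");
        List<List<String>> calls = new ArrayList<>();
        int deleted = deleteInBatches(stale, 2, 2, calls::add);
        System.out.println(deleted + " deleted in " + calls.size() + " calls");
    }
}
```

With a batch size of 2 and a cap of 2 batches, the fifth entry is left for a later run. Each run therefore does a bounded amount of work instead of issuing one ever-growing delete.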

Related Issues

Resolves #20564

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

coderabbitai · 2026-02-06T16:10:24Z

Walkthrough

This PR adds configurable timeout and batched deletion support for remote cluster state cleanup operations. New settings control batch size, maximum batches, and deletion timeout. Timeout parameters flow through the deletion API chain from cleanup manager through transfer service to blob containers. S3BlobContainer gains configurable timeout support instead of hardcoded 30-second timeouts.
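One way to picture how a timeout can flow through a delegation chain via an interface overload is sketched below. Everything here is an assumption made for illustration: `SketchContainer`, `TimeoutAwareContainer`, and `ForwardingContainer` are hypothetical types, `java.time.Duration` stands in for `TimeValue`, and the default method simply ignoring the timeout is a guess at a plausible fallback, not a statement about what the PR's default does.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a blob container; not the OpenSearch interface. */
interface SketchContainer {
    void deleteIgnoringMissing(List<String> names);

    /** Assumed fallback: stores without timeout support ignore the timeout. */
    default void deleteIgnoringMissing(List<String> names, Duration timeout) {
        deleteIgnoringMissing(names);
    }
}

/** A store that honours the timeout by overriding the overload. */
class TimeoutAwareContainer implements SketchContainer {
    final List<Duration> seenTimeouts = new ArrayList<>();

    @Override
    public void deleteIgnoringMissing(List<String> names) {
        // 30 seconds mirrors the previously hardcoded value mentioned in the walkthrough.
        deleteIgnoringMissing(names, Duration.ofSeconds(30));
    }

    @Override
    public void deleteIgnoringMissing(List<String> names, Duration timeout) {
        seenTimeouts.add(timeout); // a real store would bound its delete call with this value
    }
}

/** A wrapper that forwards the timeout to its delegate unchanged. */
class ForwardingContainer implements SketchContainer {
    private final SketchContainer delegate;

    ForwardingContainer(SketchContainer delegate) {
        this.delegate = delegate;
    }

    @Override
    public void deleteIgnoringMissing(List<String> names) {
        delegate.deleteIgnoringMissing(names);
    }

    @Override
    public void deleteIgnoringMissing(List<String> names, Duration timeout) {
        delegate.deleteIgnoringMissing(names, timeout);
    }
}

class TimeoutPropagationDemo {
    public static void main(String[] args) {
        TimeoutAwareContainer store = new TimeoutAwareContainer();
        new ForwardingContainer(store).deleteIgnoringMissing(List.of("blob-1"), Duration.ofSeconds(5));
        System.out.println(store.seenTimeouts);
    }
}
```

Adding the overload as a default method keeps existing implementations compiling, while wrappers and timeout-aware stores can opt in by overriding it.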

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| S3 Blob Container Timeout Support: `plugins/repository-s3/src/main/java/.../S3BlobContainer.java`, `plugins/repository-s3/src/test/java/.../S3BlobContainerTimeoutTests.java` | Added `getFutureValue(T, TimeValue timeout)` method and `deleteBlobsIgnoringIfNotExists(List<String>, TimeValue timeout)` overload. Replaced hardcoded 30-second timeout with configurable timeout parameter. Updated error messages to reflect actual timeout values. New test class validates timeout propagation and behavior. |
| Blob Container API Extensions: `server/src/main/java/.../BlobContainer.java`, `server/src/main/java/.../EncryptedBlobContainer.java`, `server/src/test/java/.../BlobContainerTests.java`, `server/src/test/java/.../EncryptedBlobContainerTests.java` | Added default `deleteBlobsIgnoringIfNotExists(List<String>, TimeValue timeout)` method to BlobContainer interface and concrete implementation in EncryptedBlobContainer. New tests verify timeout delegation and default handling. |
| Transfer Service Timeout Overloads: `server/src/main/java/.../TransferService.java`, `server/src/main/java/.../BlobStoreTransferService.java`, `server/src/test/java/.../BlobStoreTransferServiceTests.java` | Added `deleteBlobs(Iterable<String>, List<String>, TimeValue timeout)` default method to TransferService interface and implementation in BlobStoreTransferService. New test validates timeout-aware deletion delegation. |
| Remote Routing Table Service Timeout Methods: `server/src/main/java/.../RemoteRoutingTableService.java`, `server/src/main/java/.../InternalRemoteRoutingTableService.java`, `server/src/main/java/.../NoopRemoteRoutingTableService.java`, `server/src/test/java/.../RemoteRoutingTableServiceTests.java` | Added timeout-accepting overloads for `deleteStaleIndexRoutingPaths` and `deleteStaleIndexRoutingDiffPaths` across the service hierarchy. Interface provides default unsupported exception; implementations delegate with timeout. Four new test methods verify timeout parameter propagation. |
| Cleanup Configuration and Settings: `server/src/main/java/.../RemoteClusterStateCleanupManager.java`, `server/src/main/java/.../ClusterSettings.java` | Added three new settings: `REMOTE_CLUSTER_STATE_CLEANUP_BATCH_SIZE_SETTING`, `REMOTE_CLUSTER_STATE_CLEANUP_TIMEOUT_SETTING`, `REMOTE_CLUSTER_STATE_CLEANUP_MAX_BATCHES_SETTING`. Wired runtime configuration with dynamic update consumers. Registered settings in ClusterSettings. |
| Cleanup Manager Batched Deletion: `server/src/main/java/.../RemoteClusterStateCleanupManager.java`, `server/src/internalClusterTest/java/.../RemoteClusterStateCleanupManagerIT.java`, `server/src/test/java/.../RemoteClusterStateCleanupManagerTests.java` | Refactored `deleteStaleClusterMetadata` to process manifests in configurable batches instead of all-at-once. Added batch counter, condition checks against manifest retention threshold, and exhaustion logic. Integrated `cleanupTimeout` into `deleteStalePaths` calls. Made `getBlobStoreTransferService()` package-private for testing. Updated tests with timeout awareness and new integration test for multi-batch deletion. |
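The idea of waiting on an asynchronous delete with a caller-supplied timeout, and reporting the actual timeout in the error message, might look roughly like the sketch below. It is only an illustration under stated assumptions: `FutureTimeoutSketch` is a hypothetical class, `java.time.Duration` replaces `TimeValue`, and the exception wrapping and message text are invented rather than taken from `S3BlobContainer`.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration of a timeout-bounded wait; not the S3 plugin's code. */
class FutureTimeoutSketch {

    /** Waits for the future up to the given timeout and converts failures to IOException. */
    static <T> T getFutureValue(CompletableFuture<T> future, Duration timeout) throws IOException {
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting on work nobody will consume
            throw new IOException("Delete timed out after " + timeout.toMillis() + " ms", e);
        } catch (ExecutionException e) {
            throw new IOException("Delete failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IOException("Interrupted while waiting for delete", e);
        }
    }

    public static void main(String[] args) {
        try {
            // A future that never completes, to exercise the timeout path.
            getFutureValue(new CompletableFuture<Void>(), Duration.ofMillis(100));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the timeout as a parameter lets each caller, such as a cleanup job that deletes large sets, choose a bound that suits its workload.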

Sequence Diagram

sequenceDiagram
    actor Scheduler
    participant RCSCM as RemoteClusterStateCleanupManager
    participant RoutingService as InternalRemoteRoutingTableService
    participant TransferService as BlobStoreTransferService
    participant BlobContainer as BlobContainer
    participant S3 as S3BlobContainer

    Scheduler->>RCSCM: deleteStaleClusterMetadata(clusterName, UUID, retainCount)
    note over RCSCM: Load manifest files
    loop For each batch (limited by cleanupMaxBatches)
        RCSCM->>RCSCM: Extract next batch (cleanupBatchSize)
        alt Batch manifests exceed retainCount
            RCSCM->>RCSCM: deleteStalePaths (global metadata)
            RCSCM->>TransferService: deleteBlobs(path, files, cleanupTimeout)
            TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            BlobContainer->>S3: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            S3->>S3: getFutureValue(future, cleanupTimeout)
            S3-->>BlobContainer: deletion complete
            BlobContainer-->>TransferService: success
            TransferService-->>RCSCM: success
            
            RCSCM->>RoutingService: deleteStaleIndexRoutingPaths(paths, cleanupTimeout)
            RoutingService->>TransferService: deleteBlobs(path, files, cleanupTimeout)
            TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            BlobContainer-->>RoutingService: success
            RoutingService-->>RCSCM: success
            
            RCSCM->>RCSCM: deleteStalePaths (manifest files)
        else Insufficient manifests
            RCSCM->>RCSCM: Skip batch, log warning
        end
    end
    RCSCM->>RCSCM: Update lastCleanupAttemptStateVersion
    RCSCM-->>Scheduler: Cleanup complete
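The ordering in the diagram (global metadata, then index routing paths, then manifests) together with the PR description's fix of not advancing `lastCleanupAttemptStateVersion` on failure can be sketched as follows. This is a simplified illustration, not the manager's code: `CleanupRunSketch`, its `Phase` interface, and `runCleanup` are hypothetical names, and the phases are plain callbacks rather than real remote deletes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of ordered cleanup phases; not RemoteClusterStateCleanupManager. */
class CleanupRunSketch {

    /** One deletion phase that may fail, for example on a timeout. */
    interface Phase {
        void run() throws Exception;
    }

    long lastCleanupAttemptStateVersion = 0;

    /** Runs the phases in order and advances the version marker only if all of them succeed. */
    boolean runCleanup(long currentStateVersion, List<Phase> phases) {
        try {
            for (Phase phase : phases) {
                phase.run();
            }
        } catch (Exception e) {
            return false; // marker untouched, so the next run retries instead of becoming a no-op
        }
        lastCleanupAttemptStateVersion = currentStateVersion;
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Manifests go last so the paths they reference are never left dangling.
        List<Phase> phases = List.of(
            () -> log.add("global metadata"),
            () -> log.add("index routing paths"),
            () -> log.add("manifests"));
        CleanupRunSketch run = new CleanupRunSketch();
        boolean ok = run.runCleanup(42L, phases);
        System.out.println(ok + " " + log + " version=" + run.lastCleanupAttemptStateVersion);
    }
}
```

Deleting manifests last means a failure in an earlier phase still leaves the manifests in place, so a later run can find the same stale files again.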

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Batch deletes in remote segment store #20146: Adds batch deletion API that directly depends on the TimeValue-aware deleteBlobsIgnoringIfNotExists overload introduced in this PR to handle timeout parameters.

Suggested labels

bug, remote-store, cleanup

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'Implementing batched deletions of stale ClusterMetadataManifests in R...' clearly summarizes the main change: implementing batched deletions as the primary solution to the remote cluster state cleanup failures. |
| Description check | ✅ Passed | The PR description thoroughly details the problem context, specific fixes implemented (batching, configurable timeouts, workflow reordering, logging changes), and references the related issue `#20564`. |
| Linked Issues check | ✅ Passed | The PR fully addresses objectives from issue `#20564`: implements batched deletions with configurable batch_size/max_batches, makes timeout configurable, fixes deletion order (manifests after IndexRouting paths), prevents moving lastCleanupAttemptStateVersion on failure, and reduces logging overhead. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue `#20564`: batch deletion infrastructure, timeout configuration, deletion order fixes, state tracking fixes, and related logging optimizations; no unrelated modifications detected. |

github-actions · 2026-02-06T17:01:36Z

❌ Gradle check result for 7cd15b5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-02-06T18:28:10Z

❌ Gradle check result for 05a8083: FAILURE

Pranshu-S · 2026-02-07T02:33:56Z

Flaky test -

[org.opensearch.example.systemingestprocessor.ExampleSystemIngestProcessorClientYamlTestSuiteIT.test {yaml=example-system-ingest-processor/20_system_ingest_processor/Processor injects field when index is created from matching template where trigger_setting is true}](https://build.ci.opensearch.org/job/gradle-check/71055/testReport/junit/org.opensearch.example.systemingestprocessor/ExampleSystemIngestProcessorClientYamlTestSuiteIT/test__yaml_example_system_ingest_processor_20_system_ingest_processor_Processor_injects_field_when_index_is_created_from_matching_template_where_trigger_setting_is_true_/)

github-actions · 2026-02-07T03:43:12Z

❌ Gradle check result for ceae963: FAILURE

Pranshu-S · 2026-02-07T03:47:17Z

Looks like some issues with BwC tests

» [2026-02-07T03:11:30,001][INFO ][o.o.c.s.ClusterManagerService] [v2.19.5-remote-0] Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:4 and sample tasks: elected-as-cluster-manager ([2] nodes joined)[{v2.19.5-remote-0}{QCAW9Wz-S9eqbEMAwegJKg}{ioB2VK4DRL-C-xSgEFmPeQ}{127.0.0.1}{127.0.0.1:35119}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true} elect leader, {v2.19.5-remote-2}{i4ANL5k_TJCv-W2I-WjMeQ}{nsQZoJMFR3WL0pwdo-RlJw}{127.0.0.1}{127.0.0.1:41821}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} elect leader, _BECOME_CLUSTER_MANAGER_TASK_, _FINISH_ELECTION_], term: 53, version: 345, delta: cluster-manager node changed {previous [], current [{v2.19.5-remote-0}{QCAW9Wz-S9eqbEMAwegJKg}{ioB2VK4DRL-C-xSgEFmPeQ}{127.0.0.1}{127.0.0.1:35119}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}]}
» [2026-02-07T03:11:30,021][INFO ][o.o.c.c.FollowersChecker ] [v2.19.5-remote-0] FollowerChecker{discoveryNode={v2.19.5-remote-1}{t0pClBcpTbqa1UYnnxSLow}{Vdf3ZRR-QZOC6KkSSKFmiw}{127.0.0.1}{127.0.0.1:43807}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} disconnected
»  org.opensearch.transport.NodeNotConnectedException: [v2.19.5-remote-1][127.0.0.1:43807] Node not connected
»  	at org.opensearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:223)
»  	at org.opensearch.transport.TransportService.getConnection(TransportService.java:990)
»  	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:949)

github-actions · 2026-02-07T05:15:36Z

✅ Gradle check result for c4d48a5: SUCCESS

codecov · 2026-02-07T05:16:21Z

Codecov Report

❌ Patch coverage is 78.76106% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.26%. Comparing base (db0a16d) to head (c04e810).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...teway/remote/RemoteClusterStateCleanupManager.java	79.54%	15 Missing and 3 partials ⚠️
...ting/remote/InternalRemoteRoutingTableService.java	76.00%	4 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #20566      +/-   ##
============================================
+ Coverage     73.19%   73.26%   +0.06%     
- Complexity    71924    72012      +88     
============================================
  Files          5781     5781              
  Lines        329292   329393     +101     
  Branches      47514    47525      +11     
============================================
+ Hits         241026   241315     +289     
+ Misses        68925    68738     -187     
+ Partials      19341    19340       -1

github-actions · 2026-02-15T16:35:27Z

❌ Gradle check result for 33f7d63: FAILURE

github-actions · 2026-02-16T04:14:31Z

✅ Gradle check result for c04e810: SUCCESS

Bukhtawar

Thanks for the changes, looks good to me

opensearch-project#20566) * Implementing batched deletions of stale ClusterMetadataManifests in RemoteClusterStateCleanupManager and adjusting the timeouts Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

github-actions bot added bug Something isn't working Cluster Manager labels Feb 7, 2026

github-project-automation bot added this to Cluster Manager Project Board Feb 7, 2026

Pranshu-S marked this pull request as ready for review February 7, 2026 06:36

Pranshu-S requested review from Bukhtawar, CEHENKLE, Rishikesh1159, anasalkouz, andrross, ashking94, cwperks, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, owaiskazi19, reta, sachinpkale, saratvemulapalli and shwetathareja as code owners February 7, 2026 06:36

Retry Build

33f7d63

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

Retry Build

c04e810

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

Bukhtawar approved these changes Feb 16, 2026

github-project-automation bot moved this to 👀 In review in Cluster Manager Project Board Feb 16, 2026

Bukhtawar merged commit 4b02d64 into opensearch-project:main Feb 16, 2026
35 checks passed

github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Feb 16, 2026

andrross mentioned this pull request Mar 21, 2026

Add release notes for 3.6.0 andrross/OpenSearch#215

Closed

1 task

opensearch-ci-bot mentioned this pull request Mar 21, 2026

[AUTOCUT] Gradle Check Flaky Test Report for OpenSearchTestBasePluginFuncTest #20955

Closed
