Implementing batched deletions of stale ClusterMetadataManifests in R… #20566

Merged
Bukhtawar merged 13 commits into opensearch-project:main from Pranshu-S:remote-state-cleanup-manager-batched-deletions-and-timeouts-pr on Feb 16, 2026

@Pranshu-S
Contributor

@Pranshu-S Pranshu-S commented Feb 6, 2026

Description

Fixes remote cluster state cleanup failures that caused stale metadata to pile up in remote storage when deletions timed out, along with the other fixes mentioned in the tagged GitHub issue.

Issue context (#20564): RemoteClusterStateCleanupManager runs every ~5 minutes (configurable) and performs sequential deletions (global metadata → index metadata → ephemeral attrs → manifests). A recent change added a 30s timeout to the S3 “sync” delete path; with large delete sets this can throw IOException, abort the whole cleanup run (single try/catch), and leave later phases undeleted—making the next run even larger and more likely to fail.

What this PR does

  • Batch deletes stale manifest* blobs to reduce per-call delete size and avoid timeout/payload issues.
    • Controlled by:
      • cluster.remote_store.state.cleanup.batch_size
      • cluster.remote_store.state.cleanup.max_batches
  • Makes the delete timeout configurable for remote-state cleanup.
  • Fixes an issue where an update to the cleanup interval was not honoured when deletion had been disabled (interval set to -1).
  • Fixes a mismatch in the deletion workflow: manifests were deleted before the IndexRouting paths, which could leave IndexRouting paths dangling.
  • Fixes an issue where lastCleanupAttemptStateVersion was advanced even when the cleanup failed, so subsequent cleanup runs became no-ops if there were no cluster-state changes (or fewer than 10) after the failed deletions.
  • Reduces DEBUG logging overhead: every file to delete was being printed at DEBUG; this is now logged at TRACE.
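
As a rough illustration of the batching idea above (not the PR's actual code), the sketch below splits a list of stale blob names into batches and caps how many batches one run may process. The class name, method name, and parameters are hypothetical; they only mirror the roles of the `batch_size` and `max_batches` settings, and the sketch ignores the manifest retention check that the real cleanup performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative and not taken from the PR.
final class BatchedDeleteSketch {

    /**
     * Splits stale blob names into batches of at most batchSize entries and
     * returns at most maxBatches of them. Anything beyond the cap is simply
     * left for a later cleanup run.
     */
    static List<List<String>> planBatches(List<String> staleBlobs, int batchSize, int maxBatches) {
        if (batchSize <= 0 || maxBatches <= 0) {
            throw new IllegalArgumentException("batchSize and maxBatches must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < staleBlobs.size() && batches.size() < maxBatches; start += batchSize) {
            int end = Math.min(start + batchSize, staleBlobs.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(staleBlobs.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> stale = List.of("manifest-1", "manifest-2", "manifest-3", "manifest-4", "manifest-5");
        // With batchSize = 2 and maxBatches = 2, two batches are planned and
        // "manifest-5" waits for the next run.
        System.out.println(planBatches(stale, 2, 2));
    }
}
```

Capping the number of batches bounds the work done by a single run, so one oversized backlog is drained over several runs instead of producing one very large delete call.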

Related Issues

Resolves #20564

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.


Walkthrough

This PR adds configurable timeout and batched deletion support for remote cluster state cleanup operations. New settings control batch size, maximum batches, and deletion timeout. Timeout parameters flow through the deletion API chain from cleanup manager through transfer service to blob containers. S3BlobContainer gains configurable timeout support instead of hardcoded 30-second timeouts.
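
To make the "configurable timeout instead of a hardcoded 30 seconds" point concrete, here is a minimal sketch of waiting on an asynchronous delete with a caller-supplied timeout and surfacing a timeout as an `IOException`. It uses only `java.util.concurrent`; the class and method names are hypothetical, and the real `getFutureValue` in S3BlobContainer takes an OpenSearch `TimeValue` and may differ in its details.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: not the S3BlobContainer implementation.
final class TimeoutWaitSketch {

    /** Waits for the future for at most timeoutMillis, mapping failures to IOException. */
    static <T> T awaitWithTimeout(CompletableFuture<T> future, long timeoutMillis) throws IOException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting and report the timeout value that was actually used.
            future.cancel(true);
            throw new IOException("delete did not complete within " + timeoutMillis + " ms", e);
        } catch (ExecutionException e) {
            throw new IOException("delete failed", e.getCause());
        } catch (InterruptedException e) {
            // Restore the interrupt flag before propagating.
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for delete", e);
        }
    }
}
```

Passing the timeout down as a parameter lets each caller choose its own limit, and including the value in the exception message keeps the error text in line with the timeout that actually applied.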

Changes

| Cohort / File(s) | Summary |
|---|---|
| **S3 Blob Container Timeout Support**<br>plugins/repository-s3/src/main/java/.../S3BlobContainer.java, plugins/repository-s3/src/test/java/.../S3BlobContainerTimeoutTests.java | Added getFutureValue(T, TimeValue timeout) method and deleteBlobsIgnoringIfNotExists(List<String>, TimeValue timeout) overload. Replaced hardcoded 30-second timeout with configurable timeout parameter. Updated error messages to reflect actual timeout values. New test class validates timeout propagation and behavior. |
| **Blob Container API Extensions**<br>server/src/main/java/.../BlobContainer.java, server/src/main/java/.../EncryptedBlobContainer.java, server/src/test/java/.../BlobContainerTests.java, server/src/test/java/.../EncryptedBlobContainerTests.java | Added default deleteBlobsIgnoringIfNotExists(List<String>, TimeValue timeout) method to BlobContainer interface and concrete implementation in EncryptedBlobContainer. New tests verify timeout delegation and default handling. |
| **Transfer Service Timeout Overloads**<br>server/src/main/java/.../TransferService.java, server/src/main/java/.../BlobStoreTransferService.java, server/src/test/java/.../BlobStoreTransferServiceTests.java | Added deleteBlobs(Iterable<String>, List<String>, TimeValue timeout) default method to TransferService interface and implementation in BlobStoreTransferService. New test validates timeout-aware deletion delegation. |
| **Remote Routing Table Service Timeout Methods**<br>server/src/main/java/.../RemoteRoutingTableService.java, server/src/main/java/.../InternalRemoteRoutingTableService.java, server/src/main/java/.../NoopRemoteRoutingTableService.java, server/src/test/java/.../RemoteRoutingTableServiceTests.java | Added timeout-accepting overloads for deleteStaleIndexRoutingPaths and deleteStaleIndexRoutingDiffPaths across the service hierarchy. Interface provides default unsupported exception; implementations delegate with timeout. Four new test methods verify timeout parameter propagation. |
| **Cleanup Configuration and Settings**<br>server/src/main/java/.../RemoteClusterStateCleanupManager.java, server/src/main/java/.../ClusterSettings.java | Added three new settings: REMOTE_CLUSTER_STATE_CLEANUP_BATCH_SIZE_SETTING, REMOTE_CLUSTER_STATE_CLEANUP_TIMEOUT_SETTING, REMOTE_CLUSTER_STATE_CLEANUP_MAX_BATCHES_SETTING. Wired runtime configuration with dynamic update consumers. Registered settings in ClusterSettings. |
| **Cleanup Manager Batched Deletion**<br>server/src/main/java/.../RemoteClusterStateCleanupManager.java, server/src/internalClusterTest/java/.../RemoteClusterStateCleanupManagerIT.java, server/src/test/java/.../RemoteClusterStateCleanupManagerTests.java | Refactored deleteStaleClusterMetadata to process manifests in configurable batches instead of all-at-once. Added batch counter, condition checks against manifest retention threshold, and exhaustion logic. Integrated cleanupTimeout into deleteStalePaths calls. Made getBlobStoreTransferService() package-private for testing. Updated tests with timeout awareness and new integration test for multi-batch deletion. |

Sequence Diagram

sequenceDiagram
    actor Scheduler
    participant RCSCM as RemoteClusterStateCleanupManager
    participant RoutingService as InternalRemoteRoutingTableService
    participant TransferService as BlobStoreTransferService
    participant BlobContainer as BlobContainer
    participant S3 as S3BlobContainer

    Scheduler->>RCSCM: deleteStaleClusterMetadata(clusterName, UUID, retainCount)
    note over RCSCM: Load manifest files
    loop For each batch (limited by cleanupMaxBatches)
        RCSCM->>RCSCM: Extract next batch (cleanupBatchSize)
        alt Batch manifests exceed retainCount
            RCSCM->>RCSCM: deleteStalePaths (global metadata)
            RCSCM->>TransferService: deleteBlobs(path, files, cleanupTimeout)
            TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            BlobContainer->>S3: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            S3->>S3: getFutureValue(future, cleanupTimeout)
            S3-->>BlobContainer: deletion complete
            BlobContainer-->>TransferService: success
            TransferService-->>RCSCM: success
            
            RCSCM->>RoutingService: deleteStaleIndexRoutingPaths(paths, cleanupTimeout)
            RoutingService->>TransferService: deleteBlobs(path, files, cleanupTimeout)
            TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
            BlobContainer-->>RoutingService: success
            RoutingService-->>RCSCM: success
            
            RCSCM->>RCSCM: deleteStalePaths (manifest files)
        else Insufficient manifests
            RCSCM->>RCSCM: Skip batch, log warning
        end
    end
    RCSCM->>RCSCM: Update lastCleanupAttemptStateVersion
    RCSCM-->>Scheduler: Cleanup complete
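
The ordering and state-tracking fixes described in this PR (routing paths deleted before manifests, and lastCleanupAttemptStateVersion advanced only when cleanup succeeds) can be sketched roughly as follows. This is a simplified illustration under assumptions: the `Deleter` interface, the phase names, and the single-pass structure are hypothetical and do not reproduce RemoteClusterStateCleanupManager.

```java
import java.io.IOException;

// Hypothetical sketch: phase names and types are illustrative only.
final class CleanupRunSketch {

    /** Stand-in for whatever performs the remote deletes for one phase. */
    interface Deleter {
        void delete(String phase) throws IOException;
    }

    private long lastCleanupAttemptStateVersion = 0L;

    long lastCleanupAttemptStateVersion() {
        return lastCleanupAttemptStateVersion;
    }

    /** Returns true if every phase succeeded; the version only moves on success. */
    boolean runCleanup(long currentStateVersion, Deleter deleter) {
        try {
            deleter.delete("global metadata");
            // Routing paths go before manifests so a failure cannot leave
            // routing files behind with no manifest pointing at them.
            deleter.delete("index routing paths");
            deleter.delete("manifests");
        } catch (IOException e) {
            // Leave the version untouched so the next run retries the cleanup.
            return false;
        }
        lastCleanupAttemptStateVersion = currentStateVersion;
        return true;
    }
}
```

Because the recorded version stays put after a failure, a later run still sees pending work even if the cluster state has not changed in the meantime.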

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

bug, remote-store, cleanup

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'Implementing batched deletions of stale ClusterMetadataManifests in R...' clearly summarizes the main change: implementing batched deletions as the primary solution to the remote cluster state cleanup failures. |
| Description check | ✅ Passed | The PR description thoroughly details the problem context, specific fixes implemented (batching, configurable timeouts, workflow reordering, logging changes), and references the related issue #20564. |
| Linked Issues check | ✅ Passed | The PR fully addresses objectives from issue #20564: implements batched deletions with configurable batch_size/max_batches, makes timeout configurable, fixes deletion order (manifests after IndexRouting paths), prevents moving lastCleanupAttemptStateVersion on failure, and reduces logging overhead. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue #20564: batch deletion infrastructure, timeout configuration, deletion order fixes, state tracking fixes, and related logging optimizations; no unrelated modifications detected. |


@github-actions
Contributor

github-actions bot commented Feb 6, 2026

❌ Gradle check result for 7cd15b5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Feb 6, 2026

❌ Gradle check result for 05a8083: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S
Contributor Author

Flaky test -

[org.opensearch.example.systemingestprocessor.ExampleSystemIngestProcessorClientYamlTestSuiteIT.test {yaml=example-system-ingest-processor/20_system_ingest_processor/Processor injects field when index is created from matching template where trigger_setting is true}](https://build.ci.opensearch.org/job/gradle-check/71055/testReport/junit/org.opensearch.example.systemingestprocessor/ExampleSystemIngestProcessorClientYamlTestSuiteIT/test__yaml_example_system_ingest_processor_20_system_ingest_processor_Processor_injects_field_when_index_is_created_from_matching_template_where_trigger_setting_is_true_/)

@github-actions
Contributor

github-actions bot commented Feb 7, 2026

❌ Gradle check result for ceae963: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Pranshu-S
Contributor Author

Looks like some issues with BwC tests

» [2026-02-07T03:11:30,001][INFO ][o.o.c.s.ClusterManagerService] [v2.19.5-remote-0] Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:4 and sample tasks: elected-as-cluster-manager ([2] nodes joined)[{v2.19.5-remote-0}{QCAW9Wz-S9eqbEMAwegJKg}{ioB2VK4DRL-C-xSgEFmPeQ}{127.0.0.1}{127.0.0.1:35119}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true} elect leader, {v2.19.5-remote-2}{i4ANL5k_TJCv-W2I-WjMeQ}{nsQZoJMFR3WL0pwdo-RlJw}{127.0.0.1}{127.0.0.1:41821}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} elect leader, _BECOME_CLUSTER_MANAGER_TASK_, _FINISH_ELECTION_], term: 53, version: 345, delta: cluster-manager node changed {previous [], current [{v2.19.5-remote-0}{QCAW9Wz-S9eqbEMAwegJKg}{ioB2VK4DRL-C-xSgEFmPeQ}{127.0.0.1}{127.0.0.1:35119}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}]}
» [2026-02-07T03:11:30,021][INFO ][o.o.c.c.FollowersChecker ] [v2.19.5-remote-0] FollowerChecker{discoveryNode={v2.19.5-remote-1}{t0pClBcpTbqa1UYnnxSLow}{Vdf3ZRR-QZOC6KkSSKFmiw}{127.0.0.1}{127.0.0.1:43807}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} disconnected
»  org.opensearch.transport.NodeNotConnectedException: [v2.19.5-remote-1][127.0.0.1:43807] Node not connected
»  	at org.opensearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:223)
»  	at org.opensearch.transport.TransportService.getConnection(TransportService.java:990)
»  	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:949)

@github-actions
Contributor

github-actions bot commented Feb 7, 2026

✅ Gradle check result for c4d48a5: SUCCESS

@codecov

codecov bot commented Feb 7, 2026

Codecov Report

❌ Patch coverage is 78.76106% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.26%. Comparing base (db0a16d) to head (c04e810).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...teway/remote/RemoteClusterStateCleanupManager.java | 79.54% | 15 Missing and 3 partials ⚠️ |
| ...ting/remote/InternalRemoteRoutingTableService.java | 76.00% | 4 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20566      +/-   ##
============================================
+ Coverage     73.19%   73.26%   +0.06%     
- Complexity    71924    72012      +88     
============================================
  Files          5781     5781              
  Lines        329292   329393     +101     
  Branches      47514    47525      +11     
============================================
+ Hits         241026   241315     +289     
+ Misses        68925    68738     -187     
+ Partials      19341    19340       -1     


Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@github-actions
Contributor

❌ Gradle check result for 33f7d63: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
@github-actions
Contributor

✅ Gradle check result for c04e810: SUCCESS

Contributor

@Bukhtawar Bukhtawar left a comment

Thanks for the changes, looks good to me

@Bukhtawar Bukhtawar merged commit 4b02d64 into opensearch-project:main Feb 16, 2026
35 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Feb 16, 2026
This was referenced Feb 18, 2026
tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
opensearch-project#20566)

* Implementing batched deletions of stale ClusterMetadataManifests in RemoteClusterStateCleanupManager and adjusting the timeouts

Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>

Labels

bug, Cluster Manager

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[BUG] Remote Cluster State cleanup failures due to deletions timeouts resulting in stale-metadata pile-ups

5 participants