Implementing batched deletions of stale ClusterMetadataManifests in R…#20566
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds configurable timeout and batched deletion support for remote cluster state cleanup operations. New settings control batch size, maximum batches, and deletion timeout. Timeout parameters flow through the deletion API chain from cleanup manager through transfer service to blob containers. S3BlobContainer gains configurable timeout support instead of hardcoded 30-second timeouts. Changes
Sequence DiagramsequenceDiagram
actor Scheduler
participant RCSCM as RemoteClusterStateCleanupManager
participant RoutingService as InternalRemoteRoutingTableService
participant TransferService as BlobStoreTransferService
participant BlobContainer as BlobContainer
participant S3 as S3BlobContainer
Scheduler->>RCSCM: deleteStaleClusterMetadata(clusterName, UUID, retainCount)
note over RCSCM: Load manifest files
loop For each batch (limited by cleanupMaxBatches)
RCSCM->>RCSCM: Extract next batch (cleanupBatchSize)
alt Batch manifests exceed retainCount
RCSCM->>RCSCM: deleteStalePaths (global metadata)
RCSCM->>TransferService: deleteBlobs(path, files, cleanupTimeout)
TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
BlobContainer->>S3: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
S3->>S3: getFutureValue(future, cleanupTimeout)
S3-->>BlobContainer: deletion complete
BlobContainer-->>TransferService: success
TransferService-->>RCSCM: success
RCSCM->>RoutingService: deleteStaleIndexRoutingPaths(paths, cleanupTimeout)
RoutingService->>TransferService: deleteBlobs(path, files, cleanupTimeout)
TransferService->>BlobContainer: deleteBlobsIgnoringIfNotExists(files, cleanupTimeout)
BlobContainer-->>RoutingService: success
RoutingService-->>RCSCM: success
RCSCM->>RCSCM: deleteStalePaths (manifest files)
else Insufficient manifests
RCSCM->>RCSCM: Skip batch, log warning
end
end
RCSCM->>RCSCM: Update lastCleanupAttemptStateVersion
RCSCM-->>Scheduler: Cleanup complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
❌ Gradle check result for 7cd15b5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 05a8083: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Flaky test - |
|
❌ Gradle check result for ceae963: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Looks like some issues with BwC tests |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #20566 +/- ##
============================================
+ Coverage 73.19% 73.26% +0.06%
- Complexity 71924 72012 +88
============================================
Files 5781 5781
Lines 329292 329393 +101
Branches 47514 47525 +11
============================================
+ Hits 241026 241315 +289
+ Misses 68925 68738 -187
+ Partials 19341 19340 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
|
❌ Gradle check result for 33f7d63: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
Bukhtawar
left a comment
There was a problem hiding this comment.
Thanks for the changes, looks good to me
opensearch-project#20566) * Implementing batched deletions of stale ClusterMetadataManifests in RemoteClusterStateCleanupManager and adjusting the timeouts Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
opensearch-project#20566) * Implementing batched deletions of stale ClusterMetadataManifests in RemoteClusterStateCleanupManager and adjusting the timeouts Signed-off-by: Pranshu Shukla <pranshushukla06@gmail.com>
Description
Fixes remote cluster state cleanup failures that were causing stale metadata pile-ups in remote storage when deletions time out along with fixes mention in tagged GitHub Issue
Issue context (#20564):
RemoteClusterStateCleanupManagerruns every ~5 minutes (configurable) and performs sequential deletions (global metadata → index metadata → ephemeral attrs → manifests). A recent change added a 30s timeout to the S3 “sync” delete path; with large delete sets this can throwIOException, abort the whole cleanup run (single try/catch), and leave later phases undeleted—making the next run even larger and more likely to fail.What this PR does
manifest*blobs to reduce per-call delete size and avoid timeout/payload issues.cluster.remote_store.state.cleanup.batch_sizecluster.remote_store.state.cleanup.max_batcheslastCleanupAttemptStateVersioneven though the cleanup were failing which resulted in next deletes to become no-op if there were no cluster-state changes (or less than 10) and previous deletions failed.Related Issues
Resolves #20564
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.