
[merged segment warmer] Introduce merged segment checkpoint retention to avoid listAll in computeReferencedSegmentsCheckpoint #18890

Open
guojialiang92 wants to merge 30 commits into opensearch-project:main from guojialiang92:dev/optimize-redundant-merged-segment-cleanup

Conversation

@guojialiang92
Contributor

@guojialiang92 guojialiang92 commented Aug 1, 2025

Description

This PR is follow-up work to [18720]. Based on the discussion with @ashking94, I made some optimizations. The main purpose is to avoid the listAll operation in IndexShard#computeReferencedSegmentsCheckpoint.

Main Changes

  • Introduced IndexShard#primaryMergedSegmentCheckpoints for the primary shard to track merged segment checkpoints that have been published for pre-warm. A merged segment checkpoint is added to IndexShard#primaryMergedSegmentCheckpoints before publish_merged_segment.
  • To avoid memory leaks in IndexShard#primaryMergedSegmentCheckpoints, introduced the setting IndexSettings#INDEX_MERGED_SEGMENT_CHECKPOINT_RETENTION_TIME (default 15 minutes) so that expired checkpoints are cleaned up after publish_referenced_segments.
  • Renamed IndexShard#pendingMergedSegmentCheckpoints to IndexShard#replicaMergedSegmentCheckpoints, which the replica shard uses to record pre-copied merged segment checkpoints that have not yet been refreshed.
  • Added MergedSegmentWarmerIT#testPrimaryMergedSegmentCheckpointRetentionTimeout, which constructs a redundant merged segment checkpoint in the primary shard and verifies that it is deleted once the retention time expires.
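As a rough illustration of the retention mechanism above, here is a minimal, hypothetical sketch. The class and method names are invented for illustration only; the real implementation lives in IndexShard and tracks <segmentName, createdTimeStamp> tuples in a concurrent set.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of primary-side checkpoint retention; names are
// invented, not the actual IndexShard API.
class PrimaryMergedSegmentTracker {
    private final Map<String, Long> checkpoints = new ConcurrentHashMap<>();
    private final long retentionMillis;

    PrimaryMergedSegmentTracker(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Record the checkpoint before publish_merged_segment.
    void addBeforePublish(String segmentName, long nowMillis) {
        checkpoints.put(segmentName, nowMillis);
    }

    // Prune expired entries after publish_referenced_segments so the
    // set cannot grow without bound.
    void removeExpired(long nowMillis) {
        checkpoints.entrySet().removeIf(e -> nowMillis - e.getValue() > retentionMillis);
    }

    Set<String> trackedSegmentNames() {
        return checkpoints.keySet();
    }
}
```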

Concurrent Control

After a relatively long period of verification in the production environment, we have also fixed some issues. To enhance the robustness of the new logic, I introduced concurrency control between merged segment cleanup and segment replication.

We replaced the higher-overhead listAll operation with primaryMergedSegmentCheckpoints plus the refreshed segments recorded in memory. However, entries in primaryMergedSegmentCheckpoints are deleted based on the expiration time. Consider a merged segment checkpoint in primaryMergedSegmentCheckpoints that has not been refreshed but has already expired: it will not be included in the ReferencedSegmentsCheckpoint, yet once it is refreshed, segment replication will reference it. Since the merged segment cleanup task and the segment replication task run concurrently, a merged segment referenced by segment replication could be incorrectly deleted.

This occurs when the merged segment warmer takes longer than the expiration time. Although increasing the expiration time reduces the probability of occurrence, a more robust design is needed.

Through synchronized access, we ensure that a merged segment referenced by the current segment replication task will never be deleted by the merged segment cleanup task. If the merged segment is deleted by the cleanup task before the segment replication task starts, the files will simply be sent to the replica shard again through segment replication.
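The synchronization described above can be sketched as follows. This is a hypothetical simplification: the real code synchronizes on cleanupReplicaMergedSegmentMutex inside IndexShard and compares metadata-map keys, while the names below are invented for illustration.

```java
import java.util.Set;

// Hypothetical simplification: cleanup and segment replication coordinate
// through one mutex, and a merged segment is deleted only if none of its
// files appear in the in-flight replication checkpoint.
class ReplicaCleanupGate {
    private final Object mutex = new Object();
    private Set<String> currentReplicationFiles; // null when no replication is in flight

    void onReplicationStart(Set<String> files) {
        synchronized (mutex) {
            currentReplicationFiles = files;
        }
    }

    void onReplicationDone() {
        synchronized (mutex) {
            currentReplicationFiles = null;
        }
    }

    // Returns true (and runs deleteFiles) only when no in-flight
    // replication references any of the merged segment's files.
    boolean tryDelete(Set<String> mergedSegmentFiles, Runnable deleteFiles) {
        synchronized (mutex) {
            boolean referenced = currentReplicationFiles != null
                && mergedSegmentFiles.stream().anyMatch(currentReplicationFiles::contains);
            if (referenced) {
                return false; // replication still needs these files
            }
            deleteFiles.run();
            return true;
        }
    }
}
```

If the cleanup wins the race and deletes first, the subsequent replication simply re-copies the files, which matches the fallback behavior described above.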

Summary

The correctness of the task that cleans up redundant merged segments in the replica shard depends on obtaining an accurate ReferencedSegmentsCheckpoint from the primary shard. A ReferencedSegmentsCheckpoint consists of two parts: primaryMergedSegmentCheckpoints and the refreshed segments.

After the replica node receives a ReferencedSegmentsCheckpoint, it iterates through replicaMergedSegmentCheckpoints. If the ReferencedSegmentsCheckpoint does not contain a MergedSegmentCheckpoint and is ahead of it, that MergedSegmentCheckpoint will be deleted.
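The deletion rule above can be expressed as a small predicate. This is illustrative only: the "version" comparison stands in for the "ReferencedSegmentsCheckpoint is ahead of MergedSegmentCheckpoint" check, and the real classes compare checkpoint objects rather than longs.

```java
import java.util.Set;

// Illustrative predicate for the replica-side deletion rule; names and the
// version comparison are simplified stand-ins for the real checkpoint types.
class RedundantMergedSegmentRule {
    static boolean shouldDelete(Set<String> referencedSegmentNames,
                                long referencedVersion,
                                String mergedSegmentName,
                                long mergedVersion) {
        boolean notReferenced = !referencedSegmentNames.contains(mergedSegmentName);
        boolean referencedIsAhead = referencedVersion > mergedVersion;
        return notReferenced && referencedIsAhead;
    }
}
```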

For a clearer explanation, I will elaborate on different scenarios.

  1. The merged segment checkpoint in primaryMergedSegmentCheckpoints has not yet expired. ReferencedSegmentsCheckpoint will contain it, and the merged segment in the replicas will not be deleted.
  2. The merged segment checkpoint has been refreshed and has already expired. ReferencedSegmentsCheckpoint will still contain it (via the refreshed segments), and the merged segment in the replicas will not be deleted.
  3. The merged segment checkpoint has not been refreshed and has already expired. ReferencedSegmentsCheckpoint will not contain it; the concurrency control described above prevents incorrectly deleting a merged segment referenced by an in-flight segment replication.
  4. The merged segment checkpoint has already been removed because the segment was merged again, and it has expired. ReferencedSegmentsCheckpoint will not contain it, and the replica shard can safely delete the redundant merged segment.

Related Issues

Resolves #[18845]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • New Features

    • Configurable merged-segment checkpoint retention time (default 15 minutes) exposed as a cluster setting.
  • Improvements

    • Distinct primary vs. replica merged-segment tracking with expiration/pruning for primary checkpoints.
    • Improved cleanup and synchronization of merged-segment metadata and files; replica state cleared on shard close/promotion.
    • Index shard now tracks current segment replication checkpoint for coordination.
  • Tests

    • Added/renamed tests covering retention timeout and primary–replica file synchronization.


Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 0c06302: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 253c15f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions bot commented Aug 2, 2025

❌ Gradle check result for 1c90ffc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Contributor

❌ Gradle check result for 0764133: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/optimize-redundant-merged-segment-cleanup branch from 0764133 to f2e0d8a on December 12, 2025 at 16:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

399-409: Checkpoint-tracking state split looks reasonable; consider documenting thread-safety expectations.

The primary/replica split plus a dedicated mutex for “replica cleanup vs in-flight replication” is a good direction. One small suggestion: add a brief comment clarifying which operations must synchronize on cleanupReplicaMergedSegmentMutex (reads/writes of currentSegmentReplicationCheckpoint, and only that), since replicaMergedSegmentCheckpoints is concurrently mutated outside the lock (by design).


1814-1830: Replica refreshed-segment cleanup: LGTM; tweak log message wording.

The refreshedSegmentNames set + removeIf is clear and keeps the set in sync after refresh. The log still says “pending merged segments” even though the structure is now replicaMergedSegmentCheckpoints; consider renaming the message for easier debugging.


1904-1909: Retention cleanup implementation is fine once time units are confirmed.

Using removeIf on the concurrent set is straightforward and keeps the “no listAll” goal intact. After unit verification (see earlier comment), consider renaming locals to reflect the actual unit (e.g., retentionNanos, nowEpochNanos, createdEpochNanos) to avoid future regressions.


1911-1963: Public API shape looks safe; consider minimizing surface area if truly “tests only”.

  • Deprecated getter returning Collections.emptySet() is the safe choice.
  • Collections.unmodifiableSet(...) prevents external mutation (good), but these getters are still public. If they’re strictly test-only, consider @VisibleForTesting + package-private (where feasible) to reduce API surface.

Also: clearReplicaMergedSegmentState() correctly synchronizes (fixes visibility/order).


2071-2085: Referenced segments union logic is clear; minor micro-optimization available.

Logic is correct: include both refreshed segments and “published-for-prewarm” segment names. Minor: segmentInfosGatedCloseable.get() is invoked multiple times; caching to a local SegmentInfos si = ... would reduce repeated calls and reads better.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0764133 and f2e0d8a.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: Analyze (java)
🔇 Additional comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

1871-1902: Deletion gate with currentSegmentReplicationCheckpoint is good; verify metadata-map intersection is sufficient.

The synchronized “don’t delete if any file is referenced by the current replication checkpoint” is the right shape for preventing races with concurrent segment-replication tasks. Please sanity-check that:

  • ReplicationCheckpoint#getMetadataMap() keys overlap the same filenames as MergedSegmentCheckpoint#getMetadataMap() (i.e., the intersection test is meaningful), and
  • all in-flight references are captured by currentSegmentReplicationCheckpoint at the time deletion can happen.

If either assumption breaks, this could delete files still needed by an active replication.


2055-2062: finally cleanup on publish: good resiliency.

Ensuring removeExpiredPrimaryMergedSegmentCheckpoints() runs even on publish failures matches the “avoid leaks” objective and avoids coupling cleanup to successful publish.


2088-2098: Rollback on publish failure: good.

Adding to primaryMergedSegmentCheckpoints before publishing and removing on failure prevents stale entries from influencing referenced-segment computation for the full retention window.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for f2e0d8a: SUCCESS

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (2)

1814-1830: Minor: log message still says “pending merged segments” after rename.
The message at Lines 1825-1826 reads like the old terminology; consider updating it to “replica merged segment checkpoints” (or similar) to match the new model.


1871-1902: Prevent TOCTOU deletion: re-check membership under mutex before deleting files.
pendingDeleteCheckpoints is built outside cleanupReplicaMergedSegmentMutex, so a checkpoint can be removed/updated concurrently (e.g., by refresh/finalize paths) before you call store.deleteQuiet(...). Re-checking replicaMergedSegmentCheckpoints.contains(...) under the mutex makes the deletion decision robust against concurrent state changes.

Proposed diff:

 synchronized (cleanupReplicaMergedSegmentMutex) {
     for (MergedSegmentCheckpoint mergedSegmentCheckpoint : pendingDeleteCheckpoints) {
+        if (replicaMergedSegmentCheckpoints.contains(mergedSegmentCheckpoint) == false) {
+            continue;
+        }
         if ((currentSegmentReplicationCheckpoint == null)
             || mergedSegmentCheckpoint.getMetadataMap()
                 .keySet()
                 .stream()
                 .noneMatch(currentSegmentReplicationCheckpoint.getMetadataMap()::containsKey)) {
             logger.trace("replica shard remove redundant pending merge segment {}", mergedSegmentCheckpoint.getSegmentName());
             store.deleteQuiet(mergedSegmentCheckpoint.getMetadataMap().keySet().toArray(new String[0]));
             replicaMergedSegmentCheckpoints.remove(mergedSegmentCheckpoint);
         } else {
             logger.trace(
                 "replica shard can not remove redundant pending merge segment {}",
                 mergedSegmentCheckpoint.getSegmentName()
             );
         }
     }
 }
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

99-101: Fix/verify retention timestamp units (currently looks inconsistent: “nowNanos” vs DateUtils.toLong(Instant.now())).
Right now removeExpiredPrimaryMergedSegmentCheckpoints() subtracts nowNanos - createdTimeStamp and compares against retentionTimeInNanos, but the correctness hinges entirely on DateUtils.toLong(Instant) and MergedSegmentCheckpoint#getCreatedTimeStamp() having the same unit and time source. This was previously raised and still seems unresolved.

Suggested direction once units are confirmed:

  • If timestamps are epoch-millis: compare using retentionTime.getMillis() and rename vars to *Millis.
  • If timestamps are epoch-nanos: ensure both are epoch-nanos (and rename to *EpochNanos).
  • If timestamps are System.nanoTime(): do not mix with Instant.now(); use the same monotonic source for both.

Verification script to pin this down:

#!/bin/bash
set -euo pipefail

# What unit does DateUtils.toLong(Instant) return?
rg -n --hidden --no-mmap 'class DateUtils\b|long\s+toLong\s*\(\s*Instant' -S .

# What unit/source is MergedSegmentCheckpoint createdTimeStamp?
rg -n --hidden --no-mmap 'class MergedSegmentCheckpoint\b|getCreatedTimeStamp\s*\(|createdTimeStamp' -S server/src/main/java

# Where is createdTimeStamp assigned (Instant.now/System.currentTimeMillis/System.nanoTime)?
rg -n --hidden --no-mmap 'getCreatedTimeStamp\s*\(|createdTimeStamp\s*=|Instant\.now|System\.currentTimeMillis|System\.nanoTime' -S server/src/main/java

Also applies to: 229-231, 1904-1909, 1923-1925

🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

399-409: Make nullability/guardrails explicit for replica-tracking state.
Consider marking currentSegmentReplicationCheckpoint as @Nullable (it’s explicitly set to null) to make the contract clearer at call sites and during future refactors.


1919-1949: Nice: primary/replica checkpoint APIs are clear; consider returning snapshots for test getters.
Collections.unmodifiableSet(primaryMergedSegmentCheckpoints) still exposes a live concurrent view (fine), but returning an immutable snapshot can reduce test flakiness due to concurrent mutation.


2071-2085: Behavior change looks correct: referenced segments now include published merged-segment names.
This avoids listAll while still keeping referenced names conservative. Minor nit: you can avoid intermediate Collectors.toSet() by adding via forEach(segmentNames::add) to reduce allocations.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f2e0d8a and e78533c.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

1904-1909: Retention cleanup logic is well-placed, but depends on the timestamp-unit fix.
Calling this from publishReferencedSegments() (finally) is a good lifecycle hook; just ensure the time arithmetic is correct per the units/source verification.


1911-1917: Good BWC behavior: deprecated getter returns empty set (not null).
This keeps deprecated API usage safe for any lingering callers.


1951-1963: Mutex usage is consistent for replica state resets/updates.
setCurrentSegmentReplicationCheckpoint() and clearReplicaMergedSegmentState() correctly synchronize on the same mutex used by deletion logic.


2055-2062: Good: retention cleanup runs even if publish fails (finally).
This matches the “avoid memory leaks” goal and preserves exception behavior.


2088-2098: Good rollback on publish failure to avoid stale primary checkpoint state.
This prevents retention-based “ghost” references after publisher exceptions.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for e78533c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 8a513f5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 force-pushed the dev/optimize-redundant-merged-segment-cleanup branch from 8a513f5 to 7ff4838 on December 15, 2025 at 05:53
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

1905-1910: Clarify/lock in timestamp units for retention math

removeExpiredPrimaryMergedSegmentCheckpoints() assumes that:

  • nowNanos = DateUtils.toLong(Instant.now()), and
  • c.v2() (from primaryMergedSegmentCheckpoints)

are in the same time unit suitable for Duration.ofNanos(nowNanos - c.v2()).

The comment above the field mentions createdTimeStampInNanos, which suggests these longs represent nanoseconds, but that contract currently lives only in comments and in the implementation of MergedSegmentCheckpoint#getCreatedTimeStamp() / DateUtils.toLong(Instant).

To prevent subtle bugs if either side ever changes representation, I’d suggest:

  • Renaming the variable/tuple component to something like createdTimeNanos and nowNanosSinceEpoch (or similar) to make units explicit, and/or
  • Adding a short Javadoc on getCreatedTimeStamp() explicitly stating its unit, and referencing that here.

This is mainly about future maintainability; behavior looks consistent with the current comments and tests.

You can double‑check the unit consistency with a quick grep:

#!/bin/bash
set -euo pipefail

# Inspect how createdTimeStamp is produced and documented
rg -n "class MergedSegmentCheckpoint\b" -S .
rg -n "getCreatedTimeStamp\s*\(" -S .
rg -n "DateUtils\.toLong\s*\(\s*Instant" -S .

# Sanity-check that tests cover expiration behavior
rg -n "MergedSegmentCheckpointRetention" -S .
🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (3)

1873-1903: Replica redundant-merge cleanup logic is reasonable; small micro‑cleanup possible

The two‑phase approach (select candidates from replicaMergedSegmentCheckpoints, then re‑check against currentSegmentReplicationCheckpoint under cleanupReplicaMergedSegmentMutex before store.deleteQuiet(...)) correctly avoids races with in‑flight segment replication and ensures visibility of currentSegmentReplicationCheckpoint. The use of a concurrent set means iteration and removals are safe.

If you ever want to shave a bit of overhead, you could cache mergedSegmentCheckpoint.getMetadataMap().keySet() into a local Set<String> per checkpoint so you don’t rebuild the key set and array multiple times, but this is a micro‑optimization and not required.


1912-1950: Deprecated pending-merge API behavior and new checkpoint helpers are safe; consider tightening test API

Turning addPendingMergeSegmentCheckpoint into a no‑op and getPendingMergedSegmentCheckpoints into Collections.emptySet() is a safe BWC strategy for any lingering callers, avoiding NPEs while steering usage to the new primary/replica‑specific APIs.

The new helpers (addPrimaryMergedSegmentCheckpoint, removePrimaryMergedSegmentCheckpoint, addReplicaMergedSegmentCheckpoint, and their getters) are straightforward and correctly use unmodifiable views to avoid external mutation.

If these getters are only used in tests, you might optionally narrow their visibility (e.g., package‑private + @VisibleForTesting) to avoid growing the public surface area of IndexShard, and, if you ever want a stable snapshot instead of a live view, wrap in new HashSet<>(...) before unmodifiableSet(...). Not required, but would tighten the API.


2073-2085: Including primary merged checkpoints in referenced-segment computation looks correct; minor allocation nit

Building segmentNames from:

  • the live SegmentInfos snapshot, plus
  • primaryMergedSegmentCheckpoints.stream().map(Tuple::v1)

correctly ensures that pre‑warmed (but not yet fully refreshed) merged segments remain “referenced” from the primary’s point of view. That matches the retention and pre‑warm design described in the PR.

Very minor: you could avoid a couple of intermediate Collectors.toSet() allocations by streaming directly into segmentNames (e.g., infos.asList().forEach(...); primaryMergedSegmentCheckpoints.forEach(...);), but that’s purely cosmetic.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8a513f5 and 7ff4838.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (5)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (5)

400-410: Primary/replica checkpoint state separation and locking look sound

The split between primaryMergedSegmentCheckpoints and replicaMergedSegmentCheckpoints, plus the dedicated cleanupReplicaMergedSegmentMutex for currentSegmentReplicationCheckpoint, gives a clear concurrency story: primary-only state relies on a concurrent set, while replica cleanup coordinates via a shared mutex. I don’t see correctness issues with this layout.


1815-1831: Replica checkpoint pruning on finalizeReplication is correct

Building refreshedSegmentNames from infos and then doing replicaMergedSegmentCheckpoints.removeIf(...) is a straightforward way to drop any pre‑copied merged checkpoints that have now been refreshed. This keeps replica state bounded without introducing extra locking, and is safe given the concurrent set semantics.


1952-1963: Replica merged-segment state clearing now correctly participates in the mutex protocol

Wrapping both setCurrentSegmentReplicationCheckpoint(...) and clearReplicaMergedSegmentState() in synchronized (cleanupReplicaMergedSegmentMutex) brings these writes into the same critical section as the deletion logic in cleanupRedundantPendingMergeSegment. This removes the earlier data‑race/visibility risk on currentSegmentReplicationCheckpoint and ensures consistent behavior under concurrency.


2056-2062: Always running primary checkpoint retention after referenced-segment publish is a good robustness improvement

Wrapping referencedSegmentsPublisher.publish(...) in a try with removeExpiredPrimaryMergedSegmentCheckpoints() in finally guarantees retention cleanup runs even if publish throws, which aligns with the goal of avoiding memory leaks from stale primary checkpoints. Since the cleanup method is pure in‑memory logic and cannot throw IOException, this doesn’t alter the exception surface of publishReferencedSegments().


2089-2098: Merged-segment publish now correctly tracks primary checkpoints and rolls back on failure

Computing the MergedSegmentCheckpoint, adding it to primaryMergedSegmentCheckpoints before publishing, and then removing it in the catch path if mergedSegmentPublisher.publish(...) throws prevents stale primary entries from lingering on failed publishes. This keeps the in‑memory checkpoint set consistent while still allowing retention cleanup to prune old, successfully‑published entries.

@github-actions
Contributor

❕ Gradle check result for 7ff4838: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)

1905-1911: Retention logic for primary merged-segment checkpoints

Assuming MergedSegmentCheckpoint#getCreatedTimeStamp() and DateUtils.toLong(Instant.now()) are both in nanoseconds since the epoch (as the comment states), nowNanosSinceEpoch - c.v2() with Duration.ofNanos(...).toMillis() is a correct way to compare against a millisecond retention value. If those units ever change, this is the key place that would need updating.
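Assuming both timestamps really are nanoseconds since the epoch, the comparison reduces to something like this hypothetical helper (the class and method names are invented; only the Duration.ofNanos arithmetic mirrors the code under review):

```java
import java.time.Duration;

// Hypothetical helper mirroring the retention comparison: both timestamps
// are assumed to be nanoseconds since the epoch, and the retention window
// is expressed in milliseconds.
class RetentionMath {
    static boolean isExpired(long nowEpochNanos, long createdEpochNanos, long retentionMillis) {
        return Duration.ofNanos(nowEpochNanos - createdEpochNanos).toMillis() > retentionMillis;
    }
}
```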

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0bf1c09 and a11f367.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
libs/common/src/main/java/org/opensearch/common/util/set/Sets.java (1)
  • Sets (57-185)
🔇 Additional comments (7)
server/src/main/java/org/opensearch/index/shard/IndexShard.java (7)

100-111: New time imports and primary/replica checkpoint fields look consistent

Use of DateUtils, Duration, and Instant for retention plus Sets.newConcurrentHashSet() and a dedicated cleanupReplicaMergedSegmentMutex give you thread-safe containers and a clear separation between primary and replica merged-segment state. No issues spotted here.

Also applies to: 230-231, 400-409
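The state split described above can be sketched as follows. This is not the PR's code: `ConcurrentHashMap.newKeySet()` stands in for OpenSearch's `Sets.newConcurrentHashSet()`, and a record stands in for the checkpoint tuple type.

```java
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MergedSegmentState {
    // Stand-in for the <segmentName, createdTimeStamp> tuple tracked per checkpoint.
    record PrimaryCheckpoint(String segmentName, long createdNanosSinceEpoch) {}

    // Thread-safe containers; clear separation of primary vs replica state.
    final Set<PrimaryCheckpoint> primaryMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();
    final Set<String> replicaMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();
    final Object cleanupReplicaMergedSegmentMutex = new Object();

    void trackPrimaryCheckpoint(String segmentName) {
        Instant now = Instant.now();
        // Epoch nanoseconds, mirroring DateUtils.toLong(Instant.now()) in the PR.
        long nanos = now.getEpochSecond() * 1_000_000_000L + now.getNano();
        primaryMergedSegmentCheckpoints.add(new PrimaryCheckpoint(segmentName, nanos));
    }
}
```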


1815-1831: Batch removal of refreshed replica merged segments is safe and efficient

Collecting refreshedSegmentNames into a HashSet and then using replicaMergedSegmentCheckpoints.removeIf(...) is both clear and safe with the concurrent set; it avoids per-segment mutation while iterating infos. Behavior matches the intent of dropping pending merged segments once they’re visible after replication.
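The batch-removal pattern looks roughly like this sketch; the names are illustrative stand-ins for the PR's fields:

```java
import java.util.Set;

public class BatchRemoval {
    // Collect refreshed names first, then drop matching pending checkpoints in
    // one removeIf pass. removeIf is safe on a ConcurrentHashMap-backed key set;
    // nothing mutates the set while the SegmentInfos are being iterated.
    static Set<String> dropRefreshed(Set<String> pendingCheckpoints, Set<String> refreshedSegmentNames) {
        pendingCheckpoints.removeIf(refreshedSegmentNames::contains);
        return pendingCheckpoints;
    }
}
```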


1874-1902: Replica cleanup now correctly guards against races with in‑flight segment replication

The two-phase approach (build pendingDeleteCheckpoints first, then re-check under cleanupReplicaMergedSegmentMutex against currentSegmentReplicationCheckpoint and only delete when there is no metadata overlap) properly prevents deleting files required by an ongoing segment replication. The locking discipline is consistent across setCurrentSegmentReplicationCheckpoint, clearReplicaMergedSegmentState, and this method.
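The second phase of that approach can be sketched as below. All names are illustrative; the real method checks metadata overlap against `currentSegmentReplicationCheckpoint` and deletes files rather than collecting names.

```java
import java.util.HashSet;
import java.util.Set;

public class TwoPhaseCleanup {
    private final Object cleanupReplicaMergedSegmentMutex = new Object();
    private volatile Set<String> currentReplicationFiles = Set.of();

    // Phase two: re-check under the mutex and only delete when the pending set
    // does not intersect the files an in-flight replication still needs.
    void cleanup(Set<String> pendingDeleteCheckpoints, Set<String> deleted) {
        synchronized (cleanupReplicaMergedSegmentMutex) {
            Set<String> overlap = new HashSet<>(pendingDeleteCheckpoints);
            overlap.retainAll(currentReplicationFiles);
            if (overlap.isEmpty()) {
                deleted.addAll(pendingDeleteCheckpoints); // stand-in for file deletion
            } // otherwise skip: an ongoing segment replication requires these files
        }
    }

    void setCurrentReplication(Set<String> files) {
        synchronized (cleanupReplicaMergedSegmentMutex) { // same lock as cleanup
            currentReplicationFiles = files;
        }
    }
}
```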


1913-1940: Deprecated pending-merged API and primary checkpoint helpers are safe

Turning addPendingMergeSegmentCheckpoint into a no-op and getPendingMergedSegmentCheckpoints into Collections.emptySet() is a safe deprecation path that avoids NPEs for any lingering callers. The new add/remove/getPrimaryMergedSegmentCheckpoint* helpers encapsulate the <segmentName, createdTimeStamp> tuples cleanly and expose an unmodifiable view for tests.
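The deprecation shape is roughly the following; signatures are simplified stand-ins for the `IndexShard` methods, not the actual ones.

```java
import java.util.Collections;
import java.util.Set;

public class PendingMergedApi {
    @Deprecated
    public void addPendingMergeSegmentCheckpoint(String segmentName) {
        // intentionally a no-op; the primary tracking set is used instead
    }

    @Deprecated
    public Set<String> getPendingMergedSegmentCheckpoints() {
        return Collections.emptySet(); // never null, so lingering callers avoid NPEs
    }
}
```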


1942-1964: Replica merged-segment state helpers and mutex usage look correct

addReplicaMergedSegmentCheckpoint writes into a concurrent set, and getReplicaMergedSegmentCheckpoints() returns an unmodifiable view for tests. setCurrentSegmentReplicationCheckpoint and clearReplicaMergedSegmentState now both synchronize on cleanupReplicaMergedSegmentMutex, matching the mutex used in cleanupRedundantPendingMergeSegment and fixing the earlier data-race/visibility concern.


2059-2063: Referenced-segments publishing now accounts for pre‑warmed primary segments and always runs retention

Wrapping referencedSegmentsPublisher.publish(this, computeReferencedSegmentsCheckpoint()) in try/finally ensures removeExpiredPrimaryMergedSegmentCheckpoints() runs even if publish throws, which is important for avoiding leaks. Adding both on-disk segment names (from SegmentInfos) and names from primaryMergedSegmentCheckpoints into segmentNames correctly informs replicas about still-referenced pre‑warmed segments.

Also applies to: 2075-2078
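The control flow being praised here reduces to a try/finally sketch; `Publisher` and the `removeExpired` runnable are hypothetical stand-ins for `referencedSegmentsPublisher.publish(...)` and `removeExpiredPrimaryMergedSegmentCheckpoints()`:

```java
public class PublishWithRetention {
    interface Publisher {
        void publish() throws Exception;
    }

    static boolean publishAndClean(Publisher publisher, Runnable removeExpired) {
        try {
            publisher.publish();
            return true;
        } catch (Exception e) {
            return false; // publish failed, but cleanup below still runs
        } finally {
            removeExpired.run(); // always executes, preventing checkpoint leaks
        }
    }
}
```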


2092-2099: Rollback on failed merged-segment publish prevents stale primary checkpoints

Computing the MergedSegmentCheckpoint, adding it to primaryMergedSegmentCheckpoints, and then rolling it back in the catch block if mergedSegmentPublisher.publish(...) fails avoids leaving a stale primary checkpoint that could incorrectly influence computeReferencedSegmentsCheckpoint() until retention cleanup. This behavior aligns well with the new tracking model.
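The rollback pattern, sketched with illustrative stand-ins for the PR's field and publisher: the checkpoint is tracked before publish and removed again if publish throws, so a failed publish cannot leave a stale entry.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PublishRollback {
    final Set<String> primaryMergedSegmentCheckpoints = ConcurrentHashMap.newKeySet();

    void publishMergedSegment(String segmentName, Runnable publish) {
        primaryMergedSegmentCheckpoints.add(segmentName); // track before publish
        try {
            publish.run();
        } catch (RuntimeException e) {
            primaryMergedSegmentCheckpoints.remove(segmentName); // roll back tracking
        }
    }
}
```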

@github-actions
Contributor

✅ Gradle check result for a11f367: SUCCESS

@guojialiang92
Contributor Author

Hi @ashking94,
I have implemented improvements based on all the suggestions you provided. Over the past few months we have also fixed a concurrency control issue in the production environment, which I have added to the PR description.

Hope you can continue to help review this PR, thank you. :)

@guojialiang92
Contributor Author

@ashking94
Just to bump this up.
